Population Structures and Fertility

vivarium.public_health provides several components for creating and managing simulated populations. This tutorial demonstrates the minimal configuration required for each approach.

Overview

There are two categories of population components:

Initial population components create the starting set of simulants when the simulation begins:

  • BasePopulation - the standard component that samples simulants from demographic data.

  • ScaledPopulation - a variant that rescales the demographic data before sampling.

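Fertility components add new simulants during the simulation:

  • FertilityDeterministic - adds a fixed number of new simulants each year.

  • FertilityCrudeBirthRate - adds births at the population level using a crude birth rate.

  • FertilityAgeSpecificRates - models births at the individual level using age-specific fertility rates.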

Note

BasePopulation includes three sub-components: Mortality, AgeOutSimulants, and Disability. You do not need to add these.

Common Setup

Population components load their data through the data_sources configuration pattern (see Data sources). The Expected Data Layout section shows the key names and column layouts for every data key so that you know exactly what format your data should have.

Every code example in this tutorial uses two helpers imported from vivarium.public_health._example_data:

from vivarium.public_health._example_data import BASE_PLUGINS, make_base_config

# BASE_PLUGINS configures the data plugin to serve example data from memory.
# Pass it as plugin_configuration to InteractiveContext.
base_plugins = BASE_PLUGINS

# make_base_config() returns a configuration with sensible defaults for
# time range, step size, and randomness key columns.
config = make_base_config()

Expected Data Layout

This section documents the key name and column layout that each population component expects. Some components also support a data_sources configuration pattern that lets you override individual keys with a scalar, DataFrame, or callable (see Data sources).

Data keys

The table below lists every data key used by the population components. All of them can be overridden in the data_sources section of the configuration (see Data sources); the data key shown is simply the default.

| Key | Index columns | Value columns | Used by | Configuration override |
|---|---|---|---|---|
| population.structure | age, sex, year, location | value (population count) | BasePopulation, ScaledPopulation | population.data_sources.population_structure |
| population.location | (scalar) | A string (e.g. "Kenya") | BasePopulation | population.data_sources.location |
| cause.all_causes.cause_specific_mortality_rate | age, sex, year | value (rate) | Mortality | mortality.data_sources.all_cause_mortality_rate |
| population.theoretical_minimum_risk_life_expectancy | age | value (years of remaining life) | Mortality | mortality.data_sources.life_expectancy |
| covariate.live_births_by_sex.estimate | year, sex, parameter | value | FertilityCrudeBirthRate | fertility.data_sources.live_births_by_sex |
| covariate.age_specific_fertility_rate.estimate | age, sex, year, parameter | value | FertilityAgeSpecificRates | fertility_age_specific_rates.data_sources.age_specific_fertility_rate |
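
As a quick sanity check on the layouts listed above, the sketch below expands each key's index columns into concrete column names and reports which ones a table is missing. It assumes, based on the example outputs on this page, that the age and year indexes are stored as age_start/age_end and year_start/year_end pairs. EXPECTED_INDEX, expected_columns, and missing_columns are hypothetical names used for illustration; they are not part of vivarium.public_health. The scalar population.location key is left out because it has no columns.

```python
# Hypothetical helpers for illustration; not part of vivarium.public_health.
# Assumption: "age" and "year" indexes are stored as <name>_start / <name>_end
# interval columns, as in the example outputs on this page.
EXPECTED_INDEX = {
    "population.structure": ["age", "sex", "year", "location"],
    "cause.all_causes.cause_specific_mortality_rate": ["age", "sex", "year"],
    "population.theoretical_minimum_risk_life_expectancy": ["age"],
    "covariate.live_births_by_sex.estimate": ["year", "sex", "parameter"],
    "covariate.age_specific_fertility_rate.estimate": ["age", "sex", "year", "parameter"],
}
INTERVAL_COLUMNS = {"age", "year"}


def expected_columns(key):
    """Expand index names into concrete column names, plus ``value``."""
    columns = []
    for name in EXPECTED_INDEX[key]:
        if name in INTERVAL_COLUMNS:
            columns += [f"{name}_start", f"{name}_end"]
        else:
            columns.append(name)
    return columns + ["value"]


def missing_columns(key, columns):
    """Return the expected columns that are absent from ``columns``."""
    present = set(columns)
    return [c for c in expected_columns(key) if c not in present]


tmrle_columns = ["age_start", "age_end", "value"]
print(missing_columns("population.theoretical_minimum_risk_life_expectancy", tmrle_columns))
print(missing_columns("population.structure", tmrle_columns))
```

An empty list means the table has every column the sketch expects for that key.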

Data sources

Some components support a data_sources configuration pattern that lets you override individual data keys. This is especially useful during development and for simple tutorial examples like the ones on this page. Components that support it declare their data needs in configuration_defaults; by default, each entry points to the corresponding data key string. You can override any of them with:

  • Scalar (int or float) - broadcast a constant value to all simulants.

  • DataFrame - use the DataFrame directly.

  • Callable - call the function at setup time to produce the data.

  • Data key (string) - load a different key from the data plugin.

For example, Mortality declares three configurable data sources:

# Default configuration (loads from the data plugin):
mortality:
  data_sources:
    all_cause_mortality_rate: "cause.all_causes.cause_specific_mortality_rate"
    life_expectancy: "population.theoretical_minimum_risk_life_expectancy"
    unmodeled_cause_specific_mortality_rate: <internal method>

Note

The unmodeled_cause_specific_mortality_rate default is shown as <internal method> because it is a bound Python method that cannot be expressed in YAML.

Any of these can be overridden in the simulation configuration:

# Override with a scalar - no data key lookup needed:
configuration:
  mortality:
    data_sources:
      all_cause_mortality_rate: 0.01
      life_expectancy: 80.0
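
The dispatch among the four override kinds listed above could be sketched roughly as follows. This is an illustration of the idea, not the library's implementation: resolve_data_source and load_key are hypothetical names, the callable is assumed to take no arguments, and every string is treated as a data key (so literal string values such as a location name would need separate handling).

```python
# Illustrative only: a hypothetical resolver, not the library's implementation.
def resolve_data_source(source, load_key):
    """Turn a data_sources entry into data.

    ``load_key`` stands in for the data plugin's loader (an assumption).
    """
    if isinstance(source, str):
        # Data key: look it up through the data plugin.
        return load_key(source)
    if callable(source):
        # Callable: invoke it at setup time to produce the data.
        # Assumption: the callable needs no arguments.
        return source()
    # Scalar or table-like object: use it as given.
    return source


# A plain dict plays the role of the in-memory data plugin here.
fake_artifact = {"cause.all_causes.cause_specific_mortality_rate": [0.01, 0.02]}
loader = fake_artifact.__getitem__

print(resolve_data_source("cause.all_causes.cause_specific_mortality_rate", loader))
print(resolve_data_source(lambda: 80.0, loader))
print(resolve_data_source(0.01, loader))
```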

The component sections below show the first few rows of the data each component expects, so you can see the concrete layout.

BasePopulation

BasePopulation is the standard way to create an initial population. It loads a population structure and samples simulants whose age, sex, and location distributions match the source data.
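
To make "samples simulants whose distributions match the source data" concrete, here is a minimal standard-library sketch of count-weighted sampling from demographic cells. The counts are copied from the 1990 example rows shown later in this section. sample_simulants is a hypothetical helper, and drawing ages uniformly within a bin is an assumption; the component's actual sampling may differ.

```python
import random
from collections import Counter

# (age_start, age_end, sex) -> population count, from the 1990 example rows.
cells = {
    (0.0, 0.019178, "Male"): 1.917808,
    (0.0, 0.019178, "Female"): 1.917808,
    (0.019178, 0.076712, "Male"): 5.753425,
    (0.019178, 0.076712, "Female"): 5.753425,
    (0.076712, 1.0, "Male"): 92.328767,
    (0.076712, 1.0, "Female"): 92.328767,
}


def sample_simulants(cells, population_size, seed=0):
    """Hypothetical helper: draw simulants with probability proportional to cell count."""
    rng = random.Random(seed)
    keys = list(cells)
    weights = [cells[key] for key in keys]
    chosen = rng.choices(keys, weights=weights, k=population_size)
    simulants = []
    for age_start, age_end, sex in chosen:
        # Assumption: ages are spread uniformly within the chosen bin.
        simulants.append({"age": rng.uniform(age_start, age_end), "sex": sex})
    return simulants


simulants = sample_simulants(cells, population_size=5_000)
print(Counter(s["sex"] for s in simulants))
```

Because the two sexes carry equal counts in every bin, the sampled sex split should be close to even, and most sampled ages should fall in the widest bin.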

Data consumed by BasePopulation

BasePopulation and its sub-components use the following data. The examples below show the expected column layout. The data builders come from _example_data.

from vivarium.public_health._example_data import (
    population_structure,
    theoretical_minimum_risk_life_expectancy,
)

# population.structure - population counts per demographic cell.
pop_structure = population_structure()
print(pop_structure.query("year_start == 1990").head(6).to_string(index=False))
 age_start  age_end    sex  year_start  year_end location     value
  0.000000 0.019178   Male        1990      1991    Kenya  1.917808
  0.000000 0.019178 Female        1990      1991    Kenya  1.917808
  0.019178 0.076712   Male        1990      1991    Kenya  5.753425
  0.019178 0.076712 Female        1990      1991    Kenya  5.753425
  0.076712 1.000000   Male        1990      1991    Kenya 92.328767
  0.076712 1.000000 Female        1990      1991    Kenya 92.328767
# population.location - a scalar string identifying the simulated location.
# In the example data this is the string "Kenya".

# population.theoretical_minimum_risk_life_expectancy - remaining life
# expectancy by age, used by the Mortality sub-component to compute years
# of life lost. Indexed only by age (no sex, year, or location).
tmrle = theoretical_minimum_risk_life_expectancy()
print(tmrle.head(5).to_string(index=False))
 age_start  age_end  value
       0.0      1.0   98.0
       1.0      2.0   98.0
       2.0      3.0   98.0
       3.0      4.0   98.0
       4.0      5.0   98.0

Overriding the default data

By default, every data source loads from the artifact (configured via BASE_PLUGINS in these examples). You can bypass the artifact and supply data directly through data_sources - pass a DataFrame, callable, or literal string:

import pandas as pd
from vivarium.engine import InteractiveContext
from vivarium.public_health.population import BasePopulation
from vivarium.public_health._example_data import population_structure

# Build population structure data (same layout as the data key).
pop_data = population_structure()

config = make_base_config()
config.update(
    {
        "population": {
            "population_size": 5_000,
            "data_sources": {
                "population_structure": pop_data,
                "location": "Kenya",
            },
        },
        "mortality": {"data_sources": {"all_cause_mortality_rate": 0}},
    },
    layer="override",
)

sim = InteractiveContext(
    components=[BasePopulation()],
    configuration=config,
    plugin_configuration=base_plugins,
)

pop = sim.get_population(["age", "sex", "location"])
assert len(pop) == 5_000
assert (pop["location"] == "Kenya").all()
print(f"Population: {len(pop)}, location: {pop['location'].iloc[0]}")
Population: 5000, location: Kenya

Default configuration

The minimal configuration sets only population_size. Everything else has sensible defaults (ages 0-125, both sexes, no age-out):

from vivarium.engine import InteractiveContext
from vivarium.public_health.population import BasePopulation

config = make_base_config()
config.update(
    {
        "population": {
            "population_size": 10_000,
        },
        # Override mortality to zero so simulants don't die during
        # this demonstration.
        "mortality": {"data_sources": {"all_cause_mortality_rate": 0}},
    },
    layer="override",
)

sim = InteractiveContext(
    components=[BasePopulation()],
    configuration=config,
    plugin_configuration=base_plugins,
)

pop = sim.get_population(["age", "sex", "location"])
assert len(pop) == 10_000
assert pop["age"].min() >= 0
assert pop["age"].max() <= 125
assert set(pop["sex"].unique()) == {"Male", "Female"}
print(f"Population: {len(pop)}")
Population: 10000

Custom age range

Use initialization_age_min and initialization_age_max to restrict the age range of the initial population. This is the most common customization:

config = make_base_config()
config.update(
    {
        "population": {
            "population_size": 10_000,
            "initialization_age_min": 0,
            "initialization_age_max": 5,
        },
        "mortality": {"data_sources": {"all_cause_mortality_rate": 0}},
    },
    layer="override",
)

sim = InteractiveContext(
    components=[BasePopulation()],
    configuration=config,
    plugin_configuration=base_plugins,
)

pop = sim.get_population(["age"])
assert pop["age"].min() >= 0
assert pop["age"].max() < 5
print(f"All ages in [0, 5): {pop['age'].min() >= 0 and pop['age'].max() < 5}")
All ages in [0, 5): True
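
One way to picture the effect of an age window on the population structure is to clip each age bin to the window and keep a proportional share of its count. The sketch below does that under the assumption that counts are spread evenly within a bin. restrict_to_age_window and the counts are hypothetical, and the component may handle partial bins differently.

```python
def restrict_to_age_window(cells, age_min, age_max):
    """Hypothetical helper: keep the part of each bin inside [age_min, age_max).

    Assumption: counts are spread evenly within a bin, so a partially
    covered bin keeps a proportional share of its count.
    """
    restricted = []
    for age_start, age_end, count in cells:
        lo = max(age_start, age_min)
        hi = min(age_end, age_max)
        if hi <= lo:
            # The bin lies entirely outside the window.
            continue
        fraction = (hi - lo) / (age_end - age_start)
        restricted.append((lo, hi, count * fraction))
    return restricted


# (age_start, age_end, count) with made-up counts for illustration.
cells = [(0.0, 1.0, 100.0), (1.0, 5.0, 400.0), (5.0, 10.0, 500.0)]

for lo, hi, count in restrict_to_age_window(cells, age_min=0, age_max=5):
    print(f"[{lo}, {hi}): {count:.1f}")
```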

Single-age initialization (newborns)

When initialization_age_min equals initialization_age_max, all simulants start at essentially the same age; their ages are spread only across the first time step. This can be combined with fertility components to represent a cohort of newborns:

config = make_base_config()
config.update(
    {
        "population": {
            "population_size": 1_000,
            "initialization_age_min": 0,
            "initialization_age_max": 0,
        },
        "mortality": {"data_sources": {"all_cause_mortality_rate": 0}},
    },
    layer="override",
)

sim = InteractiveContext(
    components=[BasePopulation()],
    configuration=config,
    plugin_configuration=base_plugins,
)

pop = sim.get_population(["age"])
# All simulants are newborns; ages are smoothed within the first time step.
assert pop["age"].max() < 1.0
print(f"All simulants under 1 year old: {pop['age'].max() < 1.0}")
All simulants under 1 year old: True
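
The smoothing mentioned in the comment above can be pictured as adding a random fraction of one time step to the shared starting age. The sketch below assumes a uniform draw and a 365-day year. single_age_cohort is a hypothetical helper, and the one-day step size is only an illustrative value, not necessarily what make_base_config() uses.

```python
import random

DAYS_PER_YEAR = 365.0  # simplification for the sketch


def single_age_cohort(age, step_size_days, population_size, seed=0):
    """Hypothetical helper: ages spread uniformly over one time step."""
    rng = random.Random(seed)
    step_years = step_size_days / DAYS_PER_YEAR
    # Assumption: each simulant gets age + U[0, 1) * step length in years.
    return [age + rng.random() * step_years for _ in range(population_size)]


ages = single_age_cohort(age=0, step_size_days=1, population_size=1_000)
print(f"min age: {min(ages):.6f}, max age: {max(ages):.6f}")
```

Under these assumptions every age stays within one step length of the configured age, which is why the check against 1.0 in the example holds comfortably.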

Filtering by sex

The include_sex option restricts the population to a single sex. Valid values are "Male", "Female", or "Both" (the default):

config = make_base_config()
config.update(
    {
        "population": {
            "population_size": 10_000,
            "include_sex": "Female",
        },
        "mortality": {"data_sources": {"all_cause_mortality_rate": 0}},
    },
    layer="override",
)

sim = InteractiveContext(
    components=[BasePopulation()],
    configuration=config,
    plugin_configuration=base_plugins,
)

pop = sim.get_population(["sex"])
assert (pop["sex"] == "Female").all()
print(f"All Female: {len(pop)}")
All Female: 10000

Aging out of a simulation

Setting untracking_age causes simulants to be removed from the tracked population once they reach that age (see the vivarium population concepts documentation for more on untracking). This is useful when a model only cares about a specific age window. The is_aged_out column is populated by the AgeOutSimulants sub-component when untracking_age is set:

config = make_base_config()
config.update(
    {
        "population": {
            "population_size": 10_000,
            "initialization_age_min": 4,
            "initialization_age_max": 4,
            "untracking_age": 5,
        },
        "time": {"step_size": 100},
        "mortality": {"data_sources": {"all_cause_mortality_rate": 0}},
    },
    layer="override",
)

sim = InteractiveContext(
    components=[BasePopulation()],
    configuration=config,
    plugin_configuration=base_plugins,
)

# All 4-year-olds at the start
print(f"Tracked: {len(sim.get_population(['age']))}")
Tracked: 10000
# After taking 6 steps of 100 days (~1.6 years), everyone has aged past 5
sim.take_steps(number_of_steps=6)
pop = sim.get_population(["is_aged_out", "exit_time"], include_untracked=True)
print(f"Aged out: {pop['is_aged_out'].sum()}")
Aged out: 10000
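
The arithmetic behind this example can be checked by hand. The sketch below uses a 365-day year and assumes a simulant ages out once its age is at least untracking_age (whether the comparison is strict is not stated here); the starting ages are illustrative values within one 100-day step of age 4.

```python
import math

DAYS_PER_YEAR = 365.0  # simplification for the sketch

untracking_age = 5.0
step_size_days = 100
number_of_steps = 6

# Illustrative starting ages: a cohort initialized at age 4, spread over one step.
start_ages = [4.0, 4.1, 4.27]

elapsed_years = step_size_days * number_of_steps / DAYS_PER_YEAR
final_ages = [age + elapsed_years for age in start_ages]
# Assumption: aging out happens once age >= untracking_age.
aged_out = [age >= untracking_age for age in final_ages]

# Fewest steps after which even the youngest simulant has reached the threshold.
steps_needed = math.ceil(
    (untracking_age - min(start_ages)) * DAYS_PER_YEAR / step_size_days
)

print(f"Elapsed years: {elapsed_years:.2f}")
print(f"All aged out: {all(aged_out)}")
print(f"Steps needed: {steps_needed}")
```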

Configuration summary for BasePopulation

| Key | Default | Description |
|---|---|---|
| population.population_size | 10000 | Number of simulants to create. |
| population.data_sources.population_structure | internal method (loads population.structure) | Population structure data. Accepts a DataFrame, callable, or data key. |
| population.data_sources.location | internal method (loads population.location) | Location string. Accepts a scalar string, callable, or data key. |
| population.initialization_age_min | 0 | Minimum age (years) for the initial population. |
| population.initialization_age_max | 125 | Maximum age (years) for the initial population. |
| population.include_sex | "Both" | "Male", "Female", or "Both". |
| population.untracking_age | None | Age at which simulants are removed from the tracked population. None means no age-out. |
| mortality.data_sources.all_cause_mortality_rate | "cause.all_causes.cause_specific_mortality_rate" | All-cause mortality rate. Accepts a scalar, DataFrame, callable, or data key. |
| mortality.data_sources.life_expectancy | "population.theoretical_minimum_risk_life_expectancy" | Remaining life expectancy by age. Accepts a scalar, DataFrame, callable, or data key. |
| mortality.data_sources.unmodeled_cause_specific_mortality_rate | internal method | CSMR for unmodeled causes. Accepts a scalar, DataFrame, callable, or data key. |

ScaledPopulation

ScaledPopulation works like BasePopulation but multiplies the population structure by a scaling factor before sampling. This is useful when simulants represent a subset of the real population (for example, only the population eligible for an intervention).

The scaling factor can be either a pandas.DataFrame with the same demographic index as the population structure, or a string data key that resolves to such a DataFrame.

ScaledPopulation uses the same data sources as BasePopulation (see Data consumed by BasePopulation) plus the user-supplied scaling factor. The example below passes the scaling factor as a DataFrame:

import numpy as np
import pandas as pd
from vivarium.engine import InteractiveContext
from vivarium.public_health.population import ScaledPopulation
from vivarium.public_health._example_data import population_structure

config = make_base_config()
config.update(
    {
        "population": {
            "population_size": 100_000,
            "include_sex": "Both",
        },
        "mortality": {"data_sources": {"all_cause_mortality_rate": 0}},
    },
    layer="override",
)

# Build a scaling factor DataFrame - same age, sex, and year index as
# population.structure (the location column is dropped), with a ``value``
# per cell. Each cell's population count is multiplied by its scaling value.
scalar_data = (
    population_structure()
    .query("year_start == 1990")
    .drop(columns=["location"])
    .copy()
)
scalar_data["value"] = np.linspace(0.5, 2.0, len(scalar_data))
print(scalar_data.head(6).to_string(index=False))
 age_start  age_end    sex  year_start  year_end    value
  0.000000 0.019178   Male        1990      1991 0.500000
  0.000000 0.019178 Female        1990      1991 0.533333
  0.019178 0.076712   Male        1990      1991 0.566667
  0.019178 0.076712 Female        1990      1991 0.600000
  0.076712 1.000000   Male        1990      1991 0.633333
  0.076712 1.000000 Female        1990      1991 0.666667
# Pass the DataFrame directly.
sim = InteractiveContext(
    components=[ScaledPopulation(scalar_data)],
    configuration=config,
    plugin_configuration=base_plugins,
)

pop = sim.get_population(["age", "sex"])
assert len(pop) == 100_000
assert set(pop["sex"].unique()) == {"Male", "Female"}
print(f"Population: {len(pop)}, sexes: {sorted(pop['sex'].unique())}")
Population: 100000, sexes: ['Female', 'Male']
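
The rescaling step itself is a cell-by-cell multiplication followed by renormalization into sampling weights. The sketch below reproduces it with plain dictionaries, using the counts and scaling values printed in the examples on this page. It illustrates the arithmetic only, and treating the normalized products as sampling weights is an assumption about how the scaled structure is used.

```python
# (age_start, sex) -> population count, from the 1990 example rows.
structure = {
    (0.0, "Male"): 1.917808,
    (0.0, "Female"): 1.917808,
    (0.019178, "Male"): 5.753425,
    (0.019178, "Female"): 5.753425,
    (0.076712, "Male"): 92.328767,
    (0.076712, "Female"): 92.328767,
}

# (age_start, sex) -> scaling value, from the printed scaling factor rows.
scaling = {
    (0.0, "Male"): 0.500000,
    (0.0, "Female"): 0.533333,
    (0.019178, "Male"): 0.566667,
    (0.019178, "Female"): 0.600000,
    (0.076712, "Male"): 0.633333,
    (0.076712, "Female"): 0.666667,
}

# Multiply each cell's count by its scaling value.
scaled = {cell: count * scaling[cell] for cell, count in structure.items()}

# Assumption: sampling weights are the scaled counts normalized to sum to 1.
total = sum(scaled.values())
weights = {cell: value / total for cell, value in scaled.items()}

for cell, weight in weights.items():
    print(cell, f"{scaled[cell]:.6f}", f"{weight:.4f}")
```

Within each age bin the female cell ends up with a slightly larger weight than the male cell, because the two counts are equal and the female scaling value is larger.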

Fertility Components

Fertility components add new simulants during the simulation to model births. They are paired with a population component such as BasePopulation.

Note

All three fertility components create newborns with age_start=0 and age_end=0, so every new simulant enters the simulation at age 0.

FertilityDeterministic

FertilityDeterministic adds a fixed number of new simulants each year. This is the simplest fertility model and does not require any external data.

from vivarium.engine import InteractiveContext
from vivarium.public_health.population import BasePopulation, FertilityDeterministic

config = make_base_config()
config.update(
    {
        "population": {
            "population_size": 1_000,
            "initialization_age_min": 0,
            "initialization_age_max": 100,
        },
        "time": {"step_size": 10},
        "mortality": {"data_sources": {"all_cause_mortality_rate": 0}},
        "fertility": {"number_of_new_simulants_each_year": 500},
    },
    layer="override",
)

sim = InteractiveContext(
    components=[BasePopulation(), FertilityDeterministic()],
    configuration=config,
    plugin_configuration=base_plugins,
)

sim.take_steps(number_of_steps=10)
pop = sim.get_population(["age"])
# Population grew from 1000 by ~500 * (100/365) ≈ 137 new simulants.
assert len(pop) > 1_000
print(f"Population grew: {len(pop) > 1_000}")
Population grew: True
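
The estimate in the comment above follows from prorating the yearly count over the simulated time. The sketch below computes it, along with one plausible way to turn a fractional per-step count into whole simulants by carrying the remainder forward. The carry-over scheme and both function names are assumptions for illustration, not a description of the component's internals.

```python
DAYS_PER_YEAR = 365.0  # simplification for the sketch


def expected_births(per_year, step_size_days, number_of_steps):
    """Yearly count prorated over the simulated duration."""
    return per_year * step_size_days * number_of_steps / DAYS_PER_YEAR


def births_per_step(per_year, step_size_days, number_of_steps):
    """Hypothetical scheme: add whole simulants, carry the fraction forward."""
    per_step = per_year * step_size_days / DAYS_PER_YEAR
    carried = 0.0
    births = []
    for _ in range(number_of_steps):
        carried += per_step
        whole = int(carried)
        births.append(whole)
        carried -= whole
    return births


expected = expected_births(500, step_size_days=10, number_of_steps=10)
births = births_per_step(500, step_size_days=10, number_of_steps=10)
print(f"Expected births: {expected:.1f}")
print(f"Whole births per step: {births}")
```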

FertilityCrudeBirthRate

FertilityCrudeBirthRate models births at the population level using a crude birth rate - the number of live births per unit of population, regardless of age or sex structure. Because it does not consider the demographic composition of the population, the number of births depends only on the total population size and the overall birth rate. This contrasts with FertilityAgeSpecificRates, which models births at the individual level using rates that vary by age.

It requires initialization_age_min to be 0.

The live_births_by_sex data should contain a row for each year × sex combination:

from vivarium.public_health._example_data import live_births_by_sex

# covariate.live_births_by_sex.estimate - each row gives the number of
# live births for a year × sex combination.
print(live_births_by_sex().head(6).to_string(index=False))
 year_start  year_end    sex  parameter  value
       1990      1991 Female mean_value  500.0
       1990      1991   Male mean_value  500.0
       1991      1992 Female mean_value  500.0
       1991      1992   Male mean_value  500.0
       1992      1993 Female mean_value  500.0
       1992      1993   Male mean_value  500.0

Both of the component's data sources, population_structure and live_births_by_sex, can be supplied via configuration:

from vivarium.engine import InteractiveContext
from vivarium.public_health.population import BasePopulation, FertilityCrudeBirthRate
from vivarium.public_health._example_data import population_structure, live_births_by_sex

config = make_base_config()
config.update(
    {
        "population": {
            "population_size": 10_000,
            "initialization_age_min": 0,
            "initialization_age_max": 125,
            "data_sources": {
                "population_structure": population_structure(),
                "location": "Kenya",
            },
        },
        "time": {"step_size": 10},
        "mortality": {"data_sources": {"all_cause_mortality_rate": 0}},
        "fertility": {
            "data_sources": {
                "population_structure": population_structure(),
                "live_births_by_sex": live_births_by_sex(),
            },
        },
    },
    layer="override",
)

sim = InteractiveContext(
    components=[BasePopulation(), FertilityCrudeBirthRate()],
    configuration=config,
    plugin_configuration=base_plugins,
)

sim.take_steps(number_of_steps=10)
pop = sim.get_population(["age"])
assert len(pop) > 10_000
print(f"Population grew: {len(pop) > 10_000}")
Population grew: True
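
For intuition, a crude birth rate is live births divided by the population that produced them, and the expected number of births in a step scales with the simulated population size and the step length. The sketch below works through that arithmetic. The live-birth total comes from the example data (500 per sex per year), while the reference population of 50,000 is a made-up figure, and the component may compute or randomize births differently.

```python
DAYS_PER_YEAR = 365.0  # simplification for the sketch


def crude_birth_rate(live_births, population):
    """Live births per person per year."""
    return live_births / population


def expected_births_per_step(cbr, simulated_population, step_size_days):
    """Expected births in one step for a simulated population of a given size."""
    return cbr * simulated_population * step_size_days / DAYS_PER_YEAR


live_births = 500.0 + 500.0  # Female + Male rows for one year in the example data
reference_population = 50_000.0  # hypothetical total from a population structure

cbr = crude_birth_rate(live_births, reference_population)
expected = expected_births_per_step(cbr, simulated_population=10_000, step_size_days=10)
print(f"Crude birth rate: {cbr:.3f} per person-year")
print(f"Expected births per 10-day step: {expected:.2f}")
```

Note that neither function looks at age or sex: only the total population size and the overall rate matter.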

Important

FertilityCrudeBirthRate requires initialization_age_min to be 0. It will raise a ValueError if this is not the case.

FertilityAgeSpecificRates

FertilityAgeSpecificRates models fertility at the individual level. Each living female simulant who has not given birth in the last nine months has a chance of giving birth determined by age-specific fertility rates. Newborns are linked to their parent via a parent_id column.
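
The eligibility rule and the per-step chance of a birth can be sketched as follows. The Simulant class, the 9 × 30.5-day approximation of nine months, and the exponential rate-to-probability conversion are all assumptions made for illustration; the component's own bookkeeping (for example the last_birth_time column) is not reproduced here.

```python
import math
from dataclasses import dataclass
from typing import Optional

NINE_MONTHS_DAYS = 9 * 30.5  # rough approximation (assumption)
DAYS_PER_YEAR = 365.0  # simplification for the sketch


@dataclass
class Simulant:
    """Hypothetical stand-in for one row of the population table."""

    sex: str
    alive: bool
    days_since_last_birth: Optional[float]  # None if no previous birth


def is_eligible(simulant):
    """Living females with no birth in the last nine months."""
    if simulant.sex != "Female" or not simulant.alive:
        return False
    last = simulant.days_since_last_birth
    return last is None or last > NINE_MONTHS_DAYS


def birth_probability(annual_rate, step_size_days):
    """Assumption: convert a yearly rate to a per-step probability as 1 - exp(-rate * dt)."""
    return 1.0 - math.exp(-annual_rate * step_size_days / DAYS_PER_YEAR)


simulants = [
    Simulant("Female", True, None),
    Simulant("Female", True, 100.0),
    Simulant("Male", True, None),
    Simulant("Female", True, 400.0),
    Simulant("Female", False, None),
]

print([is_eligible(s) for s in simulants])
print(f"{birth_probability(0.05, step_size_days=10):.6f}")
```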

The expected data shape is one row per age × year × sex × parameter combination:

from vivarium.public_health._example_data import age_specific_fertility_rate

# covariate.age_specific_fertility_rate.estimate - each row gives a
# fertility rate for an age × year × sex × parameter cell.
asfr_data = age_specific_fertility_rate(rate=0.05)
print(asfr_data.head(6).to_string(index=False))
 year_start  year_end  age_start  age_end    sex   parameter  value
       1990      1991        0.0 0.019178 Female  mean_value   0.05
       1990      1991        0.0 0.019178 Female lower_value   0.05
       1990      1991        0.0 0.019178 Female upper_value   0.05
       1990      1991        0.0 0.019178   Male  mean_value   0.05
       1990      1991        0.0 0.019178   Male lower_value   0.05
       1990      1991        0.0 0.019178   Male upper_value   0.05

The tutorial example below supplies a constant rate directly:

from vivarium.engine import InteractiveContext
from vivarium.public_health.population import BasePopulation, FertilityAgeSpecificRates

config = make_base_config()
config.update(
    {
        "population": {
            "population_size": 1_000,
            "initialization_age_min": 0,
            "initialization_age_max": 125,
        },
        "time": {"step_size": 10},
        "mortality": {"data_sources": {"all_cause_mortality_rate": 0}},
        # Override the fertility rate via data_sources configuration.
        "fertility_age_specific_rates": {
            "data_sources": {
                "age_specific_fertility_rate": 0.05,
            },
        },
    },
    layer="override",
)

sim = InteractiveContext(
    components=[BasePopulation(), FertilityAgeSpecificRates()],
    configuration=config,
    plugin_configuration=base_plugins,
)

sim.take_steps(number_of_steps=100)
pop = sim.get_population(["age", "parent_id", "last_birth_time"])

# Newborns have a parent_id pointing to their mother
newborns = pop[pop["parent_id"] >= 0]
assert len(newborns) > 0
print(f"Births occurred: {len(newborns) > 0}")
Births occurred: True

Fertility configuration summary

| Component | Configuration key | Default data keys | Notes |
|---|---|---|---|
| FertilityDeterministic | fertility.number_of_new_simulants_each_year | None | Simplest model; fixed birth count. Pure configuration. |
| FertilityCrudeBirthRate | fertility.data_sources.population_structure, fertility.data_sources.live_births_by_sex, fertility.time_dependent_live_births, fertility.time_dependent_population_fraction | covariate.live_births_by_sex.estimate, population.structure (defaults) | Requires initialization_age_min == 0. Supports data_sources overrides (DataFrame, callable, data key). |
| FertilityAgeSpecificRates | fertility_age_specific_rates.data_sources.age_specific_fertility_rate | covariate.age_specific_fertility_rate.estimate (default) | Supports data_sources overrides (scalar, DataFrame, callable). Tracks parent-child relationships. |