Pulling Data

vivarium_inputs provides an interface for pulling GBD data and auxiliary data. Use this interface to examine data you want to use in a model, to confirm that it passes all validations and looks as you expect. You have two choices for pulling data: simulation-prepped data, via get_measure() and its companion functions, or raw GBD data, via get_raw_data().

Both of the above methods can retrieve entity-measure data (e.g., prevalence data for a cause or exposure for a risk factor), population structure, and life expectancy. The functions that retrieve the extents of certain demographic variables, vivarium_inputs.interface.get_age_bins() and vivarium_inputs.interface.get_demographic_dimensions(), are somewhat orthogonal to this choice and involve the same data modifications as calling vivarium_inputs.interface.get_measure().

Note

The data returned by the interface may occasionally change format as vivarium_inputs is updated and/or the actual underlying GBD data changes. Therefore, the examples we provide below may not exactly match what you see.

Which should I use: get_measure() or get_raw_data()?

Typically, you should prefer get_measure over get_raw_data. get_measure will produce simulation-prepped data. If get_measure fails, or the data it returns doesn’t match your expectations, then get_raw_data might provide some insight into what is happening.
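
The workflow described above (try the simulation-prepped pull first, fall back to raw data to investigate) can be sketched as follows. This is a minimal sketch and not part of vivarium_inputs: pull_with_fallback is a hypothetical helper, and the two pulling functions are passed in as arguments so the pattern can be tried with stand-ins when GBD access is unavailable.

```python
from typing import Any, Callable, Tuple


def pull_with_fallback(
    pull_prepped: Callable[..., Any],
    pull_raw: Callable[..., Any],
    **kwargs: Any,
) -> Tuple[str, Any]:
    """Try the simulation-prepped pull first; on failure, return raw data.

    Hypothetical helper. In practice pull_prepped would be
    vivarium_inputs.get_measure and pull_raw vivarium_inputs.get_raw_data.
    """
    try:
        return "prepped", pull_prepped(**kwargs)
    except Exception as error:  # broad on purpose: any failure triggers the fallback
        print(f"Prepped pull failed ({error!r}); pulling raw data to investigate.")
        return "raw", pull_raw(**kwargs)


# Stand-ins that mimic a failing prepped pull, so the sketch runs without GBD access.
def failing_prepped(**kwargs: Any) -> Any:
    raise ValueError("validation failed")


def fake_raw(**kwargs: Any) -> Any:
    return {"measure": kwargs["measure"], "rows": []}


kind, data = pull_with_fallback(
    failing_prepped, fake_raw, measure="prevalence", location="Kenya"
)
print(kind)
```

With the real functions, the keyword arguments would be the same entity, measure, and location values used in the examples later on this page.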

Pulling Simulation-Prepped Data

For simulation-prepped data, the interface provides separate methods to pull entity-measure data, population structure, and life expectancy data. It also provides methods to pull age bin data and demographic dimensions. Simulation-prepped data has had GBD IDs replaced with meaningful values or ranges and has been expanded over all demographic dimensions. We’ll walk through how to pull data using each of these functions.

Entity-Measure Data

The interface provides get_measure for pulling location-specific measure data for an entity (e.g., a cause from gbd_mapping). The measure is the descriptor of the data you want to pull (e.g., ‘prevalence’ or ‘relative_risk’). A list of possible measures for each entity kind is included in the table below.

Note

To pull simulation-prepped entity-measure data, you must have plenty of available memory; please request at least 50 GB.

Note

The simulation-prepped data returned by get_measure has all demographic and year values set as the index with only draw-level data as columns.

For example, to pull prevalence data for diarrheal diseases in Kenya, we would do the following:

from gbd_mapping import causes
from vivarium_inputs import get_measure

prev = get_measure(
    entity=causes.diarrheal_diseases,
    measure='prevalence',
    location='Kenya',
    data_type="draws",
)
print(prev.head())
                                                        draw_0  ...  draw_499
location sex    age_start age_end  year_start year_end            ...
Kenya    Female 0.000000  0.019178 2021       2022      0.018762  ...  0.018243
                0.019178  0.076712 2021       2022      0.041142  ...  0.041379
                0.076712  0.500000 2021       2022      0.040640  ...  0.042404
                0.500000  1.000000 2021       2022      0.026530  ...  0.029795
                1.000000  2.000000 2021       2022      0.011624  ...  0.014232

The following table lists the measures available for each entity kind:

Available Entity-Measure Pairs

Entity Kind                Measures
sequela                    incidence, prevalence, birth_prevalence, disability_weight
cause                      incidence, prevalence, birth_prevalence, disability_weight, remission, cause_specific_mortality, excess_mortality
risk_factor                exposure, exposure_standard_deviation, exposure_distribution_weights, relative_risk, population_attributable_fraction, mediation_factors
alternative_risk_factor    exposure, exposure_standard_deviation, exposure_distribution_weights
etiology                   population_attributable_fraction
covariate                  estimate
healthcare_entity          cost, utilization
health_technology          cost
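
Because a mistyped measure name only surfaces once a pull is attempted, it can help to check the pair against the table first. The sketch below transcribes the table into a dictionary; PREPPED_MEASURES and is_valid_pair are hypothetical names for illustration and are not provided by vivarium_inputs.

```python
# Measures per entity kind, transcribed by hand from the table above.
PREPPED_MEASURES = {
    "sequela": {"incidence", "prevalence", "birth_prevalence", "disability_weight"},
    "cause": {
        "incidence", "prevalence", "birth_prevalence", "disability_weight",
        "remission", "cause_specific_mortality", "excess_mortality",
    },
    "risk_factor": {
        "exposure", "exposure_standard_deviation", "exposure_distribution_weights",
        "relative_risk", "population_attributable_fraction", "mediation_factors",
    },
    "alternative_risk_factor": {
        "exposure", "exposure_standard_deviation", "exposure_distribution_weights",
    },
    "etiology": {"population_attributable_fraction"},
    "covariate": {"estimate"},
    "healthcare_entity": {"cost", "utilization"},
    "health_technology": {"cost"},
}


def is_valid_pair(entity_kind: str, measure: str) -> bool:
    """Return True if the table lists this measure for this entity kind."""
    return measure in PREPPED_MEASURES.get(entity_kind, set())


print(is_valid_pair("cause", "excess_mortality"))
print(is_valid_pair("etiology", "exposure"))
```

The first call reports a listed pair and the second an unlisted one, so a check like this can guard a get_measure call in a script.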

Population Structure Data

To pull population data for a specific location, vivarium_inputs.interface provides get_population_structure, which returns population data in the input format expected by a simulation.

For example, to pull population data for Kenya, we would do the following:

from vivarium_inputs import get_population_structure

pop = get_population_structure(location='Kenya')
print(pop.head())
                                                                value
location sex    age_start age_end  year_start year_end
Kenya    Female 0.000000  0.019178 2021       2022       10995.345135
                0.019178  0.076712 2021       2022       32740.129897
                0.076712  0.500000 2021       2022      241157.325386
                0.500000  1.000000 2021       2022      283195.389282
                1.000000  2.000000 2021       2022      575233.481802

Life Expectancy Data

To pull life expectancy data, vivarium_inputs.interface provides get_theoretical_minimum_risk_life_expectancy, which returns life expectancy data in the input format expected by a simulation. Because life expectancy is not location-specific, the function takes no arguments.

To use:

from vivarium_inputs import get_theoretical_minimum_risk_life_expectancy

life_exp = get_theoretical_minimum_risk_life_expectancy()
print(life_exp.head())
                       value
age_start age_end
0.00      0.01     89.958040
0.01      0.02     89.975474
0.02      0.03     89.990990
0.03      0.04     89.985077
0.04      0.05     89.979164

Age Bin Data

To see which GBD age bins appear in age-specific data, vivarium_inputs provides get_age_bins, which returns the start, end, and name of each GBD age bin expected to appear in age-specific data (with the exception of life expectancy, which uses its own age ranges).

from vivarium_inputs import get_age_bins

age_bins = get_age_bins()
print(age_bins.reset_index().head())
   age_start   age_end   age_group_name
0   0.000000  0.019178   Early Neonatal
1   0.019178  0.076712    Late Neonatal
2   0.076712  0.500000       1-5 months
3   0.500000  1.000000      6-11 months
4   1.000000  2.000000  12 to 23 months
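
The fractional bin edges are easier to read once related to days. The sketch below assumes the edges are expressed in years of 365 days, which is consistent with the printed values (7 and 28 days for the neonatal bins) but is an inference from the output rather than something stated here. AGE_BINS is copied by hand from the first five rows above, and find_age_bin is a hypothetical helper.

```python
from bisect import bisect_right

# (age_start, age_end, age_group_name), copied from the printed output above.
AGE_BINS = [
    (0.000000, 0.019178, "Early Neonatal"),
    (0.019178, 0.076712, "Late Neonatal"),
    (0.076712, 0.500000, "1-5 months"),
    (0.500000, 1.000000, "6-11 months"),
    (1.000000, 2.000000, "12 to 23 months"),
]


def find_age_bin(age_in_years):
    """Return the name of the bin with age_start <= age < age_end, else None."""
    starts = [start for start, _, _ in AGE_BINS]
    position = bisect_right(starts, age_in_years) - 1
    if position < 0:
        return None
    _, end, name = AGE_BINS[position]
    return name if age_in_years < end else None


# Assumption: bin edges are days divided by 365.
print(round(7 / 365, 6))       # 0.019178
print(round(28 / 365, 6))      # 0.076712
print(find_age_bin(10 / 365))  # Late Neonatal
```

Ages outside the five copied rows return None here only because the list is truncated; the full result of get_age_bins covers more bins.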

Demographic Dimensions Data

Finally, to view the full extent of all demographic dimensions that are expected in input data to the simulation, vivarium_inputs provides get_demographic_dimensions, which expects a location argument to fill the location dimension.

from vivarium_inputs import get_demographic_dimensions

dem_dims = get_demographic_dimensions(location='Kenya')
print(dem_dims.reset_index().head())
  location     sex  age_start   age_end  year_start  year_end
0    Kenya  Female   0.000000  0.019178        2021      2022
1    Kenya  Female   0.019178  0.076712        2021      2022
2    Kenya  Female   0.076712  0.500000        2021      2022
3    Kenya  Female   0.500000  1.000000        2021      2022
4    Kenya  Female   1.000000  2.000000        2021      2022

Pulling Raw GBD Data

The interface provides get_raw_data, which can be used to pull entity-measure data as well as population structure and life expectancy. Validation checks on the raw data are not performed, so that the returned data can be investigated for oddities. The only filtering that occurs is by applicable measure id, metric id, or to most detailed causes where relevant. No formatting or reshaping of the data is done. The following sections detail how to pull each type of data.

Entity-Measure Data

The interface provides get_raw_data for pulling specific raw measure data for an entity for a single location from GBD, without the prep work that occurs on data for a simulation.

entity should be a gbd_mapping.base_template.ModelableEntity (e.g., a cause from gbd_mapping), while measure should be a string describing the measure for which you want to retrieve data (e.g., ‘prevalence’ or ‘relative_risk’). A list of possible measures for each entity kind is included in the table below. Finally, location should be the string location for which you want to pull data (e.g., ‘Ethiopia’), in the form used by GBD (e.g., ‘United States’ instead of ‘USA’).

For example, to pull draw-level raw prevalence data for diarrheal diseases in Kenya, we would do the following:

from gbd_mapping import causes
from vivarium_inputs import get_raw_data

prev = get_raw_data(
    entity=causes.diarrheal_diseases,
    measure='prevalence',
    location='Kenya',
    data_type="draws",
)
print(prev.head())
    age_group_id  cause_id    draw_0  ...  year_id  metric_id  version_id
50             2       302  0.018762  ...     2021          3        1471
51             3       302  0.041142  ...     2021          3        1471
52             6       302  0.014616  ...     2021          3        1471
53             7       302  0.023237  ...     2021          3        1471
54             8       302  0.024702  ...     2021          3        1471
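
The raw output keeps GBD identifiers such as age_group_id, whereas the simulation-prepped output for the same pull shows age ranges. The sketch below illustrates that kind of replacement on two hand-copied rows. The lookup covers only the two neonatal age groups, with ranges taken from the simulation-prepped example earlier on this page; AGE_GROUP_RANGES and label_age_groups are hypothetical stand-ins, not the code vivarium_inputs runs.

```python
# Illustrative lookup for two age groups only; the full mapping lives in GBD
# and is assumed here rather than queried.
AGE_GROUP_RANGES = {
    2: (0.0, 0.019178),       # Early Neonatal
    3: (0.019178, 0.076712),  # Late Neonatal
}


def label_age_groups(rows):
    """Replace age_group_id with age_start and age_end in each row."""
    labeled = []
    for row in rows:
        age_start, age_end = AGE_GROUP_RANGES[row["age_group_id"]]
        new_row = {key: value for key, value in row.items() if key != "age_group_id"}
        new_row["age_start"] = age_start
        new_row["age_end"] = age_end
        labeled.append(new_row)
    return labeled


# Two rows copied from the raw output above (only a few columns kept).
raw_rows = [
    {"age_group_id": 2, "cause_id": 302, "draw_0": 0.018762},
    {"age_group_id": 3, "cause_id": 302, "draw_0": 0.041142},
]

labeled_rows = label_age_groups(raw_rows)
for row in labeled_rows:
    print(row)
```

Looking at raw rows this way makes it easier to line them up with the corresponding simulation-prepped rows when the two disagree.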

The following table lists the measures available for each entity kind for pulling raw data:

Available Entity-Measure Pairs

Entity Kind                Measures
sequela                    incidence, prevalence, birth_prevalence, disability_weight
cause                      incidence, prevalence, birth_prevalence, disability_weight, remission, deaths
risk_factor                exposure, exposure_standard_deviation, exposure_distribution_weights, relative_risk, population_attributable_fraction, mediation_factors
alternative_risk_factor    exposure, exposure_standard_deviation, exposure_distribution_weights
etiology                   population_attributable_fraction
covariate                  estimate
healthcare_entity          cost, utilization
health_technology          cost
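
The raw-data table matches the simulation-prepped table except for the cause entity kind. A quick set comparison makes the difference explicit; the two sets below are transcribed by hand from the cause entries of the two tables on this page, and their names are chosen for illustration only.

```python
# Cause measures listed for simulation-prepped data (get_measure).
PREPPED_CAUSE_MEASURES = {
    "incidence", "prevalence", "birth_prevalence", "disability_weight",
    "remission", "cause_specific_mortality", "excess_mortality",
}

# Cause measures listed for raw data (get_raw_data).
RAW_CAUSE_MEASURES = {
    "incidence", "prevalence", "birth_prevalence", "disability_weight",
    "remission", "deaths",
}

only_prepped = sorted(PREPPED_CAUSE_MEASURES - RAW_CAUSE_MEASURES)
only_raw = sorted(RAW_CAUSE_MEASURES - PREPPED_CAUSE_MEASURES)

print(only_prepped)  # ['cause_specific_mortality', 'excess_mortality']
print(only_raw)      # ['deaths']
```

So when investigating a mortality measure with get_raw_data, the measure to request for a cause is 'deaths'.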

Population Structure Data

To pull raw population data for a specific location, we will actually use the same get_raw_data function we used for pulling entity-measure data, with a special Population entity.

For example, to pull population data for Kenya, we would do the following:

from vivarium_inputs import get_raw_data
from vivarium_inputs.globals import Population

pop = get_raw_data(entity=Population(), measure='structure', location='Kenya')
print(pop.head())
   age_group_id  location_id  year_id  sex_id    population  run_id
0             2          180     2021       1  1.145138e+04     359
1             3          180     2021       1  3.402961e+04     359
2             6          180     2021       1  3.187225e+06     359
3             7          180     2021       1  3.264795e+06     359
4             8          180     2021       1  2.997167e+06     359

Life Expectancy Data

Similarly, to pull life expectancy data, we will use the same get_raw_data function with the special Population entity. Life expectancy data is not location-specific, so we’ll just use the ‘Global’ location.

To use:

from vivarium_inputs import get_raw_data
from vivarium_inputs.globals import Population

life_exp = get_raw_data(Population(), 'theoretical_minimum_risk_life_expectancy', 'Global')
print(life_exp.head())
    age  life_expectancy
0  0.00        89.958040
1  0.01        89.975474
2  0.02        89.990990
3  0.03        89.985077
4  0.04        89.979164