Pulling Data
vivarium_inputs provides an interface for pulling GBD data and auxiliary
data. Use this interface to examine data that you want to use in a model to
ensure it passes all validations and looks as you expect. You have two choices
for pulling data:
vivarium_inputs.interface.get_measure() pulls data through the same process that simulations use, applying the same validations, transformations, and standardizations as for data used in a simulation.
vivarium_inputs.interface.get_raw_data() pulls raw data from GBD, skipping all validations so that you can explore and investigate it.
Both of the above methods can retrieve entity-measure data (e.g.,
prevalence data for a cause or exposure for a risk factor), population structure,
and life expectancy. The functions that retrieve the extents of certain
demographic variables, vivarium_inputs.interface.get_age_bins() and
vivarium_inputs.interface.get_demographic_dimensions(), are somewhat
orthogonal to this choice and apply the same data modifications as
vivarium_inputs.interface.get_measure().
Note
The data returned by the interface may occasionally change format as
vivarium_inputs is updated and/or the actual underlying GBD data
changes. Therefore, the examples we provide below may not exactly match
what you see.
Which should I use: get_measure() or get_raw_data()?
Typically, you should prefer get_measure over get_raw_data. get_measure produces
simulation-prepped data. If get_measure fails, or the data it returns doesn’t
match your expectations, then get_raw_data might provide some insight into what
is happening.
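The workflow described above can be sketched as a small helper that tries the simulation-prepped pull first and falls back to the raw pull if it raises. This is only an illustrative sketch: pull_for_inspection is a hypothetical helper, not part of vivarium_inputs, and the two pull functions are passed in as arguments so that the pattern can be tried with stand-ins when GBD access is unavailable.

```python
def pull_for_inspection(get_measure_fn, get_raw_data_fn, entity, measure, location):
    """Try the simulation-prepped pull; fall back to the raw pull on failure.

    Hypothetical helper for illustration only. Returns a (kind, data, error)
    tuple, where kind is "prepped" or "raw".
    """
    try:
        return "prepped", get_measure_fn(entity, measure, location), None
    except Exception as error:  # broad on purpose: validation can fail in many ways
        # The raw pull skips validations, so it can show what upset the prepped pull.
        return "raw", get_raw_data_fn(entity, measure, location), error


# Stand-ins so the sketch runs without GBD access.
def failing_get_measure(entity, measure, location):
    raise ValueError("validation failed")


def fake_get_raw_data(entity, measure, location):
    return [{"entity": entity, "measure": measure, "location": location}]


kind, data, error = pull_for_inspection(
    failing_get_measure, fake_get_raw_data, "diarrheal_diseases", "prevalence", "Kenya"
)
print(kind, type(error).__name__)
```

In real use you would pass vivarium_inputs.get_measure and vivarium_inputs.get_raw_data in place of the stand-ins, together with an entity from gbd_mapping.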
Pulling Simulation-Prepped Data
For simulation-prepped data, the interface provides separate methods to pull entity-measure data, population structure, and life expectancy data. Methods to pull age bin data and demographic dimensions are also provided. Simulation-prepped data has had GBD IDs replaced with meaningful values or ranges and has been expanded over all demographic dimensions. We’ll walk through how to pull data using each of these functions.
Entity-Measure Data
The interface provides get_measure
for pulling location-specific measure data for an entity (e.g., a cause from
gbd_mapping). The measure is the descriptor of the data you want to pull
(e.g., ‘prevalence’ or ‘relative_risk’). A list of possible measures for each entity
kind is included in the table below.
Note
To pull simulation-prepped entity-measure data, you need plenty of available memory; please request at least 50 GB.
Note
The simulation-prepped data returned by get_measure
has all demographic and year values set as the index with only draw-level
data as columns.
For example, to pull prevalence data for diarrheal diseases in Kenya, we would do the following:
from gbd_mapping import causes
from vivarium_inputs import get_measure
prev = get_measure(
    entity=causes.diarrheal_diseases,
    measure='prevalence',
    location='Kenya',
    data_type="draws",
)
print(prev.head())
                                                        draw_0    ...  draw_499
location sex    age_start age_end  year_start year_end            ...
Kenya    Female 0.000000  0.019178 2021       2022      0.018762  ...  0.018243
                0.019178  0.076712 2021       2022      0.041142  ...  0.041379
                0.076712  0.500000 2021       2022      0.040640  ...  0.042404
                0.500000  1.000000 2021       2022      0.026530  ...  0.029795
                1.000000  2.000000 2021       2022      0.011624  ...  0.014232
The following table lists the measures available for each entity kind:
| Entity Kind | Measures |
|---|---|
| sequela | incidence, prevalence, birth_prevalence, disability_weight |
| cause | incidence, prevalence, birth_prevalence, disability_weight, remission, cause_specific_mortality, excess_mortality |
| risk_factor | exposure, exposure_standard_deviation, exposure_distribution_weights, relative_risk, population_attributable_fraction, mediation_factors |
| alternative_risk_factor | exposure, exposure_standard_deviation, exposure_distribution_weights |
| etiology | population_attributable_fraction |
| covariate | estimate |
| healthcare_entity | cost, utilization |
| health_technology | cost |
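If you want to check a measure name before starting a memory-heavy pull, the table above can be transcribed into a lookup. The sketch below is a convenience for illustration only: SIM_PREPPED_MEASURES and is_listed are hypothetical names, not part of vivarium_inputs, and the contents simply mirror the table.

```python
# Measures listed in the table above for simulation-prepped pulls, by entity kind.
SIM_PREPPED_MEASURES = {
    "sequela": {"incidence", "prevalence", "birth_prevalence", "disability_weight"},
    "cause": {
        "incidence", "prevalence", "birth_prevalence", "disability_weight",
        "remission", "cause_specific_mortality", "excess_mortality",
    },
    "risk_factor": {
        "exposure", "exposure_standard_deviation", "exposure_distribution_weights",
        "relative_risk", "population_attributable_fraction", "mediation_factors",
    },
    "alternative_risk_factor": {
        "exposure", "exposure_standard_deviation", "exposure_distribution_weights",
    },
    "etiology": {"population_attributable_fraction"},
    "covariate": {"estimate"},
    "healthcare_entity": {"cost", "utilization"},
    "health_technology": {"cost"},
}


def is_listed(entity_kind, measure):
    """Return True if the table lists `measure` for `entity_kind` (hypothetical helper)."""
    return measure in SIM_PREPPED_MEASURES.get(entity_kind, set())


print(is_listed("cause", "excess_mortality"))  # True
print(is_listed("etiology", "prevalence"))     # False
```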
Population Structure Data
To pull population data for a specific location, vivarium_inputs.interface
provides get_population_structure,
which returns population data in the input format expected by a simulation.
For example, to pull population data for Kenya, we would do the following:
from vivarium_inputs import get_population_structure
pop = get_population_structure(location='Kenya')
print(pop.head())
                                                        value
location sex    age_start age_end  year_start year_end
Kenya    Female 0.000000  0.019178 2021       2022      10995.345135
                0.019178  0.076712 2021       2022      32740.129897
                0.076712  0.500000 2021       2022      241157.325386
                0.500000  1.000000 2021       2022      283195.389282
                1.000000  2.000000 2021       2022      575233.481802
Life Expectancy Data
To pull life expectancy data, vivarium_inputs.interface provides
get_theoretical_minimum_risk_life_expectancy,
which returns life expectancy data in the input format expected by a simulation.
Because life expectancy is not location specific, the function takes no arguments.
To use:
from vivarium_inputs import get_theoretical_minimum_risk_life_expectancy
life_exp = get_theoretical_minimum_risk_life_expectancy()
print(life_exp.head())
                   value
age_start age_end
0.00      0.01     89.958040
0.01      0.02     89.975474
0.02      0.03     89.990990
0.03      0.04     89.985077
0.04      0.05     89.979164
Age Bin Data
To see the age bins that GBD uses for age-specific data, vivarium_inputs
provides get_age_bins, which returns
the start, end, and name of each GBD age bin expected to appear in age-specific data
(with the exception of life expectancy, which uses its own age ranges).
from vivarium_inputs import get_age_bins
age_bins = get_age_bins()
print(age_bins.reset_index().head())
   age_start  age_end   age_group_name
0  0.000000   0.019178  Early Neonatal
1  0.019178   0.076712  Late Neonatal
2  0.076712   0.500000  1-5 months
3  0.500000   1.000000  6-11 months
4  1.000000   2.000000  12 to 23 months
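The fractional boundaries in the output above read more naturally once you treat them as ages in years. The sketch below rests on two assumptions that are not stated in this document: that the early neonatal bin covers the first 7 days of life and the late neonatal bin runs up to day 28, and that a year is taken as 365 days. Under those assumptions the arithmetic reproduces the printed boundaries. The helper days_to_years is hypothetical.

```python
DAYS_PER_YEAR = 365  # assumption: a 365-day year


def days_to_years(days):
    """Convert an age in days to years, rounded to six places like the printed output."""
    return round(days / DAYS_PER_YEAR, 6)


# (bin name, first day of the bin, first day of the next bin); day ranges are assumed.
neonatal_bins = [("Early Neonatal", 0, 7), ("Late Neonatal", 7, 28)]

for name, start_days, end_days in neonatal_bins:
    print(name, days_to_years(start_days), days_to_years(end_days))
# Early Neonatal 0.0 0.019178
# Late Neonatal 0.019178 0.076712
```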
Demographic Dimensions Data
Finally, to view the full extent of all demographic dimensions that are expected
in input data to the simulation, vivarium_inputs provides
get_demographic_dimensions,
which expects a location argument to fill the location dimension.
from vivarium_inputs import get_demographic_dimensions
dem_dims = get_demographic_dimensions(location='Kenya')
print(dem_dims.reset_index().head())
   location  sex     age_start  age_end   year_start  year_end
0  Kenya     Female  0.000000   0.019178  2021        2022
1  Kenya     Female  0.019178   0.076712  2021        2022
2  Kenya     Female  0.076712   0.500000  2021        2022
3  Kenya     Female  0.500000   1.000000  2021        2022
4  Kenya     Female  1.000000   2.000000  2021        2022
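One way to picture the "full extent" of the demographic dimensions is as every combination of location, sex, age bin, and year range. The toy sketch below builds such combinations with itertools.product. The dimension values are a small hand-picked subset chosen for illustration (including the assumption that the sex dimension contains both Female and Male), and this is not a description of how vivarium_inputs builds its result.

```python
from itertools import product

# Toy dimension values; the real dimensions contain many more age bins.
locations = ["Kenya"]
sexes = ["Female", "Male"]  # assumption: both sexes appear in the full output
age_bins = [(0.0, 0.019178), (0.019178, 0.076712)]  # first two bins only
years = [(2021, 2022)]

# One row per combination, flattened into the column order shown above.
rows = [
    (location, sex, age_start, age_end, year_start, year_end)
    for location, sex, (age_start, age_end), (year_start, year_end)
    in product(locations, sexes, age_bins, years)
]

print(len(rows))  # 1 location * 2 sexes * 2 age bins * 1 year range = 4
print(rows[0])    # ('Kenya', 'Female', 0.0, 0.019178, 2021, 2022)
```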
Pulling Raw GBD Data
The interface provides get_raw_data,
which can be used to pull entity-measure data as well as population structure and life
expectancy. Raw validation checks are skipped so that the returned data can
be investigated for oddities. The only filtering applied is by applicable
measure id or metric id, or to the most detailed causes where relevant. No formatting
or reshaping of the data is done. The following sections detail how to pull each
type of data.
Entity-Measure Data
The interface provides get_raw_data
for pulling specific raw measure data for an entity for a single location from GBD,
without the prep work that occurs on data for a simulation.
entity should be a gbd_mapping.base_template.ModelableEntity (e.g.,
a cause from gbd_mapping), while measure should be a string
describing the measure for which you want to retrieve data (e.g., ‘prevalence’
or ‘relative_risk’). A list of possible measures for each entity
kind is included in the table below. Finally, location should be a string naming
the location for which you want to pull data (e.g., ‘Ethiopia’), written in the form
used by GBD (e.g., ‘United States’ instead of ‘USA’).
For example, to pull draw-level raw prevalence data for diarrheal diseases in Kenya, we would do the following:
from gbd_mapping import causes
from vivarium_inputs import get_raw_data
prev = get_raw_data(
    entity=causes.diarrheal_diseases,
    measure='prevalence',
    location='Kenya',
    data_type="draws",
)
print(prev.head())
    age_group_id  cause_id  draw_0    ...  year_id  metric_id  version_id
50  2             302       0.018762  ...  2021     3          1471
51  3             302       0.041142  ...  2021     3          1471
52  6             302       0.014616  ...  2021     3          1471
53  7             302       0.023237  ...  2021     3          1471
54  8             302       0.024702  ...  2021     3          1471
The following table lists the measures available for each entity kind for pulling raw data:
| Entity Kind | Measures |
|---|---|
| sequela | incidence, prevalence, birth_prevalence, disability_weight |
| cause | incidence, prevalence, birth_prevalence, disability_weight, remission, deaths |
| risk_factor | exposure, exposure_standard_deviation, exposure_distribution_weights, relative_risk, population_attributable_fraction, mediation_factors |
| alternative_risk_factor | exposure, exposure_standard_deviation, exposure_distribution_weights |
| etiology | population_attributable_fraction |
| covariate | estimate |
| healthcare_entity | cost, utilization |
| health_technology | cost |
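Comparing this table with the one for simulation-prepped data, only the cause row differs. The short sketch below transcribes the two cause rows and uses set differences to make the gap explicit. The variable names are hypothetical and the sets simply mirror the two tables.

```python
# Cause measures as listed in the simulation-prepped table.
prepped_cause_measures = {
    "incidence", "prevalence", "birth_prevalence", "disability_weight",
    "remission", "cause_specific_mortality", "excess_mortality",
}

# Cause measures as listed in the raw-data table.
raw_cause_measures = {
    "incidence", "prevalence", "birth_prevalence", "disability_weight",
    "remission", "deaths",
}

# Available only through get_measure:
print(sorted(prepped_cause_measures - raw_cause_measures))
# ['cause_specific_mortality', 'excess_mortality']

# Available only through get_raw_data:
print(sorted(raw_cause_measures - prepped_cause_measures))
# ['deaths']
```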
Population Structure Data
To pull raw population data for a specific location, we will actually use the same
get_raw_data function we used for
pulling entity-measure data, with a special Population entity.
For example, to pull population data for Kenya, we would do the following:
from vivarium_inputs import get_raw_data
from vivarium_inputs.globals import Population
pop = get_raw_data(entity=Population(), measure='structure', location='Kenya')
print(pop.head())
   age_group_id  location_id  year_id  sex_id  population    run_id
0  2             180          2021     1       1.145138e+04  359
1  3             180          2021     1       3.402961e+04  359
2  6             180          2021     1       3.187225e+06  359
3  7             180          2021     1       3.264795e+06  359
4  8             180          2021     1       2.997167e+06  359
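The raw output above is keyed by GBD IDs, whereas simulation-prepped data has those IDs replaced with meaningful values or ranges. The sketch below illustrates that idea on the first two raw rows. It is not the library's implementation: label_row is a hypothetical helper, and the lookups are assumptions (sex_id 1 as Male and 2 as Female following the usual GBD convention, age_group_id 2 and 3 as the early and late neonatal bins, and each year_id expanded to a one-year range).

```python
# Assumed lookups for illustration; vivarium_inputs resolves IDs internally.
SEX_NAMES = {1: "Male", 2: "Female"}
AGE_BINS = {2: (0.0, 0.019178), 3: (0.019178, 0.076712)}

# First two rows of the raw population output shown above.
raw_rows = [
    {"age_group_id": 2, "location_id": 180, "year_id": 2021, "sex_id": 1, "population": 1.145138e+04},
    {"age_group_id": 3, "location_id": 180, "year_id": 2021, "sex_id": 1, "population": 3.402961e+04},
]


def label_row(row, location_name):
    """Replace GBD IDs in one raw row with readable values (hypothetical helper)."""
    age_start, age_end = AGE_BINS[row["age_group_id"]]
    return {
        "location": location_name,
        "sex": SEX_NAMES[row["sex_id"]],
        "age_start": age_start,
        "age_end": age_end,
        "year_start": row["year_id"],
        "year_end": row["year_id"] + 1,
        "value": row["population"],
    }


# The pull was for Kenya, so the location name is supplied by the caller.
labeled = [label_row(row, "Kenya") for row in raw_rows]
print(labeled[0])
```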
Life Expectancy Data
Similarly, to pull life expectancy data, we will use the same
get_raw_data function with the
special Population entity. Life expectancy data is not location-specific, so we’ll
just use the ‘Global’ location.
To use:
from vivarium_inputs import get_raw_data
from vivarium_inputs.globals import Population
life_exp = get_raw_data(Population(), 'theoretical_minimum_risk_life_expectancy', 'Global')
print(life_exp.head())
   age   life_expectancy
0  0.00  89.958040
1  0.01  89.975474
2  0.02  89.990990
3  0.03  89.985077
4  0.04  89.979164