Lookup Tables
Simulations tend to require a large quantity of data to run. A completely
reasonable way to look at a simulation is to think of it as a task of
getting the right data and the right random numbers in the appropriate
place at the appropriate time. To address the first concern,
vivarium provides the
Lookup Table abstraction
to ensure that the right data can be retrieved when it’s needed. In
particular, it attempts to wrap different strategies for constructing
interpolations or distributions on data such that a user simply needs to
request values for a set of simulants when they’re needed. This idea is
extended to compositions of several data-based values by vivarium’s
values system.
The Lookup Table
A Lookup Table
for a quantity is a callable object that is built from
a scalar value or a pandas.DataFrame of data points that describes
how the quantity varies with other variable(s). It is called with a
pandas.Index as a function parameter which represents a simulated
population as discussed in the population concept.
When called, the Lookup Table
simply returns appropriate values of the quantity for the population it was
passed, interpolating if necessary or extrapolating if configured to. This
behavior represents the standard interface for asking for data about a
population in a simulation.
The lookup table system is built in layers. At the top is the
Lookup Table object which
is responsible for providing a uniform interface to the user regardless
of the underlying data. From the user’s perspective, it takes in a data set
or scalar value on initialization and then lets them query against that data
with a population index.
At initialization time, the
Lookup Table examines the
provided data and configures itself accordingly. If the data is a scalar value
(or list/tuple of scalars), the table simply broadcasts those values over the
population index when called. If the data is a pandas.DataFrame, the
table delegates to an
Interpolation
object that handles both categorical and continuous parameter lookups. The
Interpolation
groups the data by any categorical (key) columns and then, for each group,
finds the correct bin for any continuous parameters. Tables with only
categorical parameters are simply the special case where there are no
continuous parameters to bin on.
Note
The Interpolation name is somewhat of a misnomer. For order 0
(the only currently supported order), the operation is really
disaggregation – finding the correct bin a value belongs to rather
than interpolating between points. If the system is extended to work
with point estimates for continuous parameters, then interpolation
would appropriately describe the operation.
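The order 0 behavior described above can be sketched in plain pandas. This is only an illustration of the idea (group by the key column, then find the bin for the continuous parameter), not vivarium's actual Interpolation code. The data, the population table, the function name lookup_order_zero, and the choice of left-closed bins are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical binned data: one key column (sex), one binned parameter (age).
data = pd.DataFrame(
    {
        "sex": ["Male", "Male", "Female", "Female"],
        "age_start": [0.0, 40.0, 0.0, 40.0],
        "age_end": [40.0, 100.0, 40.0, 100.0],
        "value": [1.0, 2.0, 3.0, 4.0],
    }
)

# Hypothetical population state table, indexed by simulant.
population = pd.DataFrame(
    {"sex": ["Female", "Male", "Male"], "age": [12.5, 40.0, 39.9]},
    index=pd.Index([0, 1, 2], name="simulant"),
)


def lookup_order_zero(data, population, index):
    """Return the value of the bin that each simulant in `index` falls into."""
    pop = population.loc[index]
    result = pd.Series(index=index, dtype=float, name="value")
    # Group simulants by the categorical key, then bin on the continuous parameter.
    for sex, group in pop.groupby("sex"):
        bins = data[data["sex"] == sex].sort_values("age_start")
        # Assumption: bins are left-closed, i.e. age_start <= age < age_end.
        intervals = pd.IntervalIndex.from_arrays(
            bins["age_start"], bins["age_end"], closed="left"
        )
        positions = intervals.get_indexer(group["age"])
        result.loc[group.index] = bins["value"].to_numpy()[positions]
    return result


values = lookup_order_zero(data, population, population.index)
print(values.tolist())  # [3.0, 2.0, 1.0]
```

No value is computed between data points here. Each simulant simply receives the value of the bin it belongs to, which is why the note above calls the operation disaggregation.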
More information about the value production strategies can be found here.
Construction Parameters
A lookup table is defined for a set of categorical variables, continuous variables, and the values that depend on those variables. The lookup table calls these variables keys, parameters, and values, respectively.
- key
A categorical variable, such as sex, that a quantity depends on.
- parameter
A continuous variable, such as age, that a quantity depends on. This data frequently represents bins for which values are defined.
- value
Known values of the quantity of interest, which vary with the keys and parameters.
Along with data about these variables, a lookup table is instantiated with the
corresponding column names which are used to query an internal
population view
when the table itself is called. This means the lookup table only needs to be
called with a population index – it gathers the population information it
needs itself. It also means the data must be available in the
population state table with the same column name.
In the table below is an example of (unrealistic) data that could be used to create a lookup table for a quantity of interest about a population, in this case, Body Mass Index (BMI). We may find ourselves in a situation where we want to know the BMI of a simulant in order to make a treatment decision. If we construct a lookup table with these data, we can cleanly get the information we want and go on implementing our treatment. When called, the lookup table will return values of BMI for the simulants defined by the population index.
| Key | Parameter | Parameter | Value |
|---|---|---|---|
| sex | age_start | age_end | BMI |
| Male | 0 | 20 | 20 |
| Male | 20 | 40 | 25 |
| Male | 40 | 60 | 30 |
| Male | 60 | 100 | 27 |
| Female | 0 | 20 | 20 |
| Female | 20 | 40 | 25 |
| Female | 40 | 60 | 30 |
| Female | 60 | 100 | 27 |
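For readers who want to follow along, the table above could be assembled into a pandas.DataFrame roughly as follows. The variable names age_bins and data are illustrative choices; only the column names and values come from the table.

```python
import pandas as pd

# (age_start, age_end, BMI) for each age bin; the same bins apply to both sexes.
age_bins = [(0, 20, 20), (20, 40, 25), (40, 60, 30), (60, 100, 27)]

data = pd.DataFrame(
    [
        {"sex": sex, "age_start": start, "age_end": end, "BMI": bmi}
        for sex in ("Male", "Female")
        for start, end, bmi in age_bins
    ]
)

print(data.shape)  # (8, 4)
```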
Constructing Lookup Tables from a Component
Components can build lookup tables as needed via the build_lookup_table()
method which will refer to the data_sources block in the component’s
configuration_defaults property. As a basic example,
DiseaseModel in vivarium_public_health has the following data_sources configuration:
@property
def configuration_defaults(self) -> dict[str, Any]:
    return {
        f"{self.name}": {
            "data_sources": {
                "cause_specific_mortality_rate": self.load_cause_specific_mortality_rate,
            },
        },
    }
which specifies that when building a lookup table named
cause_specific_mortality_rate, the data should be provided by the component’s
load_cause_specific_mortality_rate method.
Each entry in data_sources maps a table name to a data source from one of
several supported types (see Data Source Types).
This approach allows users to override data sources in model specification files
without modifying component code. Following the example above, a model specification could adjust the
cause_specific_mortality_rate data source to point to different data or a scalar value:
configuration:
    disease_model:
        data_sources:
            cause_specific_mortality_rate: 0.02
Data Source Types
Each entry in data_sources maps a table name to a data source. The
following data source types are supported:
- Artifact key (string without ::): A string path to data in the artifact, e.g., "cause.all_causes.cause_specific_mortality_rate". The data is loaded via builder.data.load(). Strings with :: are reserved for method or function references (see below).
- Callable: Any callable (function, lambda, or bound method) that accepts a builder argument and returns the data.
- Scalar value: A numeric value (int, float), datetime, or timedelta that will be broadcast over the population index when the table is called.
- Method reference (string with self::): A string of the form "self::method_name" that references a method on the component itself. The method should accept a builder argument and return the data. This is primarily for use in model specification YAML files where direct method references are not possible.
- External function reference (string with module.path::): A string of the form "module.path::function_name" that references a function in another module. The function should accept a builder argument and return the data. This is primarily for use in model specification YAML files where direct method references are not possible.
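The rules in the list above can be summarized as a small classification function. This is a hypothetical sketch written for this page, not vivarium's own dispatch code. The function name classify_data_source and the returned labels are invented for illustration.

```python
from datetime import datetime, timedelta


def classify_data_source(source):
    """Name the data source type that a `data_sources` entry falls under."""
    if callable(source):
        return "callable"
    if isinstance(source, str):
        if "::" not in source:
            # Plain strings are treated as artifact keys.
            return "artifact key"
        prefix, _, _name = source.partition("::")
        # "self::..." points at the component; anything else names a module.
        if prefix == "self":
            return "method reference"
        return "external function reference"
    if isinstance(source, (int, float, datetime, timedelta)):
        return "scalar value"
    raise TypeError(f"Unsupported data source: {source!r}")


print(classify_data_source("cause.all_causes.cause_specific_mortality_rate"))  # artifact key
print(classify_data_source("self::load_unmodeled_csmr"))  # method reference
print(classify_data_source(0.02))  # scalar value
```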
Column Detection
When building a lookup table from a pandas.DataFrame using data_sources,
the component automatically determines key columns, parameter columns, and value columns
based on the data structure:
- Value columns can be provided as an argument to build_lookup_table(). If value columns are not provided, it will default to "value".
- Parameter columns are detected by finding columns ending in _start that have corresponding _end columns (e.g., age_start/age_end).
- Key columns are all remaining columns that are neither value columns nor parameter bin edge columns.
See the Construction Parameters section for definitions of these column types.
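As a rough illustration of the detection rules above, the following sketch splits a list of column names into key, parameter, and value columns. It is not vivarium's implementation. The function name detect_columns is hypothetical, and reporting a parameter by its prefix (age rather than age_start/age_end) is an assumption of the example.

```python
def detect_columns(columns, value_columns=("value",)):
    """Split column names into key, parameter, and value columns."""
    columns = list(columns)
    value_columns = list(value_columns)
    # A parameter is any prefix that has both a `_start` and an `_end` column.
    parameter_columns = [
        col[: -len("_start")]
        for col in columns
        if col.endswith("_start") and col[: -len("_start")] + "_end" in columns
    ]
    bin_edge_columns = {
        f"{param}{suffix}"
        for param in parameter_columns
        for suffix in ("_start", "_end")
    }
    # Everything that is neither a value nor a bin edge is a key.
    key_columns = [
        col
        for col in columns
        if col not in value_columns and col not in bin_edge_columns
    ]
    return key_columns, parameter_columns, value_columns


keys, parameters, values = detect_columns(
    ["sex", "age_start", "age_end", "BMI"], value_columns=["BMI"]
)
print(keys, parameters, values)  # ['sex'] ['age'] ['BMI']
```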
Example: Writing a Component with Data Sources
A more complete example is reproduced from the Mortality component in vivarium_public_health:
from typing import Any

from vivarium import Component


class Mortality(Component):
    @property
    def configuration_defaults(self) -> dict[str, Any]:
        return {
            "mortality": {
                "data_sources": {
                    # Artifact key - loaded via builder.data.load()
                    "all_cause_mortality_rate": "cause.all_causes.cause_specific_mortality_rate",
                    # Callable (bound method) - called as self.load_unmodeled_csmr(builder)
                    "unmodeled_cause_specific_mortality_rate": self.load_unmodeled_csmr,
                    # Another artifact key
                    "life_expectancy": "population.theoretical_minimum_risk_life_expectancy",
                },
                "unmodeled_causes": [],
            },
        }
Example: Configuring Data Sources as a User
Users can override the default data sources in a model specification file. This allows changing where data comes from without modifying component code:
configuration:
    mortality:
        data_sources:
            # Override with a scalar value instead of artifact data
            all_cause_mortality_rate: 0.01
            # Point to a module function
            unmodeled_cause_specific_mortality_rate: "my_module.data::load_unmodeled_csmr"
            # Or point to different artifact data
            life_expectancy: "alternative.life_expectancy.data"
Using the Lookup Interface Directly
For cases not covered by data_sources, or when working in an interactive
context, you can build lookup tables directly using the builder’s lookup
interface.
Example Usage
The following is an example of creating and calling a lookup table in an
interactive setting using the data from
Construction Parameters above. The interface and process are the same when
integrating a lookup table into a component, which is primarily
how they are used. Assuming you have a valid simulation object named sim and
the data from the above table in a pandas.DataFrame named data, you
can construct a lookup table in the following way, using the interface from the builder.
You don’t have to provide a name for the table, but it is recommended to do so for clarity
and for ease of debugging. If you don’t provide value column names, it will default to
"value".
# value_columns names the column holding the values; it defaults to "value" if omitted
> bmi = sim.builder.lookup.build_table(data, name="bmi", value_columns="BMI")
> pop_index = sim.get_population_index()
> bmi(pop_index).head() # returns BMI values for the population
0 20.0
1 20.0
2 30.0
3 27.0
4 25.0
Name: BMI, dtype: float64
Note
Constructing a lookup table currently requires your data meet specific conditions. These are a consequence of the method the lookup table uses to arrive at the correct data. Specifically, your parameter columns must represent bins and they must not overlap or have gaps.
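Before building a table, it can help to check the bin condition from the note above. The following is a minimal sketch of such a check for a single parameter within one key group. The function name check_bins is hypothetical and is not part of vivarium.

```python
def check_bins(bins):
    """Raise ValueError if (start, end) bins overlap or leave a gap."""
    ordered = sorted(bins)
    # Each bin must begin exactly where the previous one ended.
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        if next_start < prev_end:
            raise ValueError(f"Bins overlap at {next_start}")
        if next_start > prev_end:
            raise ValueError(f"Gap between {prev_end} and {next_start}")


# The age bins from the BMI table above pass the check silently.
check_bins([(0, 20), (20, 40), (40, 60), (60, 100)])
```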
Estimating Unknown Values
Interpolation
If a lookup table was constructed with a scalar value or values, the lookup
call trivially returns the same scalar(s) back for any population passed in.
However, if the lookup table was instead created with a
pandas.DataFrame of varying data, the lookup performs interpolation.
Interpolation is the process of estimating
values for unspecified parameters within the bounds of the parameters we have
defined in the lookup table. Currently, the most common case arises when the
values are binned by the parameters. Then, the interpolation simply finds the
correct bin a value belongs to. Please see the
interpolation concept note for more in-depth
information about the kinds of interpolation performed by the lookup table.
Extrapolation
Previously, we discussed interpolation as the process of estimating data within the bounds defined by our lookup table. What would happen if we wanted data outside of this range? Estimating such data is called extrapolation, and it can be performed using a lookup table as well. Extrapolation is a configurable option that, when enabled, allows a lookup table to provide values outside of the range it was created with. This is done by extending the edge points outwards to encompass outside points. This is a crude but useful strategy and is primarily used to run simulations beyond the time bounds included in the data, under the assumption that parameters do not change in the future.
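The edge-extension strategy can be illustrated for a single binned parameter as follows. This is a sketch under the assumption of sorted, contiguous, left-closed bins. The function name and the example rates are invented for illustration and do not reflect vivarium's internals.

```python
from bisect import bisect_right


def lookup_with_extrapolation(starts, ends, values, x, extrapolate=True):
    """Return the value of the bin containing x, extending edge bins if allowed."""
    lower, upper = starts[0], ends[-1]
    if x < lower or x >= upper:
        if not extrapolate:
            raise ValueError(f"{x} is outside the data range [{lower}, {upper})")
        # Treat the outermost bins as if they extended outwards indefinitely.
        return values[0] if x < lower else values[-1]
    # Within range: pick the last bin whose start is <= x.
    return values[bisect_right(starts, x) - 1]


# Hypothetical rates defined for 2000-2010 and 2010-2020.
year_starts, year_ends, rates = [2000, 2010], [2010, 2020], [0.02, 0.01]

print(lookup_with_extrapolation(year_starts, year_ends, rates, 2005))  # 0.02
print(lookup_with_extrapolation(year_starts, year_ends, rates, 2035))  # 0.01
```

A simulation running to 2035 keeps using the rate from the last available bin, which matches the assumption that parameters do not change in the future.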
Specifying Options in the Model Configuration
Configuring interpolation and extrapolation in a model specification is straightforward. Currently, the only acceptable value for order is 0. Extrapolation can be turned on and off.
configuration:
    interpolation:
        order: 0
        extrapolate: True