Lookup Tables

Simulations tend to require a large quantity of data to run. A completely reasonable way to look at a simulation is to think of it as a task of getting the right data and the right random numbers in the appropriate place at the appropriate time. To address the first concern, vivarium provides the Lookup Table abstraction to ensure that the right data can be retrieved when it’s needed. In particular, it wraps different strategies for constructing interpolations or distributions on data so that a user simply needs to request values for a set of simulants when they’re needed. This idea is extended to compositions of several data-based values by vivarium’s values system.

The Lookup Table

A Lookup Table for a quantity is a callable object that is built from a scalar value or a pandas.DataFrame of data points that describes how the quantity varies with other variable(s). It is called with a pandas.Index as a function parameter which represents a simulated population as discussed in the population concept. When called, the Lookup Table simply returns appropriate values of the quantity for the population it was passed, interpolating if necessary or extrapolating if configured to. This behavior represents the standard interface for asking for data about a population in a simulation.

The lookup table system is built in layers. At the top is the Lookup Table object which is responsible for providing a uniform interface to the user regardless of the underlying data. From the user’s perspective, it takes in a data set or scalar value on initialization and then lets them query against that data with a population index.

At initialization time, the Lookup Table examines the provided data and configures itself accordingly. If the data is a scalar value (or list/tuple of scalars), the table simply broadcasts those values over the population index when called. If the data is a pandas.DataFrame, the table delegates to an Interpolation object that handles both categorical and continuous parameter lookups. The Interpolation groups the data by any categorical (key) columns and then, for each group, finds the correct bin for any continuous parameters. Tables with only categorical parameters are simply the special case where there are no continuous parameters to bin on.
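The grouping and binning described above can be sketched in plain Python. This is only an illustration under stated assumptions, not vivarium's implementation: the rows, the helper names build_groups and lookup, and the choice of left-inclusive, right-exclusive bins are all hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict

# Hypothetical rows: (key, parameter_start, parameter_end, value).
ROWS = [
    ("Male", 0, 20, 20.0),
    ("Male", 20, 40, 25.0),
    ("Female", 0, 20, 21.0),
    ("Female", 20, 40, 26.0),
]


def build_groups(rows):
    """Group rows by the categorical key and sort each group's bins by left edge."""
    groups = defaultdict(list)
    for key, start, end, value in rows:
        groups[key].append((start, end, value))
    return {key: sorted(bins) for key, bins in groups.items()}


def lookup(groups, key, parameter):
    """Return the value of the bin [start, end) that contains `parameter`."""
    bins = groups[key]
    starts = [start for start, _, _ in bins]
    # Index of the last bin whose left edge is <= parameter.
    i = bisect_right(starts, parameter) - 1
    if i < 0 or parameter >= bins[i][1]:
        raise ValueError(f"{parameter} is outside the bins defined for {key!r}")
    return bins[i][2]


groups = build_groups(ROWS)
# A stand-in for a population: one (sex, age) pair per simulant.
population = [("Male", 5.0), ("Female", 20.0), ("Male", 39.9)]
values = [lookup(groups, sex, age) for sex, age in population]
```

With no continuous parameter there would be a single value per key, which is the categorical-only special case mentioned above.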

Note

The Interpolation name is somewhat of a misnomer. For order 0 (the only currently supported order), the operation is really disaggregation – finding the correct bin a value belongs to rather than interpolating between points. If the system is extended to work with point estimates for continuous parameters, then interpolation would appropriately describe the operation.

More information about the value production strategies can be found here.

Construction Parameters

A lookup table is defined for a set of categorical variables, continuous variables, and the values that depend on those variables. The lookup table calls these variables keys, parameters, and values, respectively.

key: A categorical variable, such as sex, that a quantity depends on.
parameter: A continuous variable, such as age, that a quantity depends on. This data frequently represents bins for which values are defined.
value: Known values of the quantity of interest, which vary with the keys and parameters.

Along with data about these variables, a lookup table is instantiated with the corresponding column names, which are used to query an internal population view when the table itself is called. This means the lookup table only needs to be called with a population index – it gathers the population information it needs itself. It also means the data must be available in the population state table with the same column name.

In the table below is an example of (unrealistic) data that could be used to create a lookup table for a quantity of interest about a population, in this case, Body Mass Index (BMI). We may find ourselves in a situation where we want to know the BMI of a simulant in order to make a treatment decision. If we construct a lookup table with these data, we can cleanly get the information we want and go on implementing our treatment. When called, the lookup table will return values of BMI for the simulants defined by the population index.

Key	Parameter		Value
sex	age_start	age_end	BMI
Male	0	20	20
Male	20	40	25
Male	40	60	30
Male	60	100	27
Female	0	20	20
Female	20	40	25
Female	40	60	30
Female	60	100	27

Constructing Lookup Tables from a Component

Components can build lookup tables as needed via the build_lookup_table() method which will refer to the data_sources block in the component’s configuration_defaults property. As a basic example, DiseaseModel in vivarium_public_health has the following data_sources configuration:

@property
def configuration_defaults(self) -> dict[str, Any]:
    return {
        f"{self.name}": {
            "data_sources": {
                "cause_specific_mortality_rate": self.load_cause_specific_mortality_rate,
            },
        },
    }

which specifies that when building a lookup table named cause_specific_mortality_rate, the data should be provided by the component’s load_cause_specific_mortality_rate method.

Each entry in data_sources maps a table name to a data source from one of several supported types (see Data Source Types).

This approach allows users to override data sources in model specification files without modifying component code. Following the example above, a model specification could adjust the cause_specific_mortality_rate data source to point to different data or a scalar value:

configuration:
    disease_model:
        data_sources:
            cause_specific_mortality_rate: 0.02

Data Source Types

Each entry in data_sources maps a table name to a data source. The following data source types are supported:

Artifact key (string without :: ): A string path to data in the artifact, e.g., "cause.all_causes.cause_specific_mortality_rate". The data is loaded via builder.data.load(). Strings with :: are reserved for method or function references (see below).
Callable: Any callable (function, lambda, or bound method) that accepts a builder argument and returns the data.
Scalar value: A numeric value (int, float), datetime, or timedelta that will be broadcast over the population index when the table is called.
Method reference (string with self:: ): A string of the form "self::method_name" that references a method on the component itself. The method should accept a builder argument and return the data. This is primarily for use in model specification YAML files where direct method references are not possible.
External function reference (string with module.path:: ): A string of the form "module.path::function_name" that references a function in another module. The function should accept a builder argument and return the data. This is primarily for use in model specification YAML files where direct function references are not possible.
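To make the distinctions in this list concrete, here is a rough sketch of how a data source value could be classified by type. The function classify_data_source and its return labels are hypothetical and are not part of vivarium; they only mirror the rules stated above.

```python
from datetime import datetime, timedelta


def classify_data_source(source):
    """Return a label for the kind of data source (hypothetical helper)."""
    if isinstance(source, str):
        if "::" not in source:
            # Plain strings are treated as artifact keys.
            return "artifact_key"
        prefix, _, _name = source.partition("::")
        # "self::name" refers to a component method; anything else to a module function.
        return "method_reference" if prefix == "self" else "function_reference"
    if callable(source):
        return "callable"
    if isinstance(source, (int, float, datetime, timedelta)):
        return "scalar"
    raise TypeError(f"Unsupported data source: {source!r}")


kinds = [
    classify_data_source("cause.all_causes.cause_specific_mortality_rate"),
    classify_data_source("self::load_unmodeled_csmr"),
    classify_data_source("my_module.data::load_unmodeled_csmr"),
    classify_data_source(lambda builder: 0.01),
    classify_data_source(0.02),
]
```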

Column Detection

When building a lookup table from a pandas.DataFrame using data_sources, the component automatically determines key columns, parameter columns, and value columns based on the data structure:

Value columns can be provided as an argument to build_lookup_table(). If value columns are not provided, they default to "value".
Parameter columns are detected by finding columns ending in _start that have corresponding _end columns (e.g., age_start/age_end).
Key columns are all remaining columns that are neither value columns nor parameter bin edge columns.

See the Construction Parameters section for definitions of these column types.
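The detection rules above can be approximated with a few lines of plain Python. This is a sketch under assumptions rather than the component's actual code: detect_columns is a hypothetical name, and reporting each parameter by its stem (e.g. "age" for age_start/age_end) is an illustrative choice.

```python
def detect_columns(columns, value_columns=("value",)):
    """Split column names into key, parameter, and value columns (hypothetical helper)."""
    value_columns = list(value_columns)
    parameter_columns = []
    bin_edges = set()
    for col in columns:
        if col.endswith("_start"):
            stem = col[: -len("_start")]
            # Only treat the pair as a parameter if the matching _end column exists.
            if f"{stem}_end" in columns:
                parameter_columns.append(stem)
                bin_edges.update({col, f"{stem}_end"})
    # Everything that is neither a value column nor a bin edge is a key column.
    key_columns = [
        col for col in columns if col not in value_columns and col not in bin_edges
    ]
    return key_columns, parameter_columns, value_columns


default_case = detect_columns(
    ["sex", "age_start", "age_end", "year_start", "year_end", "value"]
)
bmi_case = detect_columns(["sex", "age_start", "age_end", "BMI"], value_columns=["BMI"])
```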

Example: Writing a Component with Data Sources

A more complete example is reproduced from the Mortality component in vivarium_public_health:

from typing import Any

from vivarium import Component

class Mortality(Component):

    @property
    def configuration_defaults(self) -> dict[str, Any]:
        return {
            "mortality": {
                "data_sources": {
                    # Artifact key - loaded via builder.data.load()
                    "all_cause_mortality_rate": "cause.all_causes.cause_specific_mortality_rate",
                    # Callable (bound method) - calls self.load_unmodeled_csmr(builder)
                    "unmodeled_cause_specific_mortality_rate": self.load_unmodeled_csmr,
                    # Another artifact key
                    "life_expectancy": "population.theoretical_minimum_risk_life_expectancy",
                },
                "unmodeled_causes": [],
            },
        }

Example: Configuring Data Sources as a User

Users can override the default data sources in a model specification file. This allows changing where data comes from without modifying component code:

configuration:
    mortality:
        data_sources:
            # Override with a scalar value instead of artifact data
            all_cause_mortality_rate: 0.01
            # Or point to a module function
            unmodeled_cause_specific_mortality_rate: "my_module.data::load_unmodeled_csmr"
            # Or point to different artifact data
            life_expectancy: "alternative.life_expectancy.data"

Using the Lookup Interface Directly

For cases not covered by data_sources, or when working in an interactive context, you can build lookup tables directly using the builder’s lookup interface.

Example Usage

The following is an example of creating and calling a lookup table in an interactive setting using the data from Construction Parameters above. The interface and process are the same when integrating a lookup table into a component, which is primarily how they are used. Assuming you have a valid simulation object named sim and the data from the above table in a pandas.DataFrame named data, you can construct a lookup table in the following way, using the interface from the builder. You don’t have to provide a name for the table, but it is recommended to do so for clarity and for ease of debugging. If you don’t provide value column names, they default to "value"; because the value column in this data is named BMI, it is passed explicitly here.

  # the value column is named BMI rather than the default "value"
> bmi = sim.builder.lookup.build_table(data, name="bmi", value_columns="BMI")
> pop_index = sim.get_population_index()
> bmi(pop_index).head()  # returns BMI values for the population

  0     20.0
  1     20.0
  2     30.0
  3     27.0
  4     25.0
  Name: BMI, dtype: float64

Note

Constructing a lookup table currently requires your data meet specific conditions. These are a consequence of the method the lookup table uses to arrive at the correct data. Specifically, your parameter columns must represent bins and they must not overlap or have gaps.
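A quick way to check the condition in this note before building a table is to sort the bins and compare neighbouring edges. The helper below is a hypothetical sketch, not a vivarium function; it assumes each bin is a (start, end) pair for one parameter within one key group.

```python
def check_bins(bins):
    """Raise ValueError if (start, end) bins overlap or leave gaps (hypothetical helper)."""
    ordered = sorted(bins)
    for (start, end), (next_start, _) in zip(ordered, ordered[1:]):
        if next_start < end:
            raise ValueError(
                f"Bins overlap: [{start}, {end}) and the bin starting at {next_start}"
            )
        if next_start > end:
            raise ValueError(f"Gap between {end} and {next_start}")


# The age bins from the BMI example are contiguous, so this passes silently.
check_bins([(0, 20), (20, 40), (40, 60), (60, 100)])


def is_valid(bins):
    """Convenience wrapper returning True/False instead of raising."""
    try:
        check_bins(bins)
    except ValueError:
        return False
    return True
```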

Estimating Unknown Values

Interpolation

If a lookup table was constructed with a scalar value or values, the lookup call trivially returns the same scalar(s) for any population passed in. If it was instead created with a pandas.DataFrame of varying data, the lookup performs interpolation. Interpolation is the process of estimating values for unspecified parameters within the bounds of the parameters defined in the lookup table. Currently, the most common case arises when the values are binned by the parameters, in which case the interpolation simply finds the bin a value belongs to. Please see the interpolation concept note for more in-depth information about the kinds of interpolation performed by the lookup table.

Extrapolation

Previously, we discussed interpolation as the process of estimating data within the bounds defined by our lookup table. What would happen if we wanted data outside of this range? Estimating such data is called extrapolation, and it can be performed using a lookup table as well. Extrapolation is a configurable option that, when enabled, allows a lookup table to provide values outside of the range it was created with. This is done by extending the edge points outwards to encompass outside points. This is a naive but useful strategy and is primarily used to run simulations beyond the time bounds included in the data under the assumption that parameters do not change in the future.
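One way to picture "extending the edge points outwards" is to clamp an out-of-range parameter to the first or last bin. The sketch below assumes sorted, contiguous bins and uses a hypothetical find_bin helper with hypothetical year bins and rates; it is not vivarium's implementation.

```python
from bisect import bisect_right


def find_bin(starts, ends, parameter, extrapolate=False):
    """Return the index of the bin holding `parameter` (hypothetical helper).

    `starts` and `ends` describe sorted, contiguous bins [start, end).
    """
    i = bisect_right(starts, parameter) - 1
    if i >= 0 and parameter < ends[i]:
        return i
    if not extrapolate:
        raise ValueError(f"{parameter} is outside the range of the data")
    # Extend the edge bins outward: below the range maps to the first bin,
    # above the range maps to the last bin.
    return min(max(i, 0), len(starts) - 1)


# Hypothetical year bins and rates.
year_starts = [2000, 2010]
year_ends = [2010, 2020]
rates = [0.01, 0.02]

future_rate = rates[find_bin(year_starts, year_ends, 2035, extrapolate=True)]
past_rate = rates[find_bin(year_starts, year_ends, 1990, extrapolate=True)]
```

Under this scheme a simulation year of 2035 reuses the rate from the last bin, which matches the assumption that parameters do not change in the future.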

Specifying Options in the Model Configuration

Configuring interpolation and extrapolation in a model specification is straightforward. Currently, the only acceptable value for order is 0. Extrapolation can be turned on and off.

configuration:
    interpolation:
        order: 0
        extrapolate: True