Lookup Tables
Simulations tend to require a large quantity of data to run. A completely
reasonable way to look at a simulation is to think of it as a task of
getting the right data and the right random numbers in the appropriate
place at the appropriate time. To address the first concern, vivarium
provides the Lookup Table abstraction to ensure that the right data can be
retrieved when it’s needed. In particular, it attempts to wrap different
strategies for constructing interpolations or distributions on data such
that a user simply needs to request values for a set of simulants when
they’re needed. This idea is extended to compositions of several
data-based values by vivarium’s values system.
The Lookup Table
A Lookup Table for a quantity is a callable object that is built from
a scalar value or a pandas.DataFrame of data points that describes
how the quantity varies with other variable(s). It is called with a
pandas.Index that represents a simulated population, as discussed in the
population concept. When called, the Lookup Table returns the appropriate
values of the quantity for the population it was passed, interpolating if
necessary or extrapolating if configured to. This behavior represents the
standard interface for asking for data about a population in a simulation.
The lookup table system is built in layers. At the top is the Lookup Table
object, which is responsible for providing a uniform interface to the user
regardless of the underlying implementation. From the user’s perspective, it
takes in a data set or scalar value on initialization and then lets them query
against that data with a population index.
The next layer is selected at initialization time based on the type of data
provided. The Lookup Table picks a ScalarTable if a single value is provided
as the data, a CategoricalTable if a pandas.DataFrame with only categorical
variables is provided as the data, and an InterpolatedTable if a
pandas.DataFrame which has at least one continuous variable is provided as
the data.
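The selection rule above can be sketched in a few lines. This is only an illustration and not vivarium's implementation: `choose_table_kind` is a hypothetical helper, the rows are plain dicts rather than a pandas.DataFrame so the snippet runs without dependencies, and it assumes the caller names the continuous (parameter) columns explicitly.

```python
from numbers import Number


def choose_table_kind(data, parameter_columns=()):
    """Return the name of the table kind that fits the data (hypothetical helper)."""
    if isinstance(data, Number):
        # A single value: nothing to look up, it is simply broadcast.
        return "ScalarTable"
    if not parameter_columns:
        # Tabular data whose columns are all categorical.
        return "CategoricalTable"
    # Tabular data with at least one continuous variable.
    return "InterpolatedTable"


rows = [{"sex": "Male", "age_start": 0, "age_end": 20, "BMI": 20}]
print(choose_table_kind(5.0))
print(choose_table_kind([{"sex": "Male", "BMI": 20}]))
print(choose_table_kind(rows, parameter_columns=["age"]))
```

The point of the sketch is that the caller never chooses the table kind directly; it follows from the shape of the data.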
Note
The name InterpolatedTable is a misnomer here. It confuses the data handling
strategy with the underlying data representation. A better name would be
BinnedDataTable, to indicate that it wraps data where the continuous
parameters are represented by bin edges in the provided data. This would
allow us to easily think about and extend the lookup system to wrap data
where the continuous parameters are represented by points and to tables
where all parameters are categorical.
If the underlying data is a single value or consists only of categorical
variables, this is the last layer of abstraction. The ScalarTable and
CategoricalTable each have only one reasonable strategy, which is to
broadcast the value over the population index. If we have continuous
variables and therefore an InterpolatedTable, there are additional layers to
the lookup system that allow the user to control the strategy for turning the
population index into values based on the data. The InterpolatedTable is then
responsible for turning the population index into a set of attributes
relevant to the value production, based on the structure of the input data,
and then providing those attributes to the value production strategy.
Note
I’m being careful with language here. We have objects named Interpolation
and InterpolatedTable, though the operation they perform is actually
disaggregation. If we extend the system to work with point estimates for the
continuous parameters, then interpolation would appropriately describe what
we do. Both are value production strategies based on the structure of the
input data.
More information about the value production strategies can be found here.
Construction Parameters
A lookup table is defined for a set of categorical variables, continuous variables, and the values that depend on those variables. The lookup table calls these variables keys, parameters, and values, respectively.
- key
A categorical variable, such as sex, that a quantity depends on.
- parameter
A continuous variable, such as age, that a quantity depends on. This data frequently represents bins for which values are defined.
- value
Known values of the quantity of interest, which vary with the keys and parameters.
Along with data about these variables, a lookup table is instantiated with the
corresponding column names, which are used to query an internal population view
when the table itself is called. This means the lookup table only needs to be
called with a population index; it gathers the population information it
needs itself. It also means the data must be available in the
population state table with the same column name.
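To make the calling convention concrete, here is a minimal sketch of a table that depends only on keys. Everything in it is hypothetical and for illustration: `KeyedLookup` is not a vivarium class, the "population view" is a plain dict from simulant index to attributes, and the `risk` values are made up. It shows why the caller passes only an index and why the column names in the data must match those in the state table.

```python
class KeyedLookup:
    """Toy stand-in for a lookup table that depends only on keys (hypothetical)."""

    def __init__(self, data, key_columns, value_column, population_view):
        self.key_columns = key_columns
        self.population_view = population_view
        # Map each combination of key values to its value.
        self.values = {
            tuple(row[k] for k in key_columns): row[value_column] for row in data
        }

    def __call__(self, index):
        # Gather the needed attributes from the population view, then look up.
        result = {}
        for i in index:
            simulant = self.population_view[i]
            result[i] = self.values[tuple(simulant[k] for k in self.key_columns)]
        return result


# The state table must expose a column named exactly like the key column.
state_table = {0: {"sex": "Male"}, 1: {"sex": "Female"}, 2: {"sex": "Female"}}
data = [{"sex": "Male", "risk": 0.1}, {"sex": "Female", "risk": 0.2}]
risk = KeyedLookup(data, ["sex"], "risk", state_table)
print(risk([0, 2]))  # {0: 0.1, 2: 0.2}
```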
The table below gives an example of (unrealistic) data that could be used to create a lookup table for a quantity of interest about a population, in this case Body Mass Index (BMI). We may find ourselves in a situation where we want to know the BMI of a simulant in order to make a treatment decision. If we construct a lookup table with these data, we can cleanly get the information we want and go on implementing our treatment. When called, the lookup table will return values of BMI for the simulants defined by the population index.
Key | Parameter | Parameter | Value |
---|---|---|---|
sex | age_start | age_end | BMI |
Male | 0 | 20 | 20 |
Male | 20 | 40 | 25 |
Male | 40 | 60 | 30 |
Male | 60 | 100 | 27 |
Female | 0 | 20 | 20 |
Female | 20 | 40 | 25 |
Female | 40 | 60 | 30 |
Female | 60 | 100 | 27 |
Example Usage
The following is an example of creating and calling a lookup table in an
interactive setting using the data above. The interface and process are the
same when integrating a lookup table into a component, which is primarily how
they are used. Assuming you have a valid simulation object named sim and the
data from the above table in a pandas.DataFrame named data, you can construct
a lookup table in the following way, using the interface from the builder.
# value_columns implicitly set to remaining columns
> bmi = sim.builder.lookup.build_table(data, key_columns=['sex'], parameter_columns=['age'])
> population = sim.get_population()
> bmi(population.index).head() # returns BMI values for the population
0 20.0
1 20.0
2 30.0
3 27.0
4 25.0
Name: BMI, dtype: float64
Note
Constructing a lookup table currently requires that your data meet specific conditions. These are a consequence of the method the lookup table uses to arrive at the correct data. Specifically, your parameter columns must represent bins, and those bins must not overlap (adjacent bins may share an edge, as in the table above).
Estimating Unknown Values
Interpolation
If a lookup table was constructed with a scalar value or values, the lookup
call trivially returns the same scalar(s) back for any population passed in.
However, if the lookup table was instead created with a pandas.DataFrame of
varying data, the lookup performs interpolation. Interpolation is the process
of estimating values for unspecified parameters within the bounds of the
parameters we have defined in the lookup table. Currently, the most common
case arises when the values are binned by the parameters. In that case, the
interpolation simply finds the bin a value belongs to and returns that bin’s
value. Please see the interpolation concept note for more in-depth
information about the kinds of interpolation performed by the lookup table.
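For binned data, finding the correct bin can be sketched with a binary search over the bin starts. This is a minimal illustration rather than vivarium's code: `lookup_order_zero` is a hypothetical function, the numbers are the Male rows of the BMI table above, and treating each bin as half-open, [age_start, age_end), is an assumption made here so that shared edges are unambiguous.

```python
from bisect import bisect_right

# Bin edges and values for one key (sex == "Male") from the BMI table.
age_starts = [0, 20, 40, 60]
age_ends = [20, 40, 60, 100]
bmi_values = [20, 25, 30, 27]


def lookup_order_zero(age):
    """Return the value of the bin containing `age` (assumed bins: [start, end))."""
    # Index of the last bin whose start is <= age.
    position = bisect_right(age_starts, age) - 1
    if position < 0 or age >= age_ends[position]:
        raise ValueError(f"age {age} is outside the range covered by the data")
    return bmi_values[position]


print([lookup_order_zero(age) for age in (5, 20, 59.9, 60)])  # [20, 25, 30, 27]
```

No estimation between neighboring values takes place: every age inside a bin receives the same value, which is what an order of 0 means here.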
Extrapolation
Previously, we discussed interpolation as the process of estimating data within the bounds defined by our lookup table. What would happen if we wanted data outside of this range? Estimating such data is called extrapolation, and it can be performed using a lookup table as well. Extrapolation is a configurable option that, when enabled, allows a lookup table to provide values outside of the range it was created with. This is done by extending the edge bins outwards so that they encompass the outside points. This is a crude but useful strategy. It is primarily used to run simulations beyond the time bounds included in the data, under the assumption that parameters do not change in the future.
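Extending the edge bins outward amounts to clamping an out-of-range query to the first or last bin. The sketch below illustrates that idea under stated assumptions: the `lookup` function, the year bins, and the values are all hypothetical, and the bins are assumed to be contiguous and half-open. It is not vivarium's implementation.

```python
from bisect import bisect_right

# Hypothetical data binned by year; values are made up for illustration.
year_starts = [2000, 2005, 2010]
year_ends = [2005, 2010, 2015]
values = [1.0, 1.5, 2.0]


def lookup(year, extrapolate=True):
    """Return the value for `year`, optionally extending the edge bins outward."""
    position = bisect_right(year_starts, year) - 1
    inside = position >= 0 and year < year_ends[position]
    if not inside:
        if not extrapolate:
            raise ValueError(f"year {year} is outside the data and extrapolation is off")
        # Clamp to the first bin (query too early) or last bin (query too late).
        position = min(max(position, 0), len(values) - 1)
    return values[position]


print(lookup(1990), lookup(2007), lookup(2030))  # 1.0 1.5 2.0
```

A simulation running to 2030 on this data would keep using the 2010 to 2015 value, which is exactly the "parameters do not change in the future" assumption.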
Specifying Options in the Model Configuration
Configuring interpolation and extrapolation in a model specification is straightforward. Currently, the only acceptable value for order is 0. Extrapolation can be turned on and off.
configuration:
    interpolation:
        order: 0
        extrapolate: True