Artifact

Todo

This tutorial is very out of date and needs to be overhauled. The basic concepts are still correct, but the code examples are not.

A data artifact is a bundle of input data associated with a particular model. It is typically stored as an hdf file on disk with very particular formatting. This file is then used by the vivarium simulations to fill in all the relevant parameter data.

It is frequently useful to be able to view or modify this data outside the simulation. The vivarium.framework.artifact.artifact.Artifact provides a high level interface to do just that. In this tutorial we’ll go through how to view, delete, and write data to an artifact using the tools provided by the Artifact. You’ll access data in the artifact through keys, mirroring the underlying hdf storage of artifacts.

Creating an artifact

To view an existing hdf file via the Artifact tools, we’ll create a new artifact. We can print the resulting artifact to view the tree structure of the keys in our artifact. We’ll use our test artifact to illustrate:

from vivarium import Artifact

art = Artifact('test_artifact.hdf')
print(art)

Artifact containing the following keys:
metadata
        keyspace
        locations
        versions
population
        age_bins
        structure
        theoretical_minimum_risk_life_expectancy

Now we have an Artifact object, which we can use to interact with the data stored in the hdf file with which we created it.

Filter Terms

The data stored in artifacts may be large, potentially on the order of millions of rows for a single dataset, and loading a full dataset requires time and memory, both of which may be limited. If you are only interested in certain subsets of the data you may want to read only the portion you need. This is the idea behind filter terms.

Filter terms are built into an Artifact on its creation and apply to all data loaded from that Artifact. You can think of filter terms as somewhat similar to the pandas.DataFrame.query() method, although the key difference is that filter terms apply to what data is actually read off disk. This means that they can reduce the time and memory required to load a single dataset from an Artifact.

Filter terms should be specified as a list of strings, with each item in the list corresponding to a single filter. This allows multiple filters to be applied to a single Artifact. These terms are combined logically using ‘AND’, so filter terms of ['draw == 0', 'year_start > 2010', 'age_start < 5'] would mean only return rows with draw == 0 AND year_start > 2010 AND age_start < 5. Note that if some data stored in your Artifact does not contain the column or columns included in your filter terms, the non-applicable filter terms will be skipped for that data. So if a dataset in an Artifact created with the draw, year_start, and age_start filter terms only included a draw column, only draw == 0 would be applied to that data.

Here’s how we would construct an Artifact with the draw, year_start, and age_start filters we just described:

from vivarium import Artifact

art = Artifact('test_artifact.hdf', filter_terms=['draw == 0', 'year_start > 2005', 'age_start <= 5'])
print(art)

Artifact containing the following keys:
metadata
        keyspace
        locations
        versions
population
        age_bins
        structure
        theoretical_minimum_risk_life_expectancy

Note that the keys in the artifact are unchanged. The filter terms only affect data when it is loaded out of the artifact.

Keys

Artifacts store data under keys. Each key is of the form <type>.<name>.<measure>, e.g., “cause.all_causes.restrictions” or <type>.<measure>, e.g., “population.structure.” To view all keys in an artifact, use the keys attribute of the artifact:

art.keys

['metadata.keyspace', 'metadata.locations', 'metadata.versions', 'population.age_bins',
 'population.structure', 'population.theoretical_minimum_risk_life_expectancy']

Reading data

Now that we’ve seen how to create an Artifact object and view the underlying storage structure, let’s cover how to actually retrieve data from that artifact. We’ll use the load() method.

We saw the key names in our artifact in the previous step, and we’ll use those names to load data. For example, if we want to load the population structure data from our Artifact we do:

art = Artifact('test_artifact.hdf')
pop = art.load('population.structure')
print(pop.head()))

                                                           value
age_end  age_start location sex    year_end year_start
0.019178 0.0       Ethiopia Female 2007     2006        25610.50
                            Male   2012     2011        29136.66
                                   2009     2008        27492.91
                            Female 2000     1999        22157.50
                                   1993     1992        19066.45

Notice that if we construct our artifact with filter terms as discussed above, we’ll filter the data that gets loaded out of it:

art = Artifact('test_artifact.hdf', filter_terms=['age_start > 5'])
pop = art.load('population.structure')
print(pop.head()))

                                                            value
age_end age_start location sex    year_end year_start
15.0    10.0      Ethiopia Male   2011     2010        6009393.00
                                  2003     2002        4489336.99
                           Female 2016     2015        6424674.99
                           Male   2017     2016        6610845.00
                           Female 2006     2005        4922733.99

We can only load keys that already exist in the Artifact, however. If we try to load a key not present in our Artifact, we will get an error:

art.load('a.fake.key')

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/kate/code/vivarium/vivarium/src/vivarium/framework/artifact/artifact.py", line 75, in load
    raise ArtifactException(f"{entity_key} should be in {self.path}.")
vivarium.framework.artifact.ArtifactException: a.fake.key should be in tests/dataset_manager/artifact.hdf.

Writing data

To write new data to an artifact, use the write() method, passing the full key (in the string representation we saw above of type.name.measure or type.measure) and the data you wish to store.

new_data = ['United States', 'Washington', 'California']

art.write('locations.names', new_data)

if 'locations.names' in art:
    print('Successfully Added!')

Successfully Added!

What if the key we wish to write to is already present in the data? Let’s see what happens if we try to write again to the locations.names key we just wrote to. We get an error:

art.write('locations.names', ['New York', 'Florida'])

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/kate/code/vivarium/vivarium/src/vivarium/framework/artifact/artifact.py", line 105, in write
    raise ArtifactException(f'{entity_key} already in artifact.')
vivarium.framework.artifact.ArtifactException: locations.names already in artifact.

If the key you want to write to is already in the artifact, you’ll want to use the replace() method instead of write(). This allows you to replace the data in the artifact at the given key with the passed data.

updated_data = ['Texas', 'Oregon']
art.replace('locations.names', updated_data)
print(art.load('locations.names'))

['Texas', 'Oregon']

Removing data

Like load() and write(), remove() is based on keys. Pass the name of the key you wish to remove, and it will be deleted from the artifact and the underlying hdf file.

art.remove('locations.names')

if not 'locations.names' in art:
    print('Successfully Deleted!')

Successfully Deleted!