========
Artifact
========

.. todo::

   This tutorial is very out of date and needs to be overhauled. The basic
   concepts are still correct, but the code examples are not.

A data artifact is a bundle of input data associated with a particular model.
It is typically stored on disk as an ``hdf`` file with very particular
formatting. The :mod:`vivarium` simulations then use this file to fill in all
the relevant parameter data. It is frequently useful to be able to view or
modify this data outside the simulation. The
:class:`vivarium.framework.artifact.artifact.Artifact` class provides a
high-level interface to do just that.

In this tutorial we'll go through how to view, delete, and write data to an
artifact using the tools provided by the
:class:`~vivarium.framework.artifact.artifact.Artifact`. You'll access data in
the artifact through keys, mirroring the underlying hdf storage of artifacts.

.. contents::
   :depth: 1
   :local:
   :backlinks: none

Creating an artifact
---------------------

To view an existing hdf file with the
:class:`~vivarium.framework.artifact.artifact.Artifact` tools, we first create
an ``Artifact`` object from that file. Printing the resulting object shows the
tree structure of the keys in the artifact. We'll use our test artifact to
illustrate:

.. code-block:: python

    from vivarium import Artifact

    art = Artifact('test_artifact.hdf')
    print(art)

::

    Artifact containing the following keys:
    metadata
        keyspace
        locations
        versions
    population
        age_bins
        structure
        theoretical_minimum_risk_life_expectancy

Now we have an :class:`~vivarium.framework.artifact.artifact.Artifact` object,
which we can use to interact with the data stored in the hdf file with which
we created it.

Filter Terms
+++++++++++++

The data stored in artifacts may be large, potentially on the order of
millions of rows for a single dataset, and loading a full dataset requires
time and memory, both of which may be limited.
If you are only interested in certain subsets of the data, you may want to
read only the portion you need. This is the idea behind filter terms. Filter
terms are built into an
:class:`~vivarium.framework.artifact.artifact.Artifact` on its creation and
apply to all data loaded from that Artifact. You can think of filter terms as
somewhat similar to the :meth:`pandas.DataFrame.query` method. The key
difference is that filter terms determine what data is actually read off
disk, so they can reduce the time and memory required to load a dataset from
an Artifact.

Filter terms should be specified as a list of strings, with each item in the
list corresponding to a single filter. This allows multiple filters to be
applied to a single Artifact. The terms are combined with a logical ``AND``,
so filter terms of ``['draw == 0', 'year_start > 2010', 'age_start < 5']``
return only rows with ``draw == 0 AND year_start > 2010 AND age_start < 5``.

Note that if some data stored in your Artifact does not contain the column or
columns included in your filter terms, the non-applicable filter terms will be
skipped for that data. So if a dataset in an Artifact created with the
``draw``, ``year_start``, and ``age_start`` filter terms only included a
``draw`` column, only ``draw == 0`` would be applied to that data.

Here's how we would construct an Artifact with the ``draw``, ``year_start``,
and ``age_start`` filters we just described:

.. code-block:: python

    from vivarium import Artifact

    # Each string is one filter; all of them are combined with AND.
    art = Artifact('test_artifact.hdf',
                   filter_terms=['draw == 0', 'year_start > 2010', 'age_start < 5'])
    print(art)

::

    Artifact containing the following keys:
    metadata
        keyspace
        locations
        versions
    population
        age_bins
        structure
        theoretical_minimum_risk_life_expectancy

Note that the keys in the artifact are unchanged. The filter terms only affect
data when it is loaded out of the artifact.

Keys
+++++

Artifacts store data under keys.
Each key is of the form ``<type>.<name>.<measure>``, e.g.,
``cause.all_causes.restrictions``, or ``<type>.<measure>``, e.g.,
``population.structure``. To view all keys in an artifact, use the ``keys``
attribute of the artifact:

.. code-block:: python

    art.keys

::

    ['metadata.keyspace',
     'metadata.locations',
     'metadata.versions',
     'population.age_bins',
     'population.structure',
     'population.theoretical_minimum_risk_life_expectancy']

Reading data
-------------

Now that we've seen how to create an
:class:`~vivarium.framework.artifact.artifact.Artifact` object and view the
underlying storage structure, let's cover how to actually retrieve data from
that artifact. We'll use the
:meth:`~vivarium.framework.artifact.artifact.Artifact.load` method.

We saw the key names in our artifact in the previous step, and we'll use those
names to load data. For example, to load the population structure data from
our Artifact:

.. code-block:: python

    art = Artifact('test_artifact.hdf')
    pop = art.load('population.structure')
    print(pop.head())

::

                                                                    value
    age_end   age_start  location  sex     year_end  year_start
    0.019178  0.0        Ethiopia  Female  2007      2006        25610.50
                                   Male    2012      2011        29136.66
                                           2009      2008        27492.91
                                   Female  2000      1999        22157.50
                                           1993      1992        19066.45

Notice that if we construct our artifact with filter terms as discussed above,
the data loaded out of it is filtered accordingly:

.. code-block:: python

    art = Artifact('test_artifact.hdf', filter_terms=['age_start > 5'])
    pop = art.load('population.structure')
    print(pop.head())

::

                                                                  value
    age_end  age_start  location  sex     year_end  year_start
    15.0     10.0       Ethiopia  Male    2011      2010        6009393.00
                                          2003      2002        4489336.99
                                  Female  2016      2015        6424674.99
                                  Male    2017      2016        6610845.00
                                  Female  2006      2005        4922733.99

We can only load keys that already exist in the Artifact, however. If we try
to load a key not present in our Artifact, we will get an error:

.. code-block:: python

    art.load('a.fake.key')

::

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/home/kate/code/vivarium/vivarium/src/vivarium/framework/artifact/artifact.py", line 75, in load
        raise ArtifactException(f"{entity_key} should be in {self.path}.")
    vivarium.framework.artifact.ArtifactException: a.fake.key should be in tests/dataset_manager/artifact.hdf.

Writing data
------------

To write new data to an artifact, use the
:meth:`~vivarium.framework.artifact.artifact.Artifact.write` method, passing
the full key (in the string representation we saw above of
``type.name.measure`` or ``type.measure``) and the data you wish to store.

.. code-block:: python

    new_data = ['United States', 'Washington', 'California']
    art.write('locations.names', new_data)

    # Membership tests on the artifact check whether a key is present.
    if 'locations.names' in art:
        print('Successfully Added!')

::

    Successfully Added!

What if the key we wish to write to is already present in the artifact? Let's
see what happens if we try to write again to the ``locations.names`` key we
just wrote to. We get an error:

.. code-block:: python

    art.write('locations.names', ['New York', 'Florida'])

::

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/home/kate/code/vivarium/vivarium/src/vivarium/framework/artifact/artifact.py", line 105, in write
        raise ArtifactException(f'{entity_key} already in artifact.')
    vivarium.framework.artifact.ArtifactException: locations.names already in artifact.

If the key you want to write to is already in the artifact, use the
:meth:`~vivarium.framework.artifact.artifact.Artifact.replace` method instead
of :meth:`~vivarium.framework.artifact.artifact.Artifact.write`. It replaces
the data stored in the artifact at the given key with the data you pass.

.. code-block:: python

    updated_data = ['Texas', 'Oregon']
    art.replace('locations.names', updated_data)
    print(art.load('locations.names'))

::

    ['Texas', 'Oregon']

Removing data
-------------

Like :meth:`~vivarium.framework.artifact.artifact.Artifact.load` and
:meth:`~vivarium.framework.artifact.artifact.Artifact.write`,
:meth:`~vivarium.framework.artifact.artifact.Artifact.remove` is based on
keys. Pass the name of the key you wish to remove, and it will be deleted from
the artifact and the underlying hdf file.

.. code-block:: python

    art.remove('locations.names')

    if 'locations.names' not in art:
        print('Successfully Deleted!')

::

    Successfully Deleted!
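
As a recap of the key scheme used throughout this tutorial, the following
minimal sketch splits dotted key strings into their parts and groups them by
type, which roughly mirrors the tree printed for an artifact. It uses only the
Python standard library and does not touch an hdf file. The helpers
``split_key`` and ``group_by_type`` are hypothetical names introduced for this
illustration; they are not part of :mod:`vivarium`, whose own key handling may
differ.

.. code-block:: python

    from collections import defaultdict


    def split_key(key):
        """Split 'type.name.measure' or 'type.measure' into its parts.

        Hypothetical helper for illustration only. Returns a
        (type, name, measure) tuple; name is None for two-part keys.
        """
        parts = key.split('.')
        if len(parts) == 3:
            return tuple(parts)
        if len(parts) == 2:
            return parts[0], None, parts[1]
        raise ValueError(f'Unexpected key format: {key!r}')


    def group_by_type(keys):
        """Group keys by their leading type, preserving input order."""
        tree = defaultdict(list)
        for key in keys:
            type_, name, measure = split_key(key)
            # Two-part keys contribute only the measure; three-part keys
            # keep the name so entries under one type stay distinguishable.
            tree[type_].append(measure if name is None else f'{name}.{measure}')
        return dict(tree)


    keys = [
        'metadata.keyspace',
        'metadata.locations',
        'metadata.versions',
        'population.age_bins',
        'population.structure',
        'population.theoretical_minimum_risk_life_expectancy',
        'cause.all_causes.restrictions',
    ]

    # Print a simple two-level tree of the grouped keys.
    for type_, entries in group_by_type(keys).items():
        print(type_)
        for entry in entries:
            print(f'    {entry}')

Grouping on the first component only keeps the sketch short. A fuller version
could nest three-part keys one level deeper, under their name.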