Using externally obtained data: the replay actuator
The replay actuator lets you use results that were obtained from experiments run outside of ado and are stored in external sources such as CSV files. These experiments can't be repeated in ado, and no new data can be added with them, because no actuator exists to run them. However, you may still want to define measurement spaces with them, so that entities that have the relevant data can be sampled and the data used, for example in a custom objective function.
The taking a random walk tutorial uses external data and the replay actuator.
Importing data from a CSV
Often external data is stored in a CSV or table where each row contains measurement results for some entity. One set of columns defines the entity (the thing being measured) and another set of columns the results of one or more experiments on the entity.
To use this data with ado, the first step is to copy it into a samplestore at creation time. When copying the data into the samplestore, you define which columns contain measured values (observed properties) and which columns contain constitutive properties. With this information ado can create an entity for each row. The following example is from taking a random walk:
# Copyright (c) IBM Corporation
# SPDX-License-Identifier: MIT
metadata:
  name: "ml_multi_cloud"
  description: "samplestore initialised with ml_multi_cloud data"
specification:
  module:
    moduleName: orchestrator.core.samplestore.sql
    moduleClass: SQLSampleStore
  copyFrom:
    - module:
        moduleClass: CSVSampleStore
        moduleName: orchestrator.core.samplestore.csv
      storageLocation:
        path: 'examples/ml-multi-cloud/ml_export.csv'
      parameters:
        generatorIdentifier: 'multi-cloud-ml'
        identifierColumn: 'config'
        constitutivePropertyColumns:
          - cpu_family
          - vcpu_size
          - nodes
          - provider
        experiments:
          - experimentIdentifier: 'benchmark_performance'
            propertyMap:
              wallClockRuntime: 'wallClockRuntime'
              status: 'status'
The copyFrom section defines the external sources whose data should be copied into the samplestore. There can be several sources, but here there is just one.
The relevant fields are:
module: These are the values you set to indicate the data is in a CSV file
storageLocation: This is the path to the CSV file
parameters.identifierColumn: This is the column in the CSV, if any, to use as the identifier of the created entities
parameters.constitutivePropertyColumns: This is a list of the columns in the CSV file that define the constitutive properties of the entities
experiments: This section defines the experiments that were used to generate the data in the CSV file
experiments.experimentIdentifier: This is the name for the experiment in ado
experiments.propertyMap: This is a dictionary mapping the names of the experiment's properties, as they will appear in ado, to column names in the CSV
The above YAML says to associate the data in the columns wallClockRuntime and status with an experiment called benchmark_performance that measures properties with the same names.
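To make the row-to-entity idea concrete, here is a minimal Python sketch of how one CSV row could be split into an identifier, constitutive properties, and measured values using the fields described above. It is not ado's implementation: the function rows_to_entities, the dictionary layout it returns, and the two CSV rows are all assumptions made up for illustration.

```python
import csv
import io

# Illustrative data only: these rows are invented and are not taken from ml_export.csv.
CSV_TEXT = """config,cpu_family,vcpu_size,nodes,provider,wallClockRuntime,status
A_f0.0-c0.0-n2,0.0,0.0,2,A,310.5,ok
B_f1.0-c1.0-n4,1.0,1.0,4,B,120.2,ok
"""


def rows_to_entities(csv_text, identifier_column, constitutive_columns,
                     experiment_id, property_map):
    """Hypothetical helper: turn each CSV row into an entity-like dictionary.

    property_map maps property names (as used in ado) to CSV column names,
    mirroring the propertyMap field in the samplestore YAML.
    """
    entities = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        entities.append({
            "identifier": row[identifier_column],
            # Columns that define the thing being measured
            "constitutive_properties": {c: row[c] for c in constitutive_columns},
            # Columns holding results, grouped under the experiment they belong to
            "measurements": {
                experiment_id: {prop: row[col] for prop, col in property_map.items()}
            },
        })
    return entities


entities = rows_to_entities(
    CSV_TEXT,
    identifier_column="config",
    constitutive_columns=["cpu_family", "vcpu_size", "nodes", "provider"],
    experiment_id="benchmark_performance",
    property_map={"wallClockRuntime": "wallClockRuntime", "status": "status"},
)
```

Each element of entities then carries the same three pieces of information the YAML asks you to declare. The csv module reads every value as a string, so any type conversion would be a separate step.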
The propertyMap field lets you handle column headers that are not suitable as property names. For example, if there was a column of measurements on a molecule called Real_pKa (-0.83, 10.58), you might want to associate it with a property called pka instead:
propertyMap:
  pka: "Real_pKa (-0.83, 10.58)"
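The renaming can be pictured with a short Python sketch. It assumes a row is a plain dictionary keyed by column header; the helper apply_property_map, the molecule column, and the value 4.76 are hypothetical and only illustrate the direction of the mapping (property name to column header), not how ado performs it.

```python
def apply_property_map(row, property_map):
    """Hypothetical helper: read values from `row` under cleaner property names.

    property_map maps property name -> CSV column header, as in the YAML above.
    """
    missing = [col for col in property_map.values() if col not in row]
    if missing:
        # Fail early if the map refers to a header the CSV does not have
        raise ValueError(f"Columns not found in CSV row: {missing}")
    return {prop: row[col] for prop, col in property_map.items()}


# Illustrative row; the identifier and value are invented.
row = {"molecule": "mol-1", "Real_pKa (-0.83, 10.58)": "4.76"}
measured = apply_property_map(row, {"pka": "Real_pKa (-0.83, 10.58)"})
print(measured)  # {'pka': '4.76'}
```

Checking for missing headers up front is a design choice of this sketch: a misspelled column name is reported once, rather than surfacing later as an entity with no data.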
Using the external data in a discoveryspace
If you copied entities from an external source into $SAMPLE_STORE_IDENTIFIER, and in the process defined an external experiment called my_experiment, then you can use it in a discoveryspace with:
sampleStoreIdentifier: $SAMPLE_STORE_IDENTIFIER
experiments:
  - actuatorIdentifier: replay
    experimentIdentifier: my_experiment
The ml multi cloud example uses this approach.
How the replay actuator works
Looking at the example in Importing data from a CSV, you might wonder how ado can use it if it does not have an actuator that provides the experiment benchmark_performance!
When a measurement of an experiment associated with the replay actuator is requested on an entity, one of two things happens. If the data is present (because it was copied in), it is reused as normal by ado's memoization mechanism. If there is no data, the entity can't be measured because no real experiment exists, and the replay actuator handles this case correctly: it creates the "No value to replay" messages seen here.
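The two cases can be sketched in a few lines of Python. This is a conceptual model only, under the assumption that stored results can be looked up by entity and experiment: the function request_measurement, its memoize flag, the returned dictionary, and the stored values are hypothetical and do not reflect ado's internal API.

```python
NO_VALUE_MESSAGE = "No value to replay"


def request_measurement(stored_results, entity_id, experiment_id, memoize=True):
    """Hypothetical model of a measurement request for a replay experiment.

    stored_results maps (entity_id, experiment_id) -> {property: value} and
    stands in for data copied into the samplestore.
    """
    key = (entity_id, experiment_id)
    if memoize and key in stored_results:
        # Data was copied in: memoization reuses it and no actuator is called
        return {"source": "memoized", "values": stored_results[key], "message": None}
    # The request reaches the replay actuator, which has no experiment to run
    return {"source": "replay", "values": None, "message": NO_VALUE_MESSAGE}


# Illustrative stored data; the identifier and values are invented.
stored_results = {
    ("entity-a", "benchmark_performance"): {"wallClockRuntime": 310.5, "status": "ok"},
}

hit = request_measurement(stored_results, "entity-a", "benchmark_performance")
miss = request_measurement(stored_results, "entity-b", "benchmark_performance")
bypassed = request_measurement(stored_results, "entity-a", "benchmark_performance",
                               memoize=False)
```

In this model the bypassed request finds no value even though data exists for entity-a, because the lookup only happens when memoization is enabled.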
Important
To use external data via the replay actuator, the relevant operator must be configured to use memoization. With the randomwalk and ray_tune operators, this means the singleMeasurement parameter must be set to True (the default).