Taking a random walk
Note
This example illustrates:
- Describing a set of points and how to measure them using a discoveryspace
- Exploring the discoveryspace by creating an operation that samples and measures the points
- Getting the results of an operation
The scenario¶
When you deploy a workload you have to choose values for workload parameters like the number of CPUs or the node type. To choose a combination of parameters that, for example, maximizes performance, a common strategy is to measure changes in performance by exploring the workload parameter space. This pattern applies to many domains where there is a parameter space to explore.
In this example ado is used to explore the workload parameter space for a cloud application. To explore a workload parameter space you have to:
- define the values of the parameters to test - the parameter space
- define what to test them with - the experiment
- select points from the parameter space and perform the test - the sampling method
Here, we will use the simplest sampling method, random walk, where some number of points are randomly selected without replacement.
Caution
The commands below assume you are in the directory examples/ml-multi-cloud in the ado source repository. See here for how to get the source repository.
Using pre-existing data with ado¶
For this example we will use some pre-existing data. This keeps the example simple and quick to execute, and is also useful in other situations. The data is in the file ml_export.csv and consists of the results of running a benchmark on different cloud hardware configurations from different providers.
In ado such configurations are called entities, and are stored, along with the results of measurements executed on them, in a samplestore. Let's start by copying the data in ml_export.csv into a new samplestore.
To do this, execute:
ado create samplestore -f ml_multicloud_sample_store.yaml --set "copyFrom[0].storageLocation.path"=ml_export.csv
and it will report that a samplestore has been created:
Success! Created sample store with identifier $SAMPLE_STORE_IDENTIFIER
Note the samplestore resource identifier printed by this command; you will need it in the next section. Also try ado get samplestores and you will see an entry for the samplestore you just created.
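For convenience you can keep this identifier in an environment variable, so the later commands can be used verbatim. This is an optional step (assuming a POSIX-style shell); replace the placeholder with the identifier from your own output:
export SAMPLE_STORE_IDENTIFIER="<identifier-from-the-success-message>"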
Info
You only need to create this samplestore once. It can be reused in multiple discoveryspaces or examples that require the ml_export.csv data.
Creating a discoveryspace for the ml-multi-cloud data¶
A discoveryspace describes a set of points and how to measure them. Here we will create a discoveryspace to describe the space explored in ml_export.csv.
Execute:
ado create space -f ml_multicloud_space.yaml --set "sampleStoreIdentifier=$SAMPLE_STORE_IDENTIFIER"
where $SAMPLE_STORE_IDENTIFIER is the identifier you copied in the last step. This will confirm the creation of the discoveryspace with:
Success! Created space with identifier: $DISCOVERY_SPACE_IDENTIFIER
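As before, exporting this identifier lets you run the commands below verbatim (optional, POSIX-style shell; replace the placeholder with your own value):
export DISCOVERY_SPACE_IDENTIFIER="<identifier-from-the-success-message>"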
You can now describe the discoveryspace with:
ado describe space $DISCOVERY_SPACE_IDENTIFIER
where $DISCOVERY_SPACE_IDENTIFIER is the identifier of the discoveryspace resource that was just created. This will output:
Identifier: space-65cf33-a8df39
Entity Space:
Number entities: 48
Categorical properties:
name values
0 provider [A, B, C]
Discrete properties:
name range interval values
0 cpu_family [0, 2] None [0, 1]
1 vcpu_size [0, 2] None [0, 1]
2 nodes [2, 6] None [2, 3, 4, 5]
Measurement Space:
experiment supported
0 replay.benchmark_performance True
'replay.benchmark_performance'
Inputs:
parameter type value parameterized
0 cpu_family required None na
1 vcpu_size required None na
2 nodes required None na
3 provider required None na
Outputs:
target property
0 wallClockRuntime
1 status
Sample Store identifier: 'a8df39'
Note
The set of points is defined by the properties in the Entity Space - here 'cpu_family', 'provider', 'vcpu_size' and 'nodes' - and the values those properties can take.
Tip
Consider why the size of the entityspace is 48. Compare this to the number of rows in ml_export.csv.
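Hint: the entityspace is the Cartesian product of the allowed values of each property listed above, so its size is the product of the number of values each property can take.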
Exploring the discoveryspace¶
Next we will run an operation that will "explore" the discoveryspace we just created. Since we already have the data, ado will transparently identify and reuse it. An example operation file is given in randomwalk_ml_multicloud_operation.yaml. The contents are:
# Copyright (c) IBM Corporation
# SPDX-License-Identifier: MIT
metadata:
  name: 'randomwalk-all'
  description: 'Perform a random walk on all points in a space'
spaces:
- 'space-630588-bfebfe'
operation:
  module:
    moduleClass: RandomWalk
  parameters:
    numberEntities: 48
    batchSize: 1
    mode: 'random'
    samplerType: 'generator'
    singleMeasurement: True
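A brief reading of the parameters (our interpretation of this example file; the ado template operation command shown later in this example prints the authoritative field descriptions): numberEntities: 48 asks the random walk to sample every point in this 48-entity space, and batchSize: 1 means entities are submitted for measurement one at a time.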
To run the operation execute (replacing $DISCOVERY_SPACE_IDENTIFIER with the identifier of the space you created):
ado create operation -f randomwalk_ml_multicloud_operation.yaml --set "spaces[0]=$DISCOVERY_SPACE_IDENTIFIER"
While the operation runs you will see log output like:
(RandomWalk pid=14797) Continuous batching: SUBMIT EXPERIMENT. Submitting experiment replay.benchmark_performance for provider.B-cpu_family.1-vcpu_size.1-nodes.4
(RandomWalk pid=14797)
(RandomWalk pid=14797) Continuous batching: SUMMARY. Entities sampled and submitted: 2. Experiments completed: 1 Waiting on 1 active requests. There are 0 dependent experiments
(RandomWalk pid=14797) Continuous Batching: EXPERIMENT COMPLETION. Received finished notification for experiment in measurement request in group 1: request-randomwalk-0.9.6.dev91+884f713b.dirty-c5ed4b-579021-experiment-benchmark_performance-entities-provider.B-cpu_family.1-vcpu_size.1-nodes.4 (explicit_grid_sample_generator)-requester-randomwalk-0.9.6.dev91+884f713b.dirty-c5ed4b-time-2025-07-29 20:03:00.976809+01:00
The SUBMIT EXPERIMENT line shows the entity sampled - provider.B-cpu_family.1-vcpu_size.1-nodes.4 - and the experiment - replay.benchmark_performance - being submitted. The next line gives a summary of what has happened so far: this is the second entity sampled and submitted; one experiment has completed; and the sampler is waiting on one active experiment before submitting a new one. Finally, the "EXPERIMENT COMPLETION" line indicates the experiment has finished. The operation will end with information like:
config:
  operation:
    module:
      moduleClass: RandomWalk
      moduleName: orchestrator.modules.operators.randomwalk
      modulePath: .
      moduleType: operation
    parameters:
      batchSize: 1
      mode: sequential
      numberEntities: 48
      samplerType: generator
  spaces:
  - space-65cf33-a8df39
created: '2025-06-20T13:03:46.763154Z'
identifier: randomwalk-0.9.4.dev30+564196d4.dirty-b8a233
kind: operation
metadata:
  entities_submitted: 48
  experiments_requested: 74
  operationType: search
  operatorIdentifier: randomwalk-0.9.4.dev30+564196d4.dirty
status:
- event: created
  recorded_at: '2025-06-20T13:03:40.267005Z'
- event: added
  recorded_at: '2025-06-20T13:03:46.764750Z'
- event: started
  recorded_at: '2025-06-20T13:03:46.769169Z'
- event: finished
  exit_state: success
  recorded_at: '2025-06-20T13:03:48.369516Z'
- event: updated
  recorded_at: '2025-06-20T13:03:48.374765Z'
version: v1
Note the value of the identifier field: in the output above it is randomwalk-0.9.4.dev30+564196d4.dirty-b8a233. This is the operation identifier, referred to as $OPERATION_IDENTIFIER in the commands below.
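As with the other identifiers, you can export it so the later commands work verbatim (optional; replace the placeholder with the identifier from your own output):
export OPERATION_IDENTIFIER="<identifier-from-the-output-above>"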
Note
The operation "reuses" existing measurements: this is an ado feature called memoization. ado transparently executes experiments or reuses memoized data as appropriate - so the operator does not need to know, at the time it requests a measurement, whether it needs to be performed or whether previous data can be reused.
Tip
Operations are domain agnostic. If you look in randomwalk_ml_multicloud_operation.yaml you will see there is no reference to characteristics of the discoveryspace we created. Indeed, this operation file could work on any discoveryspace. This shows that operators, like randomwalk, don't have to know domain-specific details. All information about what to explore and how to measure is captured in the discoveryspace.
Looking at the operation output¶
The following command shows the entities sampled by the operation along with their measurement results:
ado show entities operation $OPERATION_IDENTIFIER
result_index identifier benchmark_performance-wallClockRuntime benchmark_performance-status reason valid
request_index
0 0 C_f1.0-c1.0-n4 114.014369 ok True
1 0 A_f0.0-c0.0-n2 335.208518 ok True
2 0 provider.B-cpu_family.0-vcpu_size.1-nodes.5 Externally defined experiments cannot be applied to entities: replay.benchmark_performance. False
3 0 C_f1.0-c0.0-n4 177.723598 ok True
4 0 B_f1.0-c0.0-n5 [168.79178500175476, 141.99024295806885] [ok, ok] True
5 0 A_f1.0-c1.0-n4 116.314171 ok True
6 0 C_f1.0-c1.0-n2 363.285671 ok True
7 0 A_f0.0-c0.0-n5 [106.0709307193756, 130.30512285232544] [ok, ok] True
8 0 C_f0.0-c0.0-n5 [150.9471504688263, 138.0605161190033] [ok, ok] True
9 0 B_f1.0-c0.0-n4 [202.48239731788635, 193.55997109413147] [ok, ok] True
10 0 C_f0.0-c0.0-n2 415.829285 ok True
11 0 B_f0.0-c0.0-n4 [113.87676978111269, 132.5415120124817] [ok, ok] True
12 0 C_f1.0-c0.0-n2 463.396539 ok True
13 0 A_f1.0-c1.0-n5 [96.8471610546112, 105.63729166984558] [ok, ok] True
14 0 A_f0.0-c0.0-n3 [221.5101969242096, 216.394127368927] [ok, ok] True
15 0 B_f1.0-c1.0-n2 298.819305 ok True
16 0 C_f1.0-c1.0-n3 [154.9813470840454, 168.34859228134155] [ok, ok] True
17 0 C_f0.0-c1.0-n2 309.842324 ok True
18 0 provider.B-cpu_family.1-vcpu_size.1-nodes.3 Externally defined experiments cannot be applied to entities: replay.benchmark_performance. False
19 0 B_f0.0-c0.0-n5 [113.88505148887634, 103.90595746040344, 112.7056987285614] [ok, ok, ok] True
20 0 C_f0.0-c1.0-n3 [168.9163637161255, 174.0335624217987] [ok, ok] True
21 0 B_f0.0-c0.0-n2 [228.14362454414368, 225.1791422367096] [ok, ok] True
22 0 B_f0.0-c1.0-n2 [166.74843192100525, 184.935049533844] [ok, ok] True
23 0 provider.B-cpu_family.1-vcpu_size.1-nodes.5 Externally defined experiments cannot be applied to entities: replay.benchmark_performance. False
24 0 B_f1.0-c0.0-n2 346.070996 ok True
25 0 C_f0.0-c0.0-n4 188.090878 ok True
26 0 A_f1.0-c1.0-n2 291.904456 ok True
27 0 C_f1.0-c0.0-n3 [244.33887457847595, 598.8834657669067] [ok, Timed out.] True
28 0 A_f0.0-c1.0-n2 272.997822 ok True
29 0 provider.B-cpu_family.1-vcpu_size.1-nodes.4 Externally defined experiments cannot be applied to entities: replay.benchmark_performance. False
30 0 A_f1.0-c1.0-n3 [155.02856159210205, 151.58562421798706] [ok, ok] True
31 0 A_f0.0-c1.0-n4 106.670121 ok True
32 0 A_f1.0-c0.0-n3 [206.74496150016785, 236.1715066432953] [ok, ok] True
33 0 C_f0.0-c0.0-n3 [269.0906641483307, 240.07358503341675] [ok, ok] True
34 0 A_f1.0-c0.0-n2 378.31657 ok True
35 0 provider.B-cpu_family.0-vcpu_size.1-nodes.3 Externally defined experiments cannot be applied to entities: replay.benchmark_performance. False
36 0 B_f1.0-c0.0-n3 [273.7120273113251, 220.19828414916992] [ok, ok] True
37 0 A_f1.0-c0.0-n4 158.706395 ok True
38 0 A_f0.0-c0.0-n4 145.129484 ok True
39 0 A_f0.0-c1.0-n3 [170.15659737586975, 168.36590766906738] [ok, ok] True
40 0 A_f0.0-c1.0-n5 [86.23016095161438, 84.45346999168396] [ok, ok] True
41 0 B_f0.0-c0.0-n3 [184.44801592826843, 153.51639366149902, 176.28814435005188] [ok, ok, ok] True
42 0 C_f1.0-c1.0-n5 [100.97977471351624, 92.17141437530518] [ok, ok] True
43 0 provider.B-cpu_family.0-vcpu_size.1-nodes.4 Externally defined experiments cannot be applied to entities: replay.benchmark_performance. False
44 0 C_f1.0-c0.0-n5 [136.3071050643921, 135.47050046920776] [ok, ok] True
45 0 C_f0.0-c1.0-n4 121.424925 ok True
46 0 A_f1.0-c0.0-n5 [117.94136571884157, 135.91092538833618] [ok, ok] True
47 0 C_f0.0-c1.0-n5 [95.86326050758362, 85.67946743965149] [ok, ok] True
Tip
Some things to note and consider:
- The table is in the order the points were measured
- Some points have multiple measurements (cf. the size of the entityspace versus the number of rows in ml_export.csv)
- Some points were not measured - these are points in the discoveryspace for which no data was present to replay
Exploring Further¶
Here are a variety of commands you can try after executing the example above:
Viewing entities¶
There are multiple ways to view the entities related to a discoveryspace. Try:
ado show entities space $DISCOVERY_SPACE_IDENTIFIER
ado show entities space $DISCOVERY_SPACE_IDENTIFIER --aggregate mean
ado show entities space $DISCOVERY_SPACE_IDENTIFIER --include unmeasured
ado show entities space $DISCOVERY_SPACE_IDENTIFIER --property-format target
ado show details space $DISCOVERY_SPACE_IDENTIFIER
Resource provenance¶
The related sub-command shows resource provenance, for example:
ado show related operation $OPERATION_IDENTIFIER
Operation timeseries¶
The following commands give more details of the operation timeseries:
ado show results operation $OPERATION_IDENTIFIER
ado show requests operation $OPERATION_IDENTIFIER
Resource templates¶
Another helpful command is template which will output a default example of a resource YAML along with an (optional) description of its fields. Try:
ado template operation --include-schema --operator-name random_walk
Rerun¶
An interesting thing to try is to run the operation again and compare the output of show entities operation for the two operations, and of show entities space.
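A minimal sketch of this comparison, assuming the environment variables set earlier and that you note the identifier printed for the second operation:
ado create operation -f randomwalk_ml_multicloud_operation.yaml --set "spaces[0]=$DISCOVERY_SPACE_IDENTIFIER"
ado show entities operation $OPERATION_IDENTIFIER
ado show entities operation <identifier-of-the-second-operation>
ado show entities space $DISCOVERY_SPACE_IDENTIFIER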
Takeaways¶
- create-explore-view pattern: A common pattern in ado is to create a discoveryspace to describe a set of points to measure, create operations on it to explore or analyse it, and then view the results
- entity space and measurement space: A discoveryspace consists of an entityspace - the set of points to measure - and a measurementspace - the set of experiments to apply to them
- operations are domain agnostic: ado enables operations to run on multiple different domains without modification
- memoization: By default ado will identify if a measurement has already been completed on an entity and reuse it
- provenance: ado stores the relationship between the resources it creates
- results viewing: ado show entities outputs the data in a discoveryspace or measured in an operation
- measurement timeseries: The sequence (timeseries) of measurements, successful or not, of each operation is preserved
- discoveryspace views: By default ado show entities space only shows successfully measured entities, but you can see what has not been measured if you want
What's next¶
- Search using an optimizer: Try the Search a space with an optimizer example to see how you can use RayTune, and define custom experiments, via ado.
- Discovering important entity space dimensions: Try the Identify the important dimensions of a space example to see how you can use ado to discover which entity space dimensions most influence a target metric.