Contributing a Benchmark for a Model¶
This guide walks you through adding a benchmark for an algorithm that you have already registered in Algorithm Nexus. If you have not yet registered your algorithm, start with Contributing a Python Algorithm Package to Algorithm Nexus.
There are four steps to add a benchmark for your algorithm:
- Find or create a benchmark experiment
- Register the experiment with your Nexus package
- Define a benchmark instance for your algorithm
- Run the benchmark
Prerequisites¶
Before you begin, ensure:
- Your algorithm is registered in a Nexus package under
packages/<package-name>/models/<model-name>/model.yaml. See Contributing a Python Algorithm Package if not. uvis installed on your system.- You have a fork of the Algorithm Nexus repository checked out locally and your development environment is set up:
Step 1: Find or create a benchmark experiment¶
A benchmark experiment is a Python package that defines how to evaluate an algorithm. First check whether a suitable experiment already exists in Algorithm Nexus before creating a new one.
Find an existing experiment¶
List all benchmark experiments registered across Algorithm Nexus:
If an experiment covers the evaluation you need, note its experiment ID and the benchmark package that provides it — you will reference both in the next step. Skip ahead to Step 2.
To inspect the inputs, outputs, and parameters of a specific experiment, install
the benchmark package it belongs to and use the ado CLI:
Create a new experiment¶
If no existing experiment fits your needs, create one following the
ADO custom experiment template.
A benchmark experiment is a standard ado custom experiment packaged as a
Python package.
Place the package under your Nexus package directory:
packages/<package-name>/
└── benchmark_packages/
└── <benchmark-package-name>/
├── pyproject.toml
└── src/
Step 2: Register the experiment with your Nexus package¶
Edit packages/<package-name>/nexus.yaml to declare that it uses a the
experiment from the given package
package:
name: <package-name>
benchmark_packages:
# Local package stored in this repository
- requirement_specifier: "./packages/<package-name>/benchmark_packages/<experiment-package-name>"
experiments:
- "<experiment-id>"
# Remote package hosted on GitHub
- requirement_specifier: "https://github.com/<org>/<experiment-package-repo>"
experiments:
- "<experiment-id>"
# Package published on PyPI
- requirement_specifier: "<experiment-package-name>"
experiments:
- "<experiment-id>"
Each requirement_specifier must resolve to a Python package that contains an
ado custom experiment.
Validate the updated package configuration:
Fix any validation errors before proceeding.
Step 3: Define a benchmark instance for your model¶
A benchmark instance specifies executing a registered experiment on your
model/algorithm with a specific set of parameter values (workload). Each
benchmark instance is a folder under your model's benchmark_instances/
directory, that contains a ado space.yaml
Create the directory,
Create
packages/<package-name>/models/<model-name>/benchmark_instances/<instance-name>/space.yaml:
Example:
entitySpace:
- identifier: dataset
propertyDomain:
values: ["<dataset-name>"]
- identifier: split
propertyDomain:
values: ["test"]
experiments:
- actuatorIdentifier: custom_experiments
experimentIdentifier: <experiment-id>
The entitySpace defines the workload parameters for this instance. The
experimentIdentifier must match one of the experiment IDs you registered in
nexus.yaml in the previous step.
For full details on space.yaml syntax, see the
ADO discoveryspace documentation.
Validate the package again to confirm the instance is well-formed:
Step 4: Run the benchmark¶
Install the benchmark package and run the benchmark instance locally using the
ado CLI. First save the following operation configuration to a file op.yaml.
It will execute the experiment on all points in your space.
metadata:
name: randomwalk-all
spaces:
- dynamically_inserted
operation:
module:
operatorName: random_walk
operationType: search
parameters:
numberEntities: all
samplerConfig:
samplerType: generator
mode: random
Then,
uv pip install packages/<package-name>/benchmark_packages/<experiment-package-name>
ado create space -f <space-yaml-path>
ado create operation -f op.yaml --use-latest space
See the ADO documentation for the full set of execution options including parameter sweeps and remote execution.
Step 5: Commit Changes and Open a Pull Request¶
Add and commit your changes:
git add packages/<package-name>
git add benchmark_packages #required when also adding a benchmark package during Step 1
git commit -s -m "feat(benchmark): Add <instance-name> benchmark for <model-name>"
git push origin <your-branch>
Open a pull request from your fork to the Algorithm Nexus main branch.
Optional: Mapping to logical benchmarks¶
Once your benchmark is running, you can connect it to a logical benchmark so the results can be compared with results from other experiments targeting the same type of problem. This requires creating a benchmark biding that maps the experiments internal parameters and metrics to a shared, canonical vocabulary.
This is done in two parts:
- Creating the logical benchmark definition (if it does not exist)
- Adding the benchmark binding for your experiment to the logical benchmark
Check for an existing logical benchmark definition¶
Logical benchmark definitions live in the benchmarks/ directory at the root of
the Algorithm Nexus repository. Browse that directory to see whether a
definition already exists for your domain (e.g. inference_serving.yaml,
max_cut_solver.yaml).
- If a definition exists — you add a binding for experiment to it. Continue to Add a benchmark binding for your experiment.
- If no definition exists — you will create one first. Continue to Create a logical benchmark definition.
Create a logical benchmark definition¶
If no definition exists, create a YAML file in
benchmarks/<logical-benchmark-id>.yaml. This file establishes the canonical
vocabulary for your benchmark — property names, valid values, and metric names.
A minimal example:
logicalBenchmark:
benchmarkIdentifier: inference_serving
description: >
Evaluation of AI model inference serving throughput and latency under
controlled traffic conditions.
target:
- identifier: model
metadata:
description: "The id of the AI model being benchmarked"
properties:
- identifier: dataset
metadata:
description: "Dataset used for inference requests."
# No domain: Will be OPEN_CATEGORICAL_DOMAIN by default
- identifier: workload
metadata:
description: "Traffic pattern or workload profile."
domain:
values: ["steady_state_heavy", "poisson_bursty", "light_load"]
metrics:
- throughput_tokens_per_second
- time_to_first_token_ms
owner: "@vllm-team"
bindings: [] # List of bindings to this benchmark
For the full schema and a complete worked example, see Section 2 of the Benchmark Metadata Convention.
Add a benchmark binding for your experiment¶
The binding is added to the list of bindings in the logical benchmark yaml.
logicalBenchmark: ... #logical benchmark fields
bindings:
- experiment:
experimentIdentifer: myexperiment
experimentVersion: 1.2
actuatorIdentifier: customexperiments
targetMapping: ...
- #next binding
A binding contains the following mapping sections: `
targetMapping: Maps the target property of the benchmark to an experiment input propertypropertyMappings: Maps the other inputs of the logical benchmark to inputs of the experimentmetricMappings: Maps outputs of the logical benchmark to the outputs of the experiment
An example binding is
benchmarkIdentifier: inference_serving
experiment:
actuatorIdentifier: vllm_performance
experimentIdentifier: guide_llm_runner
experimentVersion: 2.0.0 # The binding only uses the major version
targetMapping: model_name #The property model_name maps to the target property defined in the benchmark
propertyMappings:
- benchmark:
identifier: dataset
experiment:
identifier: input_data_path # guide-llm's internal param name
- categoricalValue:
property:
identifier: workload
value: steady_state_heavy
predicate:
- identifier: traffic_shape
domain:
values: ["constant"]
- identifier: concurrency
domain:
domainRange: [100, 1000]
variableType: CONTINUOUS_VARIABLE_TYPE
- categoricalValue:
property:
identifier: workload
value: steady_state_heavy
predicate:
- identifier: traffic_shape
domain:
values: ["poisson"]
- identifier: concurrency
domain:
domainRange: [1, 100]
variableType: CONTINUOUS_VARIABLE_TYPE
metricMapping:
- benchmark:
identifier: throughput_tokens_per_second
experiment:
identifier: throughput_rps # guide-llm's internal param name
- benchmark:
identifier: time_to_first_token_ms
experiment:
identifier: ttft_ms
For the full benchmark binding schema and worked examples, see the Benchmark Metadata Convention.
Getting Help¶
If you encounter issues:
- Check the Benchmark Integration Design
- Check the Benchmark Metadata Convention
- Refer to the ADO documentation
- Search existing issues on GitHub
- Open a new issue with details about your problem