Skip to content

Contributing a Benchmark for a Model

This guide walks you through adding a benchmark for an algorithm that you have already registered in Algorithm Nexus. If you have not yet registered your algorithm, start with Contributing a Python Algorithm Package to Algorithm Nexus.

There are four steps to add a benchmark for your algorithm:

  1. Find or create a benchmark experiment
  2. Register the experiment with your Nexus package
  3. Define a benchmark instance for your algorithm
  4. Run the benchmark

Prerequisites

Before you begin, ensure:

  • Your algorithm is registered in a Nexus package under packages/<package-name>/models/<model-name>/model.yaml. See Contributing a Python Algorithm Package if not.
  • uv is installed on your system.
  • You have a fork of the Algorithm Nexus repository checked out locally and your development environment is set up:
cd algorithm-nexus
uv sync --group dev --extra cli

Step 1: Find or create a benchmark experiment

A benchmark experiment is a Python package that defines how to evaluate an algorithm. First check whether a suitable experiment already exists in Algorithm Nexus before creating a new one.

Find an existing experiment

List all benchmark experiments registered across Algorithm Nexus:

uv run nexus list benchmark-experiments

If an experiment covers the evaluation you need, note its experiment ID and the benchmark package that provides it — you will reference both in the next step. Skip ahead to Step 2.

To inspect the inputs, outputs, and parameters of a specific experiment, install the benchmark package it belongs to and use the ado CLI:

uv pip install <benchmark-package>
ado describe experiment <experiment-id>

Create a new experiment

If no existing experiment fits your needs, create one following the ADO custom experiment template. A benchmark experiment is a standard ado custom experiment packaged as a Python package.

Place the package under your Nexus package directory:

packages/<package-name>/
└── benchmark_packages/
    └── <benchmark-package-name>/
        ├── pyproject.toml
        └── src/

Step 2: Register the experiment with your Nexus package

Edit packages/<package-name>/nexus.yaml to declare that it uses a the experiment from the given package

package:
    name: <package-name>

    benchmark_packages:
        # Local package stored in this repository
        - requirement_specifier: "./packages/<package-name>/benchmark_packages/<experiment-package-name>"
          experiments:
              - "<experiment-id>"

        # Remote package hosted on GitHub
        - requirement_specifier: "https://github.com/<org>/<experiment-package-repo>"
          experiments:
              - "<experiment-id>"

        # Package published on PyPI
        - requirement_specifier: "<experiment-package-name>"
          experiments:
              - "<experiment-id>"

Each requirement_specifier must resolve to a Python package that contains an ado custom experiment.

Validate the updated package configuration:

uv run nexus validate packages/<package-name>

Fix any validation errors before proceeding.

Step 3: Define a benchmark instance for your model

A benchmark instance specifies executing a registered experiment on your model/algorithm with a specific set of parameter values (workload). Each benchmark instance is a folder under your model's benchmark_instances/ directory, that contains a ado space.yaml

Create the directory,

mkdir -p packages/<package-name>/models/<model-name>/benchmark_instances/<instance-name>

Create packages/<package-name>/models/<model-name>/benchmark_instances/<instance-name>/space.yaml:

Example:

entitySpace:
    - identifier: dataset
      propertyDomain:
          values: ["<dataset-name>"]
    - identifier: split
      propertyDomain:
          values: ["test"]

experiments:
    - actuatorIdentifier: custom_experiments
      experimentIdentifier: <experiment-id>

The entitySpace defines the workload parameters for this instance. The experimentIdentifier must match one of the experiment IDs you registered in nexus.yaml in the previous step.

For full details on space.yaml syntax, see the ADO discoveryspace documentation.

Validate the package again to confirm the instance is well-formed:

uv run nexus validate packages/<package-name>

Step 4: Run the benchmark

Install the benchmark package and run the benchmark instance locally using the ado CLI. First save the following operation configuration to a file op.yaml. It will execute the experiment on all points in your space.

metadata:
    name: randomwalk-all
spaces:
    - dynamically_inserted
operation:
    module:
        operatorName: random_walk
        operationType: search
    parameters:
        numberEntities: all
        samplerConfig:
            samplerType: generator
            mode: random

Then,

uv pip install packages/<package-name>/benchmark_packages/<experiment-package-name>
ado create space -f <space-yaml-path>
ado create operation -f op.yaml --use-latest space

See the ADO documentation for the full set of execution options including parameter sweeps and remote execution.

Step 5: Commit Changes and Open a Pull Request

Add and commit your changes:

git add packages/<package-name> 
git add benchmark_packages #required when also adding a benchmark package during Step 1
git commit -s -m "feat(benchmark): Add <instance-name> benchmark for <model-name>"
git push origin <your-branch>

Open a pull request from your fork to the Algorithm Nexus main branch.


Optional: Mapping to logical benchmarks

Once your benchmark is running, you can connect it to a logical benchmark so the results can be compared with results from other experiments targeting the same type of problem. This requires creating a benchmark biding that maps the experiments internal parameters and metrics to a shared, canonical vocabulary.

This is done in two parts:

  1. Creating the logical benchmark definition (if it does not exist)
  2. Adding the benchmark binding for your experiment to the logical benchmark

Check for an existing logical benchmark definition

Logical benchmark definitions live in the benchmarks/ directory at the root of the Algorithm Nexus repository. Browse that directory to see whether a definition already exists for your domain (e.g. inference_serving.yaml, max_cut_solver.yaml).

Create a logical benchmark definition

If no definition exists, create a YAML file in benchmarks/<logical-benchmark-id>.yaml. This file establishes the canonical vocabulary for your benchmark — property names, valid values, and metric names.

A minimal example:

logicalBenchmark:
  benchmarkIdentifier: inference_serving
  description: >
      Evaluation of AI model inference serving throughput and latency under
      controlled traffic conditions.
  target:
      - identifier: model
        metadata:
            description: "The id of the AI model being benchmarked"
  properties:
      - identifier: dataset
        metadata:
            description: "Dataset used for inference requests."
            # No domain: Will be OPEN_CATEGORICAL_DOMAIN by default
      - identifier: workload
        metadata:
            description: "Traffic pattern or workload profile."
        domain:
            values: ["steady_state_heavy", "poisson_bursty", "light_load"]
  metrics:
      - throughput_tokens_per_second
      - time_to_first_token_ms
  owner: "@vllm-team"
bindings: [] # List of bindings to this benchmark

For the full schema and a complete worked example, see Section 2 of the Benchmark Metadata Convention.

Add a benchmark binding for your experiment

The binding is added to the list of bindings in the logical benchmark yaml.

logicalBenchmark: ... #logical benchmark fields
bindings:
    - experiment:
          experimentIdentifer: myexperiment
          experimentVersion: 1.2
          actuatorIdentifier: customexperiments
      targetMapping: ...
    -  #next binding

A binding contains the following mapping sections: `

  • targetMapping: Maps the target property of the benchmark to an experiment input property
  • propertyMappings: Maps the other inputs of the logical benchmark to inputs of the experiment
  • metricMappings: Maps outputs of the logical benchmark to the outputs of the experiment

An example binding is

benchmarkIdentifier: inference_serving
experiment:
    actuatorIdentifier: vllm_performance
    experimentIdentifier: guide_llm_runner
    experimentVersion: 2.0.0 # The binding only uses the major version
targetMapping: model_name #The property model_name maps to the target property defined in the benchmark
propertyMappings:
    - benchmark:
          identifier: dataset
      experiment:
          identifier: input_data_path # guide-llm's internal param name
    - categoricalValue:
          property:
              identifier: workload
          value: steady_state_heavy
      predicate:
          - identifier: traffic_shape
            domain:
                values: ["constant"]
          - identifier: concurrency
            domain:
                domainRange: [100, 1000]
                variableType: CONTINUOUS_VARIABLE_TYPE
    - categoricalValue:
          property:
              identifier: workload
          value: steady_state_heavy
      predicate:
          - identifier: traffic_shape
            domain:
                values: ["poisson"]
          - identifier: concurrency
            domain:
                domainRange: [1, 100]
                variableType: CONTINUOUS_VARIABLE_TYPE
metricMapping:
    - benchmark:
          identifier: throughput_tokens_per_second
      experiment:
          identifier: throughput_rps # guide-llm's internal param name
    - benchmark:
          identifier: time_to_first_token_ms
      experiment:
          identifier: ttft_ms

For the full benchmark binding schema and worked examples, see the Benchmark Metadata Convention.


Getting Help

If you encounter issues:

  1. Check the Benchmark Integration Design
  2. Check the Benchmark Metadata Convention
  3. Refer to the ADO documentation
  4. Search existing issues on GitHub
  5. Open a new issue with details about your problem