Algorithm Nexus Benchmarking System¶

Executive Summary¶

This document proposes a benchmarking system for the Algorithm Stack packages within Algorithm Nexus based on the benchmarking requirements.

An analysis of the benchmarking requirements indicates that ado natively fulfills the majority of the complex orchestration, data provenance, and scalable execution needs for evaluating benchmark targets against defined workloads. By combining ado and Ray with specific Algorithm Nexus Extensions, integration definitions, and robust administrative processes, the team can deliver a comprehensive, end-to-end benchmarking solution capable of generating repeatable benchmark results.

To fully satisfy these requirements, the design of the Benchmarking System is divided into three Architectural Pillars: System Architecture (The Mechanisms), Operational Architecture (The Infrastructure), and Governance & Conventions (The Standards).

1. System Architecture (The Mechanisms)¶

This pillar details the technical components, automated mechanisms, and execution engines that make up the benchmarking system.

1.1 Two-Tiered Packaging Architecture¶

The system utilizes a two-tiered architecture to strictly separate the definition of a benchmark experiment from its application to a specific AI model.

Tier	Component	Responsibility & Behavior
Tier 1: Benchmark Experiment Definition	`ado` core	Serves as the core capability engine. It provides the framework to define, package, and execute a self-contained benchmark experiment. It enforces strict input/output interfaces, handles versioning of the experiment logic, and manages the execution provenance independently of the target model.
Tier 2: Benchmark Integration	`nexus` Package	While `ado` knows how to run an experiment, `nexus` dictates when and against what. It provides the declarative metadata required to define a benchmark: bind a specific target (the model) to a specific `ado` benchmark experiment and a defined workload.

1.2 Event & Orchestration Broker¶

GitHub acts as the primary interface and event broker for the system. It captures user intent and system state changes (e.g., deployments or releases), routing these events to the underlying execution infrastructure. It acts as the technical bridge between human operations and the execution engine.

1.3 Execution and Orchestration Engine¶

The execution architecture relies on Ray and ado. ado leverages Ray to handle parameter sweeps and single benchmark instances mechanically. Thanks to ado's data recording capabilities, if one instance in a sweep fails ado continues orchestration and commits successful results to the database. Ray allows the underlying experiments to explicitly request hardware resources (e.g., @ray.remote(num_gpus=1)) via task decorators. Ray can also create per-task execution environments, allowing tests with incompatible requirements to ado-core or other experiments to execute.

1.4 Centralized Data & Discovery¶

The architecture utilizes ado distributed projects capabilities to store data, enforcing a uniform schema for results and custom metadata dictionaries. Furthermore, ado automatically registers available experiments upon environment installation, providing built-in commands to list and discover them.

System Architecture Requirements Matching¶

Requirement	Name	Fulfillment Strategy	Component	Proposed Solution
REQ 1.1	Input/Output Specification	Technology	`ado` core	`ado` defines a standard programmatic input/output schema for benchmark experiments.
REQ 1.2	Python Package	Technology	`ado` core	`ado` experiments are written purely in Python and distributed as standard packages.
REQ 1.5	Lifecycle Management	Technology	`ado` core	`ado` natively provides a flag for experiments to mark deprecation.
REQ 2.2	Benchmark Experiment Discovery	Technology	nexus	The nexus CLI provides built-in commands to list all registered experiments.
REQ 2.4	Benchmark Discovery	Technology	nexus	The nexus CLI will enable listing all benchmarks defined in all packages (or a specific package or for a specific model).
REQ 3.3	Benchmark Experiment Reuse	Technology	`ado` + nexus	Once registered as described in REQ 2.1, experiments can be universally referenced across projects.
REQ 4.1	Single & Sweep Execution	Technology	Ray + `ado`	`ado` provides the capability to execute single experiment instances and parameter sweeps.
REQ 4.2	Resource Specification	Technology	Ray	Ray allows a benchmark experiment to make explicit hardware resource requests.
REQ 4.4	Result Capture	Technology	`ado` DB	`ado` commits successful benchmark results even if parallel instances fail.
REQ 4.5	Standardized Error Reporting	Technology	`ado` core	Handled natively via standard Python error handling and custom `ado` return payloads.
REQ 4.8	Local Execution	Technology	`ado` core	`ado` supports local execution for rapid prototyping on local compute.
REQ 5.1	Centralized Results Storage	Technology	`ado` DB	`ado` provides centralized remote results storage.
REQ 5.2	Common Results Schema	Technology	`ado` core	`ado` enforces a uniform, structured schema for all stored results.
REQ 5.3	Custom Metadata Support	Technology	`ado` DB	`ado` supports returning custom metadata dicts alongside core results.

2. Operational Architecture (Workflows & Infrastructure)¶

This pillar details how the system is deployed, maintained, triggered, and scaled by the administrative team and CI/CD pipelines.

2.1 Infrastructure Configuration¶

Admins configure the Ray cluster on K8s via KubeRay, with hard namespace limits to maintain resource quotas during massive sweeps. To optimize performance, the underlying cluster mounts a shared persistent filesystem (via PVC) for workload dataset caching. Ray dynamically isolates worker node environments to prevent dependency version clashes between concurrent evaluations.

2.2 Orchestration Triggers & Automation¶

The mechanism for triggering centralized administrative evaluations is fully automated via GitHub. These are triggered mechanically via automated GitHub events (such as code deployments or releases) or on-demand utilizing GitHub ChatOps. They are executed with a combination of GitHub Actions (on Event or on schedule) and polling Runners. Global orchestration across multiple packages utilizes ado's native search space semantics.

Operational Requirements Matching¶

Requirement	Name	Fulfillment Strategy	Component	Proposed Solution
REQ 2.1	Benchmark Experiment Package Registration	Technology + Process	nexus	Registering a benchmark experiments package involves adding metadata describing the package and the experiments it contains to a Nexus package, as well as validating that all referenced packages can be installed together. Packages provided via any mechanism outlined in REQ 3.2 can be registered. [PENDING: Nexus Test Dependencies Handling]
REQ 2.3	Benchmark Registration	Technology + Process	nexus + `ado`	Registering a benchmark (see REQ 3.1 for specification) involves adding the relevant files and metadata to a nexus package model directory AND those files passing ado+nexus validation. [PENDING: Nexus Model Benchmark Specification Decision].
REQ 4.3	Resource Limits	Technology + Process	Ray Cluster	Admins configure Ray clusters to set hard quotas per instance.
REQ 4.6	Logging	Technology + Process	Ray Cluster	Admins configure infrastructure to persist logs without indefinite retention.
REQ 6.2	Isolated Execution	Technology + Process	`ado` + Ray Runtime	Users can describe the benchmark experiment dependencies in the benchmark experiment package using `ado` + Ray semantics. Ray will dynamically create isolated virtual environments per worker.
REQ 6.3	Persistent Filesystem	Technology + Process	Ray / K8s	Admins configure the cluster to mount a shared PVC for dataset caching.
REQ 7.2	Admin-Triggered Evaluation Execution	Technology + Process	GitHub	Triggered via automated GitHub events or on-demand via GitHub ChatOps.

3. Governance & Conventions (Policies & Standards)¶

This pillar outlines the human-in-the-loop requirements, conventions, and security policies that contributors must adhere to in order for the technical and operational systems to function correctly.

3.1 Trust and Security Model¶

Nexus relies on an organizational trust model. Only authorized IBMers can submit code. To enforce security, all packages undergo mandatory standard CI/CD CVE scans before they are allowed into the execution environment.

3.2 Packaging and Versioning Conventions¶

While ado provides the mechanism for versioning and reproducibility, contributors are bound by strict conventions to ensure uniqueness and reliability.

Reproducibility Contract: Contributors must adhere to the convention that an experiment name plus specific parameter values defines a unique, repeatable execution. Repeatable here means the experiment instance use an identical process not produces the same result, as experiments can be stochastic.
Versioning: ado provides mechanisms for experiment versioning but does not prescribe any. The main convention w.r.t experiment versioning is that whatever mechanism is chosen ensures the Reproducibility Contract
Data Handling Guidelines: Workload data must either be bundled directly inside the benchmark experiment package or programmed to download dynamically at execution time.

3.3 Governance of Sweeps¶

Because parameter sweeps are computationally expensive, they must undergo particular scrutiny, with admins retaining manual and automated review oversight. Sweep configurations must pass GitHub PR approvals prior to being submitted to the Ray cluster for execution.

Governance Requirements Matching¶

Requirement	Name	Fulfillment Strategy	Component	Proposed Solution
REQ 1.3	Versioning	Technology + Convention	`ado` + nexus	Users leverage `ado` capabilities to specify versions while adhering to semantic naming standards. [PENDING: Versioning Semantics Decision]
REQ 1.4	Reproducible Execution	Technology + Convention	`ado` + nexus	Users must adhere to ado`'s convention that a given experiment name encodes a unique, repeatable experiment.
REQ 1.7	Required Data	Technology + Convention	`ado`	Developers bundle data with benchmark experiment packages or the experiment downloads it dynamically.
REQ 4.7	Self-Contained Execution	Technology + Convention	`ado`	As REQ 1.7
REQ 3.1	Benchmark Specification	Technology + Process	`ado` config	Users specify benchmarks by creating an `ado` config that binds an experiment to a workload.
REQ 3.2	Providing Benchmark Experiments	Technology + Process	`ado` + nexus	Benchmark experiment packages (following Standardized Benchmarking Packaging Protocol) can be provided in a Nexus package in the Algorithm Nexus repo, on PyPI or GitHub.
REQ 6.1	Admin Security	Process	CI	Secured via trusted code submissions and mandatory CVE scans.
REQ 7.1	Nexus-Level Benchmarks	Technology + Process	`ado` + nexus	These are benchmarks defined independently using `ado` configuration semantics and stored in the nexus repository. [PENDING: Nexus Repo Layout Decision]
REQ 7.3	Sweep Review and Approval	Process	GitHub PRs	Admins retain review oversight of sweep configurations via GitHub PR workflows.

Open Questions¶

The following questions/decisions are open and can be resolved in subsequent issues.

Versioning Semantics for REQ 1.3
- Rules and conventions for versioning benchmark experiments
Nexus Test Dependencies Handling for REQ 2.2
- The process for validating that the benchmark packages referenced by a nexus package can be installed together
Nexus Model Benchmark Specification Decision for REQ 2.3
- Exact YAML metadata and directory structure used to add a Nexus Model benchmark specification
Nexus Repo Layout for REQ 7.1
- Exact YAML metadata and directory structure used to add a Nexus benchmark specification

How Nexus package developers will use the system¶

Contributing a benchmark experiment¶

Developers write and package the experiment according to the standardized packaging protocol (REQ 3.2) i.e. as an ado custom experiment or actuator+experiments. They put the package on GitHub, PyPI or in the algorithm nexus repo (REQ 3.2).

Defining the benchmark experiment packages used by a nexus package¶

Nexus package owners register the benchmark experiment packages, and the experiments they want to use, by referencing them in their nexus package's metadata (REQ 2.1) and validating that all the benchmark experiments the package needs can be installed together.

Defining a benchmark to use for a model¶

First developers can:

use nexus CLI and ado CLI to discover existing benchmark experiments (REQ 2.2)
use nexus CLI to discover existing benchmark specifications (REQ 2.4)

They then define their benchmark using an ado configuration (REQ 3.1) adding this to the model directory of the relevant nexus package (REQ 2.3). The benchmark configuration can reference any benchmark experiment registered by the Nexus package. If the benchmark experiment they need is not registered by the nexus package they can add it.. The benchmark configuration can also be based on one discovered via the Nexus CLI.