Algorithm Nexus Benchmarking System

Executive Summary

This document proposes a benchmarking system for the Algorithm Stack packages within Algorithm Nexus based on the benchmarking requirements.

An analysis of the benchmarking requirements indicates that ado natively fulfills the majority of the complex orchestration, data provenance, and scalable execution needs for evaluating benchmark targets against defined workloads. By combining ado and Ray with specific Algorithm Nexus Extensions, integration definitions, and robust administrative processes, the team can deliver a comprehensive, end-to-end benchmarking solution capable of generating repeatable benchmark results.

To fully satisfy these requirements, the design of the Benchmarking System is divided into three Architectural Pillars: System Architecture (The Mechanisms), Operational Architecture (Workflows & Infrastructure), and Governance & Conventions (Policies & Standards).


1. System Architecture (The Mechanisms)

This pillar details the technical components, automated mechanisms, and execution engines that make up the benchmarking system.

1.1 Two-Tiered Packaging Architecture

The system utilizes a two-tiered architecture to strictly separate the definition of a benchmark experiment from its application to a specific AI model.

Tier | Component | Responsibility & Behavior
Tier 1: Benchmark Experiment Definition | ado core | Serves as the core capability engine. It provides the framework to define, package, and execute a self-contained benchmark experiment. It enforces strict input/output interfaces, handles versioning of the experiment logic, and manages the execution provenance independently of the target model.
Tier 2: Benchmark Integration | nexus Package | While ado knows how to run an experiment, nexus dictates when and against what. It provides the declarative metadata required to define a benchmark: bind a specific target (the model) to a specific ado benchmark experiment and a defined workload.
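
To make the Tier 2 binding concrete, the following is a minimal sketch of a benchmark as a (target, experiment, workload) triple. It is not the nexus metadata format, which is still an open decision; the class, field names, and example identifiers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Benchmark:
    """Hypothetical Tier 2 binding: one target, one experiment, one workload."""
    target: str      # the model under evaluation
    experiment: str  # identifier of a Tier 1 benchmark experiment
    workload: str    # identifier of the workload to run


def benchmarks_for_target(benchmarks, target):
    """Return the benchmarks bound to the given target."""
    return [b for b in benchmarks if b.target == target]


# The same experiment is reused for two different targets (illustrative names).
catalog = [
    Benchmark(target="model-a", experiment="latency-test", workload="short-prompts"),
    Benchmark(target="model-b", experiment="latency-test", workload="short-prompts"),
]
```

The point of the separation is visible in the catalog: the experiment identifier is shared, while the target varies per benchmark.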

1.2 Event & Orchestration Broker

GitHub acts as the primary interface and event broker for the system. It captures user intent and system state changes (e.g., deployments or releases), routing these events to the underlying execution infrastructure. It acts as the technical bridge between human operations and the execution engine.

1.3 Execution and Orchestration Engine

The execution architecture relies on Ray and ado. ado uses Ray to execute both single benchmark instances and parameter sweeps. Thanks to ado's data recording capabilities, if one instance in a sweep fails, ado continues orchestration and commits the successful results to the database. Ray allows the underlying experiments to explicitly request hardware resources via task decorators (e.g., @ray.remote(num_gpus=1)). Ray can also create per-task execution environments, so experiments whose requirements are incompatible with ado-core or with other experiments can still execute.
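
The failure-isolation behavior of a sweep can be illustrated with a small stand-in. This sketch uses a standard-library thread pool in place of ado and Ray, and the experiment function, parameter names, and result shape are hypothetical; it only shows the pattern of keeping successful results when some instances fail.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def run_instance(params):
    """Hypothetical experiment that fails for one parameter value."""
    if params["batch_size"] == 0:
        raise ValueError("batch_size must be positive")
    return {"throughput": params["batch_size"] * params["replicas"]}


def run_sweep(space, experiment):
    """Run every point in the sweep; keep successes and record failures."""
    keys = sorted(space)
    points = [dict(zip(keys, values)) for values in product(*(space[k] for k in keys))]
    results, failures = [], []
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [(p, pool.submit(experiment, p)) for p in points]
        for params, future in futures:
            try:
                results.append({"params": params, "result": future.result()})
            except Exception as exc:  # a failed instance must not abort the sweep
                failures.append({"params": params, "error": repr(exc)})
    return results, failures


# 3 x 2 = 6 points; the two points with batch_size == 0 fail, the other four succeed.
results, failures = run_sweep({"batch_size": [0, 8, 16], "replicas": [1, 2]}, run_instance)
```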

1.4 Centralized Data & Discovery

The architecture uses ado's distributed project capabilities to store data, enforcing a uniform schema for results and custom metadata dictionaries. Furthermore, ado automatically registers available experiments when they are installed into the environment, and provides built-in commands to list and discover them.
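
As an illustration of what a uniform result schema with free-form metadata could look like, here is a minimal sketch. It is not ado's actual schema; the record class, its fields, and the example values are assumptions made for this example.

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Dict


@dataclass
class ResultRecord:
    """Hypothetical uniform result record with an open metadata dictionary."""
    experiment: str
    parameters: Dict[str, Any]
    measurements: Dict[str, float]
    metadata: Dict[str, Any] = field(default_factory=dict)

    def __post_init__(self):
        if not self.experiment:
            raise ValueError("experiment name is required")
        for name, value in self.measurements.items():
            # bool is a subclass of int, so reject it explicitly
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                raise TypeError(f"measurement {name!r} must be numeric")


record = ResultRecord(
    experiment="latency-test",
    parameters={"batch_size": 8},
    measurements={"p50_ms": 12.5},
    metadata={"notes": "warm cache"},  # custom, schema-free metadata
)
row = asdict(record)  # plain dict, ready to be stored
```

The core fields are validated, while the metadata dictionary is passed through untouched, which mirrors the split between a common schema and custom metadata.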

System Architecture Requirements Matching

Requirement | Name | Fulfillment Strategy | Component | Proposed Solution
REQ 1.1 | Input/Output Specification | Technology | ado core | ado defines a standard programmatic input/output schema for benchmark experiments.
REQ 1.2 | Python Package | Technology | ado core | ado experiments are written purely in Python and distributed as standard packages.
REQ 1.5 | Lifecycle Management | Technology | ado core | ado natively provides a flag for experiments to mark deprecation.
REQ 2.2 | Benchmark Experiment Discovery | Technology | nexus | The nexus CLI provides built-in commands to list all registered experiments.
REQ 2.4 | Benchmark Discovery | Technology | nexus | The nexus CLI will enable listing all benchmarks defined in all packages (or a specific package or for a specific model).
REQ 3.3 | Benchmark Experiment Reuse | Technology | ado + nexus | Once registered as described in REQ 2.1, experiments can be universally referenced across projects.
REQ 4.1 | Single & Sweep Execution | Technology | Ray + ado | ado provides the capability to execute single experiment instances and parameter sweeps.
REQ 4.2 | Resource Specification | Technology | Ray | Ray allows a benchmark experiment to make explicit hardware resource requests.
REQ 4.4 | Result Capture | Technology | ado DB | ado commits successful benchmark results even if parallel instances fail.
REQ 4.5 | Standardized Error Reporting | Technology | ado core | Handled natively via standard Python error handling and custom ado return payloads.
REQ 4.8 | Local Execution | Technology | ado core | ado supports local execution for rapid prototyping on local compute.
REQ 5.1 | Centralized Results Storage | Technology | ado DB | ado provides centralized remote results storage.
REQ 5.2 | Common Results Schema | Technology | ado core | ado enforces a uniform, structured schema for all stored results.
REQ 5.3 | Custom Metadata Support | Technology | ado DB | ado supports returning custom metadata dicts alongside core results.

2. Operational Architecture (Workflows & Infrastructure)

This pillar details how the system is deployed, maintained, triggered, and scaled by the administrative team and CI/CD pipelines.

2.1 Infrastructure Configuration

Admins configure the Ray cluster on K8s via KubeRay, with hard namespace limits to maintain resource quotas during massive sweeps. To optimize performance, the underlying cluster mounts a shared persistent filesystem (via PVC) for workload dataset caching. Ray dynamically isolates worker node environments to prevent dependency version clashes between concurrent evaluations.
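
Per-task isolation and resource requests in Ray are expressed as task options. The sketch below only builds the options dictionary, so it runs without Ray installed; the helper name and the example package pin are hypothetical, while `num_cpus`, `num_gpus`, and the `runtime_env` `pip` field are standard Ray task options.

```python
def ray_task_options(num_gpus=0, num_cpus=1, pip_packages=None):
    """Build keyword arguments suitable for ray.remote(...) or .options(...)."""
    options = {"num_cpus": num_cpus, "num_gpus": num_gpus}
    if pip_packages:
        # Ray installs these packages into an isolated environment for the task
        options["runtime_env"] = {"pip": list(pip_packages)}
    return options


# Hypothetical experiment needing one GPU and its own pinned dependency.
opts = ray_task_options(num_gpus=1, pip_packages=["example-lib==1.2.0"])

# With Ray installed, the options could be applied as:
#   @ray.remote(**opts)
#   def run_experiment(params): ...
```

Keeping the options as data makes it straightforward for admins to cap them against cluster quotas before a task is submitted.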

2.2 Orchestration Triggers & Automation

Centralized administrative evaluations are triggered through GitHub, either automatically by GitHub events (such as code deployments or releases) or on demand via GitHub ChatOps. They are executed with a combination of GitHub Actions (on event or on schedule) and polling runners. Global orchestration across multiple packages uses ado's native search space semantics.
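
An on-demand ChatOps trigger ultimately reduces to parsing a comment into a structured request. The following is a minimal sketch under the assumption of a `/benchmark <action> <targets...>` command syntax; that syntax, the action names, and the function are hypothetical and not defined by this proposal.

```python
import shlex

KNOWN_ACTIONS = {"run", "sweep"}  # hypothetical command vocabulary


def parse_chatops(comment, prefix="/benchmark"):
    """Parse the first line of a comment into an action dict, or return None."""
    stripped = comment.strip()
    if not stripped:
        return None
    try:
        tokens = shlex.split(stripped.splitlines()[0])
    except ValueError:  # e.g. unbalanced quotes
        return None
    if len(tokens) < 2 or tokens[0] != prefix or tokens[1] not in KNOWN_ACTIONS:
        return None
    return {"action": tokens[1], "targets": tokens[2:]}
```

Comments that are not commands, or that use an unknown action, return `None` and can simply be ignored by the workflow.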

Operational Requirements Matching

Requirement | Name | Fulfillment Strategy | Component | Proposed Solution
REQ 2.1 | Benchmark Experiment Package Registration | Technology + Process | nexus | Registering a benchmark experiment package involves adding metadata describing the package and the experiments it contains to a Nexus package, as well as validating that all referenced packages can be installed together. Packages provided via any mechanism outlined in REQ 3.2 can be registered. [PENDING: Nexus Test Dependencies Handling]
REQ 2.3 | Benchmark Registration | Technology + Process | nexus + ado | Registering a benchmark (see REQ 3.1 for specification) involves adding the relevant files and metadata to a nexus package model directory, and those files passing ado+nexus validation. [PENDING: Nexus Model Benchmark Specification Decision]
REQ 4.3 | Resource Limits | Technology + Process | Ray Cluster | Admins configure Ray clusters to set hard quotas per instance.
REQ 4.6 | Logging | Technology + Process | Ray Cluster | Admins configure infrastructure to persist logs without indefinite retention.
REQ 6.2 | Isolated Execution | Technology + Process | ado + Ray Runtime | Users can describe the benchmark experiment dependencies in the benchmark experiment package using ado + Ray semantics. Ray will dynamically create isolated virtual environments per worker.
REQ 6.3 | Persistent Filesystem | Technology + Process | Ray / K8s | Admins configure the cluster to mount a shared PVC for dataset caching.
REQ 7.2 | Admin-Triggered Evaluation Execution | Technology + Process | GitHub | Triggered via automated GitHub events or on-demand via GitHub ChatOps.

3. Governance & Conventions (Policies & Standards)

This pillar outlines the human-in-the-loop requirements, conventions, and security policies that contributors must adhere to in order for the technical and operational systems to function correctly.

3.1 Trust and Security Model

Nexus relies on an organizational trust model. Only authorized IBMers can submit code. To enforce security, all packages undergo mandatory standard CI/CD CVE scans before they are allowed into the execution environment.

3.2 Packaging and Versioning Conventions

While ado provides the mechanism for versioning and reproducibility, contributors are bound by strict conventions to ensure uniqueness and reliability.

  • Reproducibility Contract: Contributors must adhere to the convention that an experiment name plus specific parameter values defines a unique, repeatable execution. Repeatable here means that the experiment instance uses an identical process, not that it produces the same result, as experiments can be stochastic.
  • Versioning: ado provides mechanisms for experiment versioning but does not prescribe one. The main convention with respect to experiment versioning is that whatever mechanism is chosen must uphold the Reproducibility Contract.
  • Data Handling Guidelines: Workload data must either be bundled directly inside the benchmark experiment package or programmed to download dynamically at execution time.
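
The Reproducibility Contract says that an experiment name plus parameter values identifies one repeatable execution. One way to picture this is a deterministic identifier derived from exactly those two inputs. This is a sketch only: the function and the hashing scheme are assumptions for illustration, not how ado identifies executions.

```python
import hashlib
import json


def execution_id(experiment, parameters):
    """Derive a stable identifier from an experiment name and its parameters."""
    canonical = json.dumps(
        {"experiment": experiment, "parameters": parameters},
        sort_keys=True,          # key order must not affect the identity
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


# Same name and values give the same identity, regardless of key order.
id_a = execution_id("latency-test", {"batch_size": 8, "replicas": 2})
id_b = execution_id("latency-test", {"replicas": 2, "batch_size": 8})
id_c = execution_id("latency-test", {"batch_size": 16, "replicas": 2})
```

Note that the identifier covers the process, not the outcome: two runs with the same identity may still report different measurements if the experiment is stochastic.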

3.3 Governance of Sweeps

Because parameter sweeps are computationally expensive, they must undergo particular scrutiny, with admins retaining manual and automated review oversight. Sweep configurations must pass GitHub PR approvals prior to being submitted to the Ray cluster for execution.
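
An automated review aid could start from the size of the sweep, which is the product of the number of values per parameter. The sketch below assumes a sweep expressed as a mapping from parameter name to candidate values; the function names and the threshold are hypothetical, and any real limit would be an admin policy decision.

```python
from math import prod


def sweep_size(space):
    """Number of experiment instances in a full Cartesian sweep."""
    return prod(len(values) for values in space.values())


def needs_admin_review(space, max_instances=50):
    """Flag sweeps larger than a (hypothetical) policy threshold."""
    return sweep_size(space) > max_instances


space = {"batch_size": [1, 8, 16, 32], "replicas": [1, 2, 4], "precision": ["fp16", "fp32"]}
size = sweep_size(space)  # 4 * 3 * 2 instances
```

A check of this kind could run in the PR workflow so that reviewers see the cost of a sweep before approving it.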

Governance Requirements Matching

Requirement | Name | Fulfillment Strategy | Component | Proposed Solution
REQ 1.3 | Versioning | Technology + Convention | ado + nexus | Users leverage ado capabilities to specify versions while adhering to semantic naming standards. [PENDING: Versioning Semantics Decision]
REQ 1.4 | Reproducible Execution | Technology + Convention | ado + nexus | Users must adhere to ado's convention that a given experiment name encodes a unique, repeatable experiment.
REQ 1.7 | Required Data | Technology + Convention | ado | Developers bundle data with benchmark experiment packages or the experiment downloads it dynamically.
REQ 4.7 | Self-Contained Execution | Technology + Convention | ado | As REQ 1.7
REQ 3.1 | Benchmark Specification | Technology + Process | ado config | Users specify benchmarks by creating an ado config that binds an experiment to a workload.
REQ 3.2 | Providing Benchmark Experiments | Technology + Process | ado + nexus | Benchmark experiment packages (following Standardized Benchmarking Packaging Protocol) can be provided in a Nexus package in the Algorithm Nexus repo, on PyPI or GitHub.
REQ 6.1 | Admin Security | Process | CI | Secured via trusted code submissions and mandatory CVE scans.
REQ 7.1 | Nexus-Level Benchmarks | Technology + Process | ado + nexus | These are benchmarks defined independently using ado configuration semantics and stored in the nexus repository. [PENDING: Nexus Repo Layout Decision]
REQ 7.3 | Sweep Review and Approval | Process | GitHub PRs | Admins retain review oversight of sweep configurations via GitHub PR workflows.

Open Questions

The following questions/decisions are open and can be resolved in subsequent issues.

  • Versioning Semantics Decision for REQ 1.3
    • Rules and conventions for versioning benchmark experiments
  • Nexus Test Dependencies Handling for REQ 2.1
    • The process for validating that the benchmark packages referenced by a nexus package can be installed together
  • Nexus Model Benchmark Specification Decision for REQ 2.3
    • Exact YAML metadata and directory structure used to add a Nexus Model benchmark specification
  • Nexus Repo Layout Decision for REQ 7.1
    • Exact YAML metadata and directory structure used to add a Nexus benchmark specification

How Nexus package developers will use the system

Contributing a benchmark experiment

Developers write and package the experiment according to the standardized packaging protocol (REQ 3.2), i.e. as an ado custom experiment or as actuator+experiments. They then publish the package on GitHub, on PyPI, or in the Algorithm Nexus repo (REQ 3.2).

Defining the benchmark experiment packages used by a nexus package

Nexus package owners register the benchmark experiment packages, and the experiments they want to use, by referencing them in their nexus package's metadata (REQ 2.1) and validating that all the benchmark experiments the package needs can be installed together.

Defining a benchmark to use for a model

First, developers can:

  • use nexus CLI and ado CLI to discover existing benchmark experiments (REQ 2.2)
  • use nexus CLI to discover existing benchmark specifications (REQ 2.4)

They then define their benchmark using an ado configuration (REQ 3.1) and add it to the model directory of the relevant nexus package (REQ 2.3). The benchmark configuration can reference any benchmark experiment registered by the nexus package. If the benchmark experiment they need is not registered by the nexus package, they can add it. The benchmark configuration can also be based on one discovered via the nexus CLI.
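
The rule that a benchmark may only reference experiments registered by its nexus package can be checked mechanically. The following sketch assumes benchmark configurations are available as dictionaries with an `experiment` key and that the package's registered experiments are a list of names; both shapes and all names are hypothetical, since the actual metadata layout is still an open decision.

```python
def unregistered_experiments(benchmark_configs, registered):
    """Return experiment names referenced by benchmarks but not registered."""
    referenced = {cfg["experiment"] for cfg in benchmark_configs}
    return sorted(referenced - set(registered))


registered = ["latency-test", "accuracy-test"]
configs = [
    {"experiment": "latency-test", "workload": "short-prompts"},
    {"experiment": "memory-test", "workload": "long-prompts"},
]

# "memory-test" must be registered by the package before this benchmark is valid.
missing = unregistered_experiments(configs, registered)
```

An empty result means every benchmark in the model directory refers to a registered experiment; a non-empty result lists what the package owner still needs to register.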