Running `ado` on remote Ray clusters

Overview

Running ado on a remote Ray cluster enables long-running operations that can use multiple nodes and large amounts of compute resources such as GPUs. Some experiments or actuators may also require such resources.

The --remote option automates the steps required to dispatch any ado command to a remote Ray cluster. It handles packaging files, building plugin wheels, generating the Ray runtime environment, and running ray job submit for you.

Prerequisites

Only remote project contexts are supported

The project context must be remote because it has to be accessible when ado executes on the remote Ray cluster. ado will fail with a clear error if a SQLite context is detected.

Cluster login

If your cluster requires a port-forward, oc (OpenShift CLI) or kubectl must be installed, and you must be logged in to the cluster.

Defining a remote execution context

A YAML configuration file defines the remote execution environment: where the cluster is, which packages to install, and which environment variables to set. This guide calls the file remote_context.yaml, but it can have any name. You can keep several such files, for different remote clusters or for different environments on the same cluster.

The minimal example uses a Ray cluster that is directly reachable at a known URL:

executionType:
  type: cluster
  clusterUrl: "http://ray-cluster.my-namespace.svc.cluster.local:8265"
packages:
  fromPyPI:
    - ado-core
    - ado-ray-tune # Add any other plugins required by your operation
envVars:
  PYTHONUNBUFFERED: "x"
  OMP_NUM_THREADS: "1"
  OPENBLAS_NUM_THREADS: "1"
  RAY_AIR_NEW_PERSISTENCE_MODE: "0"
wait: false # Set to true to remain attached until the job finishes

If your cluster is only reachable via a port-forward (common on OpenShift), add the portForward sub-field. ado will start the port-forward automatically before submitting and tear it down after:

executionType:
  type: cluster
  clusterUrl: "http://localhost:8265" # Must match localPort below
  portForward:
    namespace: my-namespace
    serviceName: my-ray-cluster-head-svc
    localPort: 8265 # Default; the port oc/kubectl will bind locally
packages:
  fromPyPI:
    - ado-core
    - ado-ray-tune
envVars:
  PYTHONUNBUFFERED: "x"
  OMP_NUM_THREADS: "1"
wait: false
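
The comment on clusterUrl in the port-forward example says the URL must match localPort. As a minimal sketch of that consistency rule, the check below compares the two on an already-parsed executionType mapping. The function name is hypothetical and this is not ado's own validation; the fallback to 8265 follows the default stated in the example.

```python
from typing import Optional
from urllib.parse import urlparse

DEFAULT_LOCAL_PORT = 8265  # default localPort, as stated in the example above


def port_forward_mismatch(execution_type: dict) -> Optional[str]:
    """Return a description of the mismatch, or None if the block is consistent.

    Hypothetical helper: illustrates the "clusterUrl must match localPort"
    rule only; ado may validate this differently (or not at all).
    """
    port_forward = execution_type.get("portForward")
    if port_forward is None:
        return None  # directly reachable cluster: nothing to cross-check
    url_port = urlparse(execution_type["clusterUrl"]).port
    local_port = port_forward.get("localPort", DEFAULT_LOCAL_PORT)
    if url_port != local_port:
        return f"clusterUrl port is {url_port}, but localPort is {local_port}"
    return None
```

Running a check like this before submitting can catch a stale clusterUrl after localPort has been changed.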

Runtime environment setup options

When many jobs start at once (for example during cluster autoscaling or when running many concurrent measurements), Ray may need extra time to install each job's runtime environment on workers. Use the optional runtimeEnv block to tune Ray's runtime-env behaviour.

setupTimeoutSeconds (default 600): maximum seconds to create the runtime environment on a worker. Use -1 to disable the timeout.
eagerInstall (default true): if true, install the runtime environment on a worker when the job starts; if false, install lazily when the first task runs.

executionType:
  type: cluster
  clusterUrl: "http://ray-cluster.my-namespace.svc.cluster.local:8265"
packages:
  fromPyPI:
    - ado-core
runtimeEnv:
  setupTimeoutSeconds: 1200
  eagerInstall: false
envVars:
  PYTHONUNBUFFERED: "x"
wait: false

If runtimeEnv is omitted, the defaults are used. These are the same as the Ray defaults.
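
To make the defaults concrete, here is a minimal sketch of how the runtimeEnv block could map onto the config section of a Ray runtime environment. setup_timeout_seconds and eager_install are Ray's own field names; the function name and the mapping itself are assumptions about what ado does internally.

```python
from typing import Optional


def ray_runtime_env_config(runtime_env_block: Optional[dict] = None) -> dict:
    """Translate ado's optional runtimeEnv block into Ray's runtime_env "config".

    Assumed mapping, for illustration only. The fallback values are the
    defaults documented above (600 seconds, eager install enabled).
    """
    block = runtime_env_block or {}
    return {
        "setup_timeout_seconds": block.get("setupTimeoutSeconds", 600),
        "eager_install": block.get("eagerInstall", True),
    }
```

With the example above, the result would be a 1200 second timeout with lazy installation; with no runtimeEnv block, the values fall back to 600 and true.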

Submitting commands

Ray version mismatch errors

You may encounter an error like:

RuntimeError: Changing the ray version is not allowed:
  current version: 2.54.0,   expect version: 2.52.1

This means the Ray version installed in your cluster differs from the version that will be installed by your dependencies. To resolve this, explicitly pin the Ray version in your fromPyPI section to match the cluster's version:

packages:
  fromPyPI:
    - ado-core
    - ray==2.52.1 # Match the cluster's Ray version
    - ado-ray-tune
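
If you hit this error often, the pin can be derived from the message itself. The sketch below is a hypothetical helper (not part of ado or Ray) that extracts the "expect version" value, which in the example above is the cluster's Ray version, and formats it as a fromPyPI entry. It assumes the message has the shape shown above.

```python
import re
from typing import Optional

# Matches the "expect version: X.Y.Z" fragment of the error shown above.
_EXPECT_VERSION = re.compile(r"expect version:\s*([0-9][^\s,]*)")


def ray_pin_from_error(message: str) -> Optional[str]:
    """Return a 'ray==<version>' requirement taken from the error text, if any."""
    match = _EXPECT_VERSION.search(message)
    return f"ray=={match.group(1)}" if match else None


error_text = (
    "RuntimeError: Changing the ray version is not allowed:\n"
    "  current version: 2.54.0,   expect version: 2.52.1"
)
suggested_pin = ray_pin_from_error(error_text)  # "ray==2.52.1"
```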

Pass --remote as a global option before any ado command.

By default, ado will use the current active context as the context for the remote command.

ado --remote remote_context.yaml create operation -f operation.yaml

All ado commands are supported. For example, to query the metastore remotely:

ado --remote remote_context.yaml get space

You can also supply a project context directly using -c:

ado -c mysql_project.yaml --remote remote_context.yaml create operation -f operation.yaml

What --remote does

For each invocation ado will:

Copy the project context file and any -f resource files to a temporary working directory.
Build wheels for any fromSource plugin paths.
Generate a runtime_env.yaml from the packages, envVars, and optional runtimeEnv fields.
Start a port-forward if portForward is configured.
Run ray job submit with the assembled working directory and runtime environment.
Tear down the port-forward (if started) and exit with the job's exit code.
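
As a rough illustration of the submission step, the sketch below assembles a ray job submit command line from the pieces produced by the earlier steps. The --address, --working-dir, --runtime-env and --no-wait options are real options of the Ray CLI; the function name, the argument layout, and the mapping of wait: false to --no-wait are assumptions, not a description of the exact command ado runs.

```python
import shlex
from typing import List, Sequence


def ray_job_submit_command(
    cluster_url: str,
    working_dir: str,
    runtime_env_file: str,
    ado_args: Sequence[str],
    wait: bool = False,
) -> List[str]:
    """Build an argv list for `ray job submit` (illustrative sketch only)."""
    cmd = [
        "ray", "job", "submit",
        "--address", cluster_url,           # clusterUrl from the remote context
        "--working-dir", working_dir,       # temporary directory with copied files
        "--runtime-env", runtime_env_file,  # generated runtime_env.yaml
    ]
    if not wait:
        cmd.append("--no-wait")  # assumed equivalent of `wait: false`
    # Everything after "--" is the entrypoint executed on the cluster.
    cmd += ["--", "ado", *ado_args]
    return cmd


example = ray_job_submit_command(
    "http://localhost:8265", "/tmp/ado-remote", "/tmp/ado-remote/runtime_env.yaml",
    ["get", "space"],
)
print(shlex.join(example))
```

Building the command as a list rather than a single string avoids shell-quoting problems when it is later passed to subprocess.run.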

Installing Python packages on a remote Ray cluster

When executing on a remote Ray cluster you often need to install additional packages, either from PyPI or from local source. There are three methods available:

Pre-installing: Best when you constantly use the same actuators and operators
Dynamic installation from PyPI: Best in the general case
Dynamic installation from source: Best for developers

Ray python package caching

Ray caches packages it is asked to install so they are only downloaded, and potentially built, the first time they are requested.

Pre-installing ado packages

In this method ado and the required plugins are already installed in the Ray cluster's base Python environment, i.e. in the image used for head and worker nodes.

In this case you do not need to specify any packages in your remote_context.yaml. This method adds no job-startup overhead from downloading or building Python packages.

Using additional plugins with pre-installed ado

If you need additional plugins, or different versions of pre-installed plugins, you must dynamically install ado-core and all the actuators you need. This is because:

The pre-installed ado command is tied to the base environment. It will not see packages installed for the job, so ado itself must be installed into the job's virtualenv.
The ado_actuators namespace package will be superseded by the one created in the job's virtualenv, so actuators installed under that namespace package in the base environment will not be seen.

Dynamic installation from PyPI

The recommended method is to specify ado-core and the PyPI package names of any required plugins in the packages.fromPyPI section of your remote_context.yaml.

Wheel paths and fromPyPI

Entries in fromPyPI that resolve to an existing .whl file on the machine running ado --remote will be transferred to the remote cluster. All other entries are forwarded unchanged to the cluster's uv install step. This includes paths that are not present on the submitting machine; these are interpreted as paths to wheels on the remote filesystem.
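
The rule above can be sketched as a small classifier. The function name is hypothetical, and resolving relative entries against a base directory is an assumption (the text above does not say what relative wheel paths are resolved against); only the "existing .whl file is transferred, everything else is forwarded" split comes from the description.

```python
import tempfile
from pathlib import Path
from typing import List, Tuple


def split_from_pypi(entries: List[str], base_dir: str = ".") -> Tuple[List[Path], List[str]]:
    """Split fromPyPI entries into local wheels to transfer and entries to forward."""
    local_wheels: List[Path] = []
    forwarded: List[str] = []
    for entry in entries:
        candidate = Path(entry)
        if not candidate.is_absolute():
            candidate = Path(base_dir) / candidate  # assumed resolution rule
        if entry.endswith(".whl") and candidate.is_file():
            local_wheels.append(candidate)  # exists locally: send to the cluster
        else:
            forwarded.append(entry)  # package name, or a wheel path on the remote side
    return local_wheels, forwarded


# Usage: one package name, one wheel that exists locally, one that does not.
with tempfile.TemporaryDirectory() as tmp:
    wheel = Path(tmp) / "my_plugin-0.1-py3-none-any.whl"  # hypothetical wheel
    wheel.touch()
    demo_local, demo_forwarded = split_from_pypi(
        ["ado-core", wheel.name, "/opt/wheels/only_on_cluster-1.0-py3-none-any.whl"],
        base_dir=tmp,
    )
```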

Dynamic installation from source

If you need to install plugins or packages from source, specify their paths in the packages.fromSource section of your remote_context.yaml. Note: if a path is relative, it is resolved against the directory from which you execute ado --remote ...

executionType:
  type: cluster
  clusterUrl: "http://localhost:8265"
packages:
  fromPyPI:
    - ado-core
  fromSource:
    - plugins/actuators/vllm_performance # Assumes ado --remote is executed from the root of the ado repo
wait: false
envVars:
  PYTHONUNBUFFERED: "x"

ado will then:

Build Python wheels for those packages
Instruct Ray to install the wheels as part of the Ray job submission

Sending additional files

If additional files need to be sent to the cluster, list them in the additionalFiles field of the remote execution context YAML. This may be required, for example, if an operator or actuator needs these files as input.

The paths can be absolute or relative. Relative paths are resolved with respect to the directory ado --remote [COMMAND] is executed from.

executionType:
  type: cluster
  clusterUrl: "http://localhost:8265"
packages:
  fromPyPI:
    - ado-core
  fromSource:
    - plugins/actuators/vllm_performance
wait: false
envVars:
  PYTHONUNBUFFERED: "x"
additionalFiles:
  - /absolute/path/to/data_file.csv
  - path/to/my_data_dir/ # directories are also supported
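
The path rule for additionalFiles can be sketched as follows. The function name and the launch directory are hypothetical; PurePosixPath is used only to keep the example deterministic, and real code would more likely use pathlib.Path together with the actual working directory.

```python
from pathlib import PurePosixPath
from typing import List


def resolve_additional_files(entries: List[str], launch_dir: str) -> List[PurePosixPath]:
    """Resolve additionalFiles entries the way the text above describes.

    Absolute paths are kept as-is; relative paths are joined onto the
    directory from which `ado --remote` was executed (launch_dir).
    """
    resolved = []
    for entry in entries:
        path = PurePosixPath(entry)
        resolved.append(path if path.is_absolute() else PurePosixPath(launch_dir) / path)
    return resolved


resolved_files = resolve_additional_files(
    ["/absolute/path/to/data_file.csv", "path/to/my_data_dir/"],
    launch_dir="/home/user/ado",  # hypothetical launch directory
)
```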

Using Ray’s uv run driver integration with `ado`

Ray’s uv run integration

Ray provides a native integration that allows uv run ... to function as an "environment-aware" driver launch. It automatically packages the working directory and applies uv-based runtime configurations directly to worker nodes, giving a built-in mechanism for handling dependencies and environments across a distributed cluster. For more details, see the Ray documentation on using uv for package management.

ADO default: the integration is disabled unless you opt in

The ado and run_experiment CLIs (including when invoked via uv run …) disable Ray's uv run integration by default. Unless the user has explicitly set RAY_ENABLE_UV_RUN_RUNTIME_ENV, ado sets it to 0 before importing Ray. This is done to avoid unintentionally packaging/uploading your entire current working directory during typical local development runs.
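
The opt-in behaviour described above amounts to setting a default for the environment variable before Ray is imported. The sketch below shows that pattern; the function name is hypothetical and this is not ado's actual source.

```python
import os
from typing import MutableMapping


def disable_uv_run_integration_by_default(
    environ: MutableMapping[str, str] = os.environ,
) -> str:
    """Set RAY_ENABLE_UV_RUN_RUNTIME_ENV=0 unless the user already chose a value.

    Returns the value that will be in effect. Must run before `import ray`,
    since Ray would otherwise see the variable as unset.
    """
    environ.setdefault("RAY_ENABLE_UV_RUN_RUNTIME_ENV", "0")
    return environ["RAY_ENABLE_UV_RUN_RUNTIME_ENV"]


# Called on plain dicts here so the example does not touch the real environment.
unset_case = disable_uv_run_integration_by_default({})
opt_in_case = disable_uv_run_integration_by_default({"RAY_ENABLE_UV_RUN_RUNTIME_ENV": "1"})
# import ray  # in a real CLI, the Ray import would only happen after this point
```

Using setdefault rather than a plain assignment is what preserves an explicit user setting, including an explicit 1.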

Enabling Ray’s `uv run` driver integration

If you'd like to use Ray’s uv run driver integration feature, set this in your shell before starting ado:

export RAY_ENABLE_UV_RUN_RUNTIME_ENV=1

To have the driver started by uv run connect to an existing Ray cluster, also set RAY_ADDRESS in the environment:

export RAY_ADDRESS=...
uv run ado create op ...
