
Running ado on remote Ray clusters

Overview

Running ado on a remote Ray cluster enables long-running operations that can use multiple nodes and large amounts of compute resources such as GPUs. Some experiments or actuators may also require such resources.

The --remote option automates the steps required to dispatch any ado command to a remote Ray cluster. It handles packaging files, building plugin wheels, generating the Ray runtime environment, and running ray job submit for you.

Prerequisites

Only remote project contexts are supported

The project context must be remote, because it has to be accessible when ado executes on the remote Ray cluster. ado fails with a clear error if it detects a SQLite context.

Cluster login

If your cluster requires a port-forward, oc (OpenShift CLI) or kubectl must be installed, and you must be logged in to the cluster.

Defining a remote execution context

The details of a remote execution environment (where it is, which packages to install, and which environment variables to set) are defined in a YAML configuration file. Here we call this file remote_context.yaml, but it can have any name. You can have multiple such files, for different remote clusters or for different environments on the same cluster.

The minimal example uses a Ray cluster that is directly reachable at a known URL:

executionType:
  type: cluster
  clusterUrl: "http://ray-cluster.my-namespace.svc.cluster.local:8265"
packages:
  fromPyPI:
    - ado-core
    - ado-ray-tune # Add any other plugins required by your operation
envVars:
  PYTHONUNBUFFERED: "x"
  OMP_NUM_THREADS: "1"
  OPENBLAS_NUM_THREADS: "1"
  RAY_AIR_NEW_PERSISTENCE_MODE: "0"
wait: false # Set to true to remain attached until the job finishes
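Ray's runtime environment expects env_vars values to be strings, which is why numeric values such as OMP_NUM_THREADS are quoted in the YAML above (an unquoted 1 would load as an integer). The sketch below mirrors the minimal configuration as a Python dict and checks a few of its fields. The helper check_remote_context is hypothetical and not part of ado; ado's own validation may check more, or different, things.

```python
from urllib.parse import urlparse

# Python mirror of the minimal remote_context.yaml shown above.
remote_context = {
    "executionType": {
        "type": "cluster",
        "clusterUrl": "http://ray-cluster.my-namespace.svc.cluster.local:8265",
    },
    "packages": {"fromPyPI": ["ado-core", "ado-ray-tune"]},
    "envVars": {
        "PYTHONUNBUFFERED": "x",
        "OMP_NUM_THREADS": "1",
        "OPENBLAS_NUM_THREADS": "1",
        "RAY_AIR_NEW_PERSISTENCE_MODE": "0",
    },
    "wait": False,
}


def check_remote_context(ctx: dict) -> list:
    """Hypothetical helper: return a list of problems found in a remote context."""
    problems = []
    execution = ctx.get("executionType", {})
    if execution.get("type") != "cluster":
        problems.append("executionType.type should be 'cluster'")
    url = urlparse(execution.get("clusterUrl", ""))
    if url.scheme not in ("http", "https") or not url.hostname:
        problems.append("executionType.clusterUrl should be an http(s) URL")
    # Unquoted YAML numbers load as int; Ray expects env var values to be strings.
    for name, value in ctx.get("envVars", {}).items():
        if not isinstance(value, str):
            problems.append(f"envVars.{name} should be a string, got {type(value).__name__}")
    if not isinstance(ctx.get("wait", False), bool):
        problems.append("wait should be true or false")
    return problems


print(check_remote_context(remote_context))  # prints []
```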

If your cluster is only reachable via a port-forward (common on OpenShift), add the portForward sub-field. ado will start the port-forward automatically before submitting and tear it down afterwards:

executionType:
  type: cluster
  clusterUrl: "http://localhost:8265" # Must match localPort below
  portForward:
    namespace: my-namespace
    serviceName: my-ray-cluster-head-svc
    localPort: 8265 # Default; the port oc/kubectl will bind locally
packages:
  fromPyPI:
    - ado-core
    - ado-ray-tune
envVars:
  PYTHONUNBUFFERED: "x"
  OMP_NUM_THREADS: "1"
wait: false
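The portForward fields map naturally onto an oc or kubectl port-forward invocation, and the rule in the comment above (the port in clusterUrl must equal localPort) is easy to check mechanically. This is a sketch under assumptions: port_forward_command is a hypothetical helper, and the remote port 8265 (Ray's default dashboard and job-submission port) is assumed; the exact command ado runs may differ.

```python
from urllib.parse import urlparse

RAY_DASHBOARD_PORT = 8265  # Ray's default dashboard / job-submission port (assumed remote port)


def port_forward_command(execution: dict, cli: str = "kubectl") -> list:
    """Hypothetical: build the port-forward command implied by executionType.portForward."""
    pf = execution["portForward"]
    local_port = pf.get("localPort", 8265)  # 8265 is the documented default
    url_port = urlparse(execution["clusterUrl"]).port
    if url_port != local_port:
        raise ValueError(f"clusterUrl port {url_port} does not match localPort {local_port}")
    return [
        cli, "port-forward",
        "-n", pf["namespace"],
        f"svc/{pf['serviceName']}",
        f"{local_port}:{RAY_DASHBOARD_PORT}",
    ]


execution = {
    "type": "cluster",
    "clusterUrl": "http://localhost:8265",
    "portForward": {
        "namespace": "my-namespace",
        "serviceName": "my-ray-cluster-head-svc",
        "localPort": 8265,
    },
}
print(" ".join(port_forward_command(execution)))
# prints: kubectl port-forward -n my-namespace svc/my-ray-cluster-head-svc 8265:8265
```

Passing cli="oc" produces the equivalent OpenShift command, since oc port-forward accepts the same arguments.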

Submitting commands

Ray version mismatch errors

If you encounter an error like:

RuntimeError: Changing the ray version is not allowed:
  current version: 2.54.0,   expect version: 2.52.1

This means the Ray version installed on your cluster differs from the version your dependencies would install. To resolve this, explicitly pin the Ray version in your fromPyPI section to match the cluster's version:

packages:
  fromPyPI:
    - ado-core
    - ray==2.52.1 # Match the cluster's Ray version
    - ado-ray-tune
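If you maintain several remote context files, pinning can be automated. The sketch below is a hypothetical helper, not an ado feature: it rewrites a fromPyPI list so that ray is pinned to a given version, replacing any existing ray entry (keeping extras such as ray[tune]) and appending a pin if none exists. As in the example above, where the cluster runs 2.52.1, the version to pin is the cluster's.

```python
import re

# Matches "ray", "ray==2.54.0", "ray[tune]>=2.0", but not "ray-tune" or "raydp".
_RAY_ENTRY = re.compile(r"^ray(\[[^\]]*\])?\s*(?:[<>=!~].*)?$")


def pin_ray_version(from_pypi: list, cluster_version: str) -> list:
    """Hypothetical helper: return fromPyPI entries with ray pinned to cluster_version."""
    pinned = []
    found = False
    for entry in from_pypi:
        match = _RAY_ENTRY.match(entry.strip())
        if match:
            extras = match.group(1) or ""
            pinned.append(f"ray{extras}=={cluster_version}")
            found = True
        else:
            pinned.append(entry)
    if not found:
        pinned.append(f"ray=={cluster_version}")
    return pinned


print(pin_ray_version(["ado-core", "ado-ray-tune"], "2.52.1"))
# prints ['ado-core', 'ado-ray-tune', 'ray==2.52.1']
```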

Pass --remote as a global option before any ado command.

By default, ado uses the currently active context as the context for the remote command.

ado --remote remote_context.yaml create operation -f operation.yaml

All ado commands are supported. For example, to query the metastore remotely:

ado --remote remote_context.yaml get space

You can also supply a project context directly using -c:

ado -c mysql_project.yaml --remote remote_context.yaml create operation -f operation.yaml

What --remote does

For each invocation ado will:

  1. Copy the project context file and any -f resource files to a temporary working directory.
  2. Build wheels for any fromSource plugin paths.
  3. Generate a runtime_env.yaml from the packages and envVars fields.
  4. Start a port-forward if portForward is configured.
  5. Run ray job submit with the assembled working directory and runtime environment.
  6. Tear down the port-forward (if started) and exit with the job's exit code.
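Steps 3 and 5 can be sketched in Python. This is a minimal illustration, not ado's implementation: the function names are hypothetical, writing the runtime environment as JSON (which is valid YAML) and using Ray's uv runtime_env field are assumptions, and mapping wait: false to ray job submit --no-wait is an assumption about how ado handles that setting. The entrypoint ado -c context.yaml get space, with context.yaml standing in for the copied project context, is likewise illustrative.

```python
import json
import shlex
import tempfile
from pathlib import Path


def write_runtime_env(packages: list, env_vars: dict, path: Path) -> Path:
    """Hypothetical sketch of step 3: write a runtime env file for ray job submit.

    JSON is valid YAML, so the file can be passed as runtime_env.yaml.
    Using the 'uv' runtime_env field (rather than 'pip') is an assumption.
    """
    runtime_env = {"uv": list(packages), "env_vars": dict(env_vars)}
    path.write_text(json.dumps(runtime_env, indent=2))
    return path


def job_submit_command(address: str, working_dir: Path, runtime_env_file: Path,
                       entrypoint: list, wait: bool) -> list:
    """Hypothetical sketch of step 5: assemble the ray job submit invocation."""
    cmd = ["ray", "job", "submit", "--address", address,
           "--working-dir", str(working_dir),
           "--runtime-env", str(runtime_env_file)]
    if not wait:
        cmd.append("--no-wait")  # return once submitted instead of streaming logs
    return cmd + ["--"] + entrypoint


with tempfile.TemporaryDirectory() as tmp:
    work = Path(tmp)
    env_file = write_runtime_env(["ado-core", "ado-ray-tune"],
                                 {"PYTHONUNBUFFERED": "x"},
                                 work / "runtime_env.yaml")
    cmd = job_submit_command("http://localhost:8265", work, env_file,
                             ["ado", "-c", "context.yaml", "get", "space"], wait=False)
    print(shlex.join(cmd))
```

Keeping the command as a list rather than a single string avoids shell-quoting problems when it is later passed to subprocess.run.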

Installing python packages on a remote Ray cluster

When executing on a remote Ray cluster, you often need to install additional packages, either from PyPI or from local source trees. There are three methods available:

Ray python package caching

Ray caches the packages it is asked to install, so they are downloaded (and potentially built) only the first time they are requested.

Pre-installing ado packages

In this method, ado and the required plugins are already installed in the Ray cluster's base Python environment, i.e. in the image used for the head and worker nodes.

In this case you do not need to specify any packages in your remote_context.yaml. This method avoids any job-start overhead from downloading or building Python packages.

Using additional plugins with pre-installed ado

If you need additional plugins, or different versions of pre-installed plugins, you must dynamically install ado-core and all the actuators you need. This is because:

  • The pre-installed ado command is tied to the base environment
  • It will not see packages installed into the job's virtualenv, so ado-core itself must also be installed there
  • The ado_actuators namespace package will be superseded by the one created in the job's virtualenv
  • Actuators in the same namespace package in the base environment will therefore not be seen
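To check which environment the ado_actuators namespace package is actually resolved from, a small diagnostic can be run as a Ray job entrypoint or from a Python shell on a cluster node. This is a sketch using only the standard library; the helper name is hypothetical. It lists the directories that contribute to the namespace package in the current interpreter and returns an empty list if the package is not installed.

```python
import importlib.util


def ado_actuators_locations() -> list:
    """Hypothetical diagnostic: directories that make up the ado_actuators namespace package."""
    spec = importlib.util.find_spec("ado_actuators")
    if spec is None or spec.submodule_search_locations is None:
        return []
    return [str(location) for location in spec.submodule_search_locations]


for location in ado_actuators_locations():
    print(location)
```

If only directories inside the base environment appear when the job is expected to use its own virtualenv (or vice versa), the installation method described above is not taking effect.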

Dynamic installation from PyPI

The recommended method is to list ado-core and the PyPI package names of any required plugins in the packages.fromPyPI section of your remote_context.yaml.

Wheel paths and fromPyPI

Entries in fromPyPI that resolve to an existing .whl file on the machine running ado --remote are transferred to the remote cluster. All other entries are forwarded unchanged to the cluster's uv install step. This includes paths that do not exist on the submitting machine; these are interpreted as paths to wheels on the remote filesystem.
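The rule above can be sketched as a simple partition of the fromPyPI list. This is a minimal illustration, not ado's code: split_from_pypi is a hypothetical name, and treating "resolves to an existing .whl file" as "ends in .whl and is a file" is an assumption. Relative paths are checked against the current directory, which matches running ado --remote from that directory.

```python
import os
import tempfile


def split_from_pypi(entries: list) -> tuple:
    """Hypothetical sketch: return (local wheels to upload, entries forwarded unchanged)."""
    to_upload, forwarded = [], []
    for entry in entries:
        if entry.endswith(".whl") and os.path.isfile(entry):
            to_upload.append(entry)  # exists locally: ship it with the job
        else:
            forwarded.append(entry)  # package spec or a wheel path on the cluster
    return to_upload, forwarded


with tempfile.TemporaryDirectory() as tmp:
    local_wheel = os.path.join(tmp, "my_plugin-0.1.0-py3-none-any.whl")
    open(local_wheel, "w").close()  # empty placeholder standing in for a real wheel
    upload, forward = split_from_pypi(
        ["ado-core", local_wheel, "/remote/only/other_plugin-0.1.0-py3-none-any.whl"]
    )
    print(upload, forward)
```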

Dynamic installation from source

If you need to install plugins or packages from source, specify their paths in the packages.fromSource section of your remote_context.yaml. Note: relative paths are resolved against the directory from which you run ado --remote ...

executionType:
  type: cluster
  clusterUrl: "http://localhost:8265"
packages:
  fromPyPI:
    - ado-core
  fromSource:
    - plugins/actuators/vllm_performance # Assumes ado --remote is run from the root of the ado repo
wait: false
envVars:
  PYTHONUNBUFFERED: "x"

ado will then:

  1. Build Python wheels for those packages
  2. Instruct Ray to install the wheels as part of the Ray job submission
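The wheel-building step can be sketched as follows. This is an illustration under stated assumptions: pip wheel --no-deps is used as a stand-in build tool, and ado's actual build backend may differ; wheel_build_command and build_wheels are hypothetical names. Relative fromSource paths are resolved against the directory ado --remote is run from, as described above.

```python
import subprocess
import sys
from pathlib import Path
from typing import Optional


def wheel_build_command(source: str, out_dir: Path, cwd: Optional[Path] = None) -> list:
    """Hypothetical: command that builds a wheel for one fromSource entry."""
    src = Path(source)
    if not src.is_absolute():
        src = (cwd or Path.cwd()) / src  # resolve relative to where ado --remote runs
    # --no-deps: build only this package; dependencies are installed on the cluster.
    return [sys.executable, "-m", "pip", "wheel", "--no-deps",
            "--wheel-dir", str(out_dir), str(src)]


def build_wheels(sources: list, out_dir: Path) -> list:
    """Build every fromSource package and return the wheel files produced."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for source in sources:
        subprocess.run(wheel_build_command(source, out_dir), check=True)
    return sorted(out_dir.glob("*.whl"))


print(wheel_build_command("plugins/actuators/vllm_performance", Path("wheels")))
```

build_wheels actually invokes pip, so it should only be called with paths that contain a buildable Python project.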

Sending additional files

If additional files need to be sent to the cluster, list them in the additionalFiles field of the remote execution context YAML. This may be required, for example, if an operator or actuator needs these files as input.

The paths can be absolute or relative. If relative they are resolved with respect to the directory ado --remote [COMMAND] is executed from.

executionType:
  type: cluster
  clusterUrl: "http://localhost:8265"
packages:
  fromPyPI:
    - ado-core
  fromSource:
    - plugins/actuators/vllm_performance
wait: false
envVars:
  PYTHONUNBUFFERED: "x"
additionalFiles:
  - /absolute/path/to/data_file.csv
  - path/to/my_data_dir/ # directories are also supported
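Staging these entries into the job's working directory can be sketched with the standard library. This is a minimal sketch, not ado's implementation: stage_additional_files is a hypothetical name, and placing each file or directory at the top level of the working directory under its base name is an assumption about where ado puts them.

```python
import shutil
from pathlib import Path
from typing import Optional


def stage_additional_files(paths: list, working_dir: Path,
                           cwd: Optional[Path] = None) -> list:
    """Hypothetical sketch: copy additionalFiles entries into the job working directory."""
    base = cwd or Path.cwd()
    staged = []
    for entry in paths:
        src = Path(entry)  # Path() drops a trailing slash, so "dir/" has name "dir"
        if not src.is_absolute():
            src = base / src  # relative to where ado --remote is executed
        dest = working_dir / src.name
        if src.is_dir():
            shutil.copytree(src, dest)  # directories are copied recursively
        elif src.is_file():
            shutil.copy2(src, dest)  # copy2 also preserves file metadata
        else:
            raise FileNotFoundError(f"additionalFiles entry not found: {src}")
        staged.append(dest)
    return staged
```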

Using Ray’s uv run driver integration with ado

Ray’s uv run integration

Ray provides a native integration that allows uv run ... to function as an "environment-aware" driver launch. It automatically packages the working directory and applies uv-based runtime configurations directly to worker nodes, giving a built-in mechanism for consistent dependency and environment handling across a distributed cluster. For more details, see the Ray documentation on using uv for package management.

ado default: the integration is disabled unless you opt in

The ado and run_experiment CLIs (including when invoked via uv run …) disable Ray's uv run integration by default. Unless the user has explicitly set RAY_ENABLE_UV_RUN_RUNTIME_ENV, ado sets it to 0 before importing Ray. This is done to avoid unintentionally packaging/uploading your entire current working directory during typical local development runs.
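The default described above amounts to setting the variable only when the user has not set it. A minimal sketch of that behaviour (not ado's actual source; the function name is hypothetical):

```python
import os


def disable_ray_uv_run_by_default() -> str:
    """Set RAY_ENABLE_UV_RUN_RUNTIME_ENV=0 unless the user already set it.

    Call this before importing ray. Returns the value that is now in effect.
    """
    return os.environ.setdefault("RAY_ENABLE_UV_RUN_RUNTIME_ENV", "0")


disable_ray_uv_run_by_default()
# import ray  # import Ray only after the variable has been decided
```

Because os.environ.setdefault never overwrites an existing value, an explicit export RAY_ENABLE_UV_RUN_RUNTIME_ENV=1 in the shell still wins.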

Enabling Ray’s uv run driver integration

If you'd like to use Ray’s uv run driver integration, set this in your shell before starting ado:

export RAY_ENABLE_UV_RUN_RUNTIME_ENV=1

To have the (uv-run-started) driver in ado connect to an existing Ray cluster, set RAY_ADDRESS in the environment:

export RAY_ADDRESS=...
uv run ado create op ...