Running ado on remote Ray clusters¶
Overview
Running ado on a remote Ray cluster enables long-running operations that can utilize multiple nodes and large amounts of compute resources like GPUs. Such resources may also be a requirement for certain experiments or actuators.
The --remote option automates the steps required to dispatch any ado command to a remote Ray cluster. It handles packaging files, building plugin wheels, generating the Ray runtime environment, and running ray job submit for you.
Prerequisites¶
Only remote project contexts are supported
The project context used must be remote, as it must be accessible when ado executes on the remote Ray cluster. ado will fail with a clear error if a SQLite context is detected.
Cluster login
If your cluster requires a port-forward, oc (OpenShift CLI) or kubectl must be installed, and you must be logged in to the cluster.
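For example, on OpenShift a login typically looks like the following (the server URL and token are placeholders; substitute your cluster's values, and `my-namespace` matches the examples below):

```shell
# Log in to the cluster (token and API URL are placeholders)
oc login --token=<your-token> --server=https://api.my-cluster.example.com:6443

# Verify the login and that the target namespace is visible
oc whoami
oc get svc -n my-namespace
```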
Defining a remote execution context¶
The details of a remote execution environment (where it is, what packages to install, and what environment variables to set) are defined in a YAML configuration file. Here we will call this file remote_context.yaml, but it can have any name. There can be multiple such files for different remote clusters, or for specifying different environments on those clusters.
The minimal example uses a Ray cluster that is directly reachable at a known URL:
```yaml
executionType:
  type: cluster
  clusterUrl: "http://ray-cluster.my-namespace.svc.cluster.local:8265"
packages:
  fromPyPI:
    - ado-core
    - ado-ray-tune # Add any other plugins required by your operation
envVars:
  PYTHONUNBUFFERED: "x"
  OMP_NUM_THREADS: "1"
  OPENBLAS_NUM_THREADS: "1"
  RAY_AIR_NEW_PERSISTENCE_MODE: "0"
wait: false # Set to true to remain attached until the job finishes
```
If your cluster is only reachable via a port-forward (common on OpenShift), add the portForward sub-field. ado will start the port-forward automatically before submitting and tear it down after:
```yaml
executionType:
  type: cluster
  clusterUrl: "http://localhost:8265" # Must match localPort below
  portForward:
    namespace: my-namespace
    serviceName: my-ray-cluster-head-svc
    localPort: 8265 # Default; the port oc/kubectl will bind locally
packages:
  fromPyPI:
    - ado-core
    - ado-ray-tune
envVars:
  PYTHONUNBUFFERED: "x"
  OMP_NUM_THREADS: "1"
wait: false
```
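The port-forward ado starts for you is roughly equivalent to running one of the following manually (service name and namespace match the example configuration above; 8265 is Ray's default dashboard and job-submission port):

```shell
# Approximately what ado runs for you: forward local port 8265
# to the Ray head service's job-submission port
oc port-forward -n my-namespace svc/my-ray-cluster-head-svc 8265:8265

# or, with kubectl:
kubectl port-forward -n my-namespace svc/my-ray-cluster-head-svc 8265:8265
```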
Submitting commands¶
Ray version mismatch errors
If you encounter an error like:

```
RuntimeError: Changing the ray version is not allowed:
current version: 2.54.0, expect version: 2.52.1
```
This means the Ray version installed in your cluster differs from the version that will be installed by your dependencies. To resolve this, explicitly pin the Ray version in your fromPyPI section to match the cluster's version:
```yaml
packages:
  fromPyPI:
    - ado-core
    - ray==2.52.1 # Match the cluster's Ray version
    - ado-ray-tune
```
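One way to find the cluster's Ray version is to query the Ray dashboard's version endpoint (assuming the dashboard is reachable on localhost:8265, e.g. through the port-forward described earlier):

```shell
# Query the Ray dashboard's version endpoint; the JSON response
# reports the Ray version running in the cluster
curl http://localhost:8265/api/version
```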
Pass --remote as a global option before any ado command.
By default, ado will use the current active context as the context for the remote command.
```shell
ado --remote remote_context.yaml create operation -f operation.yaml
```
All ado commands are supported. For example, to query the metastore remotely:
```shell
ado --remote remote_context.yaml get space
```
You can also supply a project context directly using `-c`:
```shell
ado -c mysql_project.yaml --remote remote_context.yaml create operation -f operation.yaml
```
What --remote does
For each invocation ado will:
- Copy the project context file and any `-f` resource files to a temporary working directory.
- Build wheels for any `fromSource` plugin paths.
- Generate a `runtime_env.yaml` from the `packages` and `envVars` fields.
- Start a port-forward if `portForward` is configured.
- Run `ray job submit` with the assembled working directory and runtime environment.
- Tear down the port-forward (if started) and exit with the job's exit code.
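The final submission step is roughly equivalent to invoking ray job submit yourself (the working directory path shown is illustrative; ado assembles it, and the runtime environment, for you):

```shell
# Approximate manual equivalent of the submission ado performs.
# The working directory is a temporary directory ado assembles;
# runtime_env.yaml is generated from the packages and envVars fields.
ray job submit \
  --address http://localhost:8265 \
  --working-dir /tmp/ado-job-workdir \
  --runtime-env runtime_env.yaml \
  -- ado create operation -f operation.yaml
```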
Installing python packages on a remote Ray cluster¶
When executing on a remote Ray cluster you often need to install additional packages, either from PyPI or from local source. There are three methods available:
- Pre-installing: Best when you constantly use the same actuators and operators
- Dynamic installation from PyPI: Best in the general case
- Dynamic installation from source: Best for developers
Ray python package caching
Ray caches the packages it is asked to install, so they are only downloaded, and potentially built, the first time they are requested.
Pre-installing ado packages¶
In this method, ado and the required plugins are already installed in the Ray cluster's base Python environment, i.e. in the image used for the head and worker nodes.
In this case you do not need to specify any packages in your remote_context.yaml. This method has the benefit of adding no job-startup overhead from Python package downloads or build steps.
Using additional plugins with pre-installed ado
If you need additional plugins, or different versions of pre-installed plugins, you must do a dynamic installation of ado-core and all the actuators you need. This is because:

- The pre-installed `ado` command is tied to the base environment, so it will not see new packages; you need to install `ado` into the job's virtualenv.
- The `ado_actuators` namespace package will be superseded by the one created in the job's virtualenv, so actuators in the same namespace package in the base environment will not be seen.
Dynamic installation from PyPI¶
The recommended method is to specify ado-core and the PyPI package names of any plugins required in the packages.fromPyPI section of your remote_context.yaml.
Dynamic installation from source¶
If you need to install plugins or packages from source, specify their paths in the packages.fromSource section of your remote_context.yaml. Note: if a path is relative, it is resolved from the directory where you execute ado --remote ...
```yaml
executionType:
  type: cluster
  clusterUrl: "http://localhost:8265"
packages:
  fromPyPI:
    - ado-core
  fromSource:
    - plugins/actuators/vllm_performance # Assumes ado --remote is executed from the root of the ado repo
wait: false
envVars:
  PYTHONUNBUFFERED: "x"
```
ado will then:
- Build Python wheels for those packages
- Instruct Ray to install the wheels as part of the Ray job submission
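The wheel-build step is roughly what you would do manually with the `build` package (the plugin path is taken from the earlier example; `python -m build` requires `pip install build`):

```shell
# Build a wheel for a from-source plugin into dist/
# (requires the PyPA build tool: pip install build)
python -m build --wheel --outdir dist/ plugins/actuators/vllm_performance
```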
Sending additional files¶
If you have additional files that need to be sent, use the additionalFiles field of the remote execution context YAML. This can be required, for example, if an operator or actuator requires these files as input.
The paths can be absolute or relative. If relative, they are resolved with respect to the directory ado --remote [COMMAND] is executed from.
```yaml
executionType:
  type: cluster
  clusterUrl: "http://localhost:8265"
packages:
  fromPyPI:
    - ado-core
  fromSource:
    - plugins/actuators/vllm_performance
wait: false
envVars:
  PYTHONUNBUFFERED: "x"
additionalFiles:
  - /absolute/path/to/data_file.csv
  - path/to/my_data_dir/ # directories are also supported
```