Installing backend services
Using the distributed MySQL backend for ado¶
This guide is intended for administrators who are responsible for deploying the distributed MySQL backend for ADO or provisioning new projects on it.
Overview¶
We recommend using the Percona Operator for MySQL, which is built on Percona XtraDB Cluster, to provide a resilient and production-ready MySQL backend. This guide assumes that this setup is being used.
Deployment Instructions¶
Kubernetes¶
You can deploy the Percona Operator and create a Percona XtraDB Cluster by following the official Percona documentation for your preferred installation method (for example, kubectl or Helm).
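For example, a minimal sketch of the Helm route, assuming you use the percona Helm repository with its pxc-operator and pxc-db charts (confirm chart names, versions, and values against the official Percona documentation before using this in production):
helm repo add percona https://percona.github.io/percona-helm-charts/
helm repo update
# Install the operator, then a Percona XtraDB Cluster, into the current namespace
helm install pxc-operator percona/pxc-operator
helm install my-pxc percona/pxc-db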
OpenShift¶
In OpenShift environments, the operator can be installed via OperatorHub using the Operator Lifecycle Manager (OLM).
Refer to the official OpenShift-specific guide here:
👉 OpenShift Deployment Guide
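As a rough sketch, the OLM route amounts to creating a Subscription object for the operator. The package name, channel, and catalog source below are placeholders; confirm them in the OperatorHub page of your cluster console:
oc apply -f - <<EOF
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: percona-xtradb-cluster-operator
  namespace: my-pxc-namespace              # replace with your target namespace
spec:
  name: percona-xtradb-cluster-operator    # package name as listed in OperatorHub (placeholder)
  channel: stable                          # confirm the channel in OperatorHub
  source: community-operators              # confirm the catalog source
  sourceNamespace: openshift-marketplace
EOF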
Onboarding projects¶
Warning
Before proceeding, make sure you have followed the steps in Deployment Instructions.
Pre-requisites¶
Software¶
To run the scripts in this guide you will need to have the following tools installed:
- kubectl: https://kubernetes.io/docs/tasks/tools/#kubectl
- mysql client version 8: https://formulae.brew.sh/formula/mysql-client@8.4
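You can quickly check that both tools are available and at suitable versions:
kubectl version --client
mysql --version   # should report an 8.x client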
PXC-related variables¶
Note
We assume that your active namespace is the one in which you installed your Percona XtraDB Cluster.
PXC Cluster name¶
You will need to know the name of your pxc cluster:
kubectl get pxc -o jsonpath='{.items[0].metadata.name}'
We will refer to its name as $PXC_NAME.
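For convenience, you can store it in an environment variable for the commands that follow:
export PXC_NAME=$(kubectl get pxc -o jsonpath='{.items[0].metadata.name}')
echo $PXC_NAME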
PXC Cluster root credentials¶
You will need a highly privileged account to onboard new projects, as you will need to create databases and users and grant permissions. For this reason, we will use the default root account.
You can retrieve its password with:
kubectl get secret $PXC_NAME-secrets --template='{{.data.root}}' | base64 -d
We will refer to this password as $MYSQL_ADMIN_PASSWORD.
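Again, you may want to capture it in an environment variable for the onboarding commands below:
export MYSQL_ADMIN_PASSWORD=$(kubectl get secret $PXC_NAME-secrets --template='{{.data.root}}' | base64 -d)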
Onboarding new projects¶
The simplest way to onboard a new project called $PROJECT_NAME is to use the forward_mysql_and_onboard_new_project.sh script. It creates a new project in the MySQL database and outputs an ado context YAML that can be used to connect to it.
For example:
./forward_mysql_and_onboard_new_project.sh --admin-user root \
--admin-pass $MYSQL_ADMIN_PASSWORD \
--pxc-name $PXC_NAME \
--project-name $PROJECT_NAME
Alternatively, if you are using a hosted MySQL instance (e.g., in the cloud), you can use the onboard_new_project.sh script instead:
./onboard_new_project.sh --admin-user root \
--admin-pass $MYSQL_ADMIN_PASSWORD \
--mysql-endpoint $MYSQL_ENDPOINT \
--project-name $PROJECT_NAME
Once the project is created, the context YAML can be shared with whoever needs access to the project.
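If you want to double-check the result, a minimal sketch is to port-forward the cluster's HAProxy service and list the databases. This assumes the default $PXC_NAME-haproxy service name and that the script creates a database named after the project; adjust both if your setup differs:
# Port-forward the HAProxy service locally (service name is an assumption)
kubectl port-forward svc/$PXC_NAME-haproxy 3306:3306 &
sleep 5   # give the port-forward a moment to establish
# Check for a database matching the project (naming convention is an assumption)
mysql -h 127.0.0.1 -P 3306 -u root -p"$MYSQL_ADMIN_PASSWORD" -e "SHOW DATABASES LIKE '$PROJECT_NAME';"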
Deploying KubeRay and creating a RayCluster¶
This guide is intended for users who want to run operations on an autoscaling Ray cluster deployed on Kubernetes or OpenShift. Depending on cluster permissions, users may need someone with administrator privileges to install KubeRay and/or create RayCluster objects.
Installing KubeRay¶
Warning
KubeRay is included in OpenShift AI and OpenDataHub. Skip this step if either of them is already installed in your cluster.
You can install the KubeRay Operator either via Helm or Kustomize by following the official documentation.
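For reference, the Helm route typically looks like the following; pin the operator version to one compatible with your ray version (the 1.1.0 below mirrors the chart version used later in this guide):
helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm repo update
helm install kuberay-operator kuberay/kuberay-operator --version 1.1.0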
Deploying a RayCluster¶
Warning
The ray versions used by the RayCluster image and your local environment must be compatible. For a more in-depth guide, refer to the RayCluster configuration page.
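A quick way to compare the two is to check the ray version locally and in the running head pod; the label selector below assumes KubeRay's default ray.io/node-type pod label:
# Local ray version
python -c "import ray; print(ray.__version__)"
# Ray version inside the head pod (label selector is an assumption about KubeRay defaults)
kubectl exec -it $(kubectl get pod -l ray.io/node-type=head -o jsonpath='{.items[0].metadata.name}') -- ray --version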
Note
When running multi-node measurements, make sure that all nodes in your multi-node setup have read and write access to your HuggingFace home directory. On Kubernetes with a RayCluster, avoid S3-like filesystems, as they are known to cause failures in transformers. Use an NFS- or GPFS-backed PersistentVolumeClaim instead.
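As an illustration, a shared PersistentVolumeClaim could be created as below and then mounted into the worker groups via their volumes/volumeMounts fields, with HF_HOME pointed at the mount path. The storage class name is a placeholder for whatever NFS/GPFS-backed class your cluster provides:
kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: hf-home-pvc
spec:
  accessModes:
    - ReadWriteMany            # all Ray workers need read/write access
  storageClassName: nfs-client # placeholder: use your NFS/GPFS storage class
  resources:
    requests:
      storage: 100Gi
EOF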
Best Practices for Efficient GPU Resource Utilization¶
To maximize the efficiency of your RayCluster and minimize GPU resource fragmentation, we recommend the following:
- Enable Ray Autoscaler: this allows Ray to dynamically adjust the number of worker replicas based on task demand.
- Use Multiple GPU Worker Variants: define several GPU worker types with varying GPU counts. This flexibility helps match task requirements more precisely and reduces idle GPU time.
Recommended Worker Configuration Strategy¶
Create GPU worker variants with increasing GPU counts, where each variant has double the GPUs of the previous one. Limit each variant to a maximum of 2 replicas, ensuring that their combined GPU usage does not exceed the capacity of a single replica of the next larger variant.
Example: Kubernetes Cluster with 4 Nodes (8 GPUs Each)¶
Recommended worker setup:
- 2 replicas of a worker with 1 GPU
- 2 replicas of a worker with 2 GPUs
- 2 replicas of a worker with 4 GPUs
- 4 replicas of a worker with 8 GPUs
Example: the contents of the additionalWorkerGroups field of a RayCluster for a cluster with 4 nodes, each with 8 NVIDIA-A100-SXM4-80GB GPUs, 64 CPU cores, and 1 TB of memory:
one-A100-80G-gpu-WG:
replicas: 0
minReplicas: 0
maxReplicas: 2
rayStartParams:
block: 'true'
num-gpus: '1'
resources: '"{\"NVIDIA-A100-SXM4-80GB\": 1}"'
containerEnv:
- name: OMP_NUM_THREADS
value: "1"
- name: OPENBLAS_NUM_THREADS
value: "1"
lifecycle:
preStop:
exec:
command: [ "/bin/sh","-c","ray stop" ]
# securityContext: ...
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: nvidia.com/gpu.product
operator: In
values:
- NVIDIA-A100-SXM4-80GB
resources:
limits:
cpu: 8
nvidia.com/gpu: 1
memory: 100Gi
requests:
cpu: 8
nvidia.com/gpu: 1
memory: 100Gi
# volumes: ...
# volumeMounts: ....
two-A100-80G-gpu-WG:
replicas: 0
minReplicas: 0
maxReplicas: 2
rayStartParams:
block: 'true'
num-gpus: '2'
resources: '"{\"NVIDIA-A100-SXM4-80GB\": 2}"'
containerEnv:
- name: OMP_NUM_THREADS
value: "1"
- name: OPENBLAS_NUM_THREADS
value: "1"
lifecycle:
preStop:
exec:
command: [ "/bin/sh","-c","ray stop" ]
# securityContext: ...
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: nvidia.com/gpu.product
operator: In
values:
- NVIDIA-A100-SXM4-80GB
resources:
limits:
cpu: 15
nvidia.com/gpu: 2
memory: 200Gi
requests:
cpu: 15
nvidia.com/gpu: 2
memory: 200Gi
# volumes: ...
# volumeMounts: ....
four-A100-80G-gpu-WG:
replicas: 0
minReplicas: 0
maxReplicas: 2
rayStartParams:
block: 'true'
num-gpus: '4'
resources: '"{\"NVIDIA-A100-SXM4-80GB\": 4}"'
containerEnv:
- name: OMP_NUM_THREADS
value: "1"
- name: OPENBLAS_NUM_THREADS
value: "1"
lifecycle:
preStop:
exec:
command: [ "/bin/sh","-c","ray stop" ]
# securityContext: ...
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: nvidia.com/gpu.product
operator: In
values:
- NVIDIA-A100-SXM4-80GB
resources:
limits:
cpu: 30
nvidia.com/gpu: 4
memory: 400Gi
requests:
cpu: 30
nvidia.com/gpu: 4
memory: 400Gi
# volumes: ...
# volumeMounts: ....
eight-A100-80G-gpu-WG:
replicas: 0
minReplicas: 0
maxReplicas: 4
rayStartParams:
block: 'true'
num-gpus: '8'
resources: '"{\"NVIDIA-A100-SXM4-80GB\": 8, \"full-worker\": 1}"'
containerEnv:
- name: OMP_NUM_THREADS
value: "1"
- name: OPENBLAS_NUM_THREADS
value: "1"
lifecycle:
preStop:
exec:
command: [ "/bin/sh","-c","ray stop" ]
# securityContext: ...
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: nvidia.com/gpu.product
operator: In
values:
- NVIDIA-A100-SXM4-80GB
resources:
limits:
cpu: 60
nvidia.com/gpu: 8
memory: 800Gi
requests:
cpu: 60
nvidia.com/gpu: 8
memory: 800Gi
# volumes: ...
# volumeMounts: ....
Note
Notice that the only variant with a full-worker custom resource is the one with 8 GPUs. Some actuators, like SFTTrainer, use this custom resource for measurements that involve reserving an entire GPU node.
We provide an example set of values for deploying a RayCluster via KubeRay. To deploy it, simply run:
helm upgrade --install ado-ray kuberay/ray-cluster --version 1.1.0 --values backend/kuberay/vanilla-ray.yaml
Feel free to customize it to suit your cluster, for example by uncommenting the GPU-enabled workers.
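After the Helm release is installed, you can verify that the cluster comes up with standard kubectl checks, for example:
kubectl get raycluster
kubectl get pods -l ray.io/node-type=head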