Measure throughput of finetuning on a Remote RayCluster¶
Note
This example illustrates:
- Setting up a remote RayCluster environment for running finetuning performance benchmarks with SFTTrainer
- Benchmarking a set of finetuning configurations using GPUs on a remote RayCluster
The scenario¶
When you run a finetuning workload, you can choose values for parameters like the model name, batch size, and number of GPUs. To understand how these choices affect performance, a common strategy is to measure changes in system behavior by exploring the workload parameter space.
This approach applies to many machine learning workloads where performance depends on configuration.
In this example, ado is used to explore how LLM finetuning throughput varies across the parameter space of a finetuning workload running on a remote RayCluster.
To explore this space, you will:
- define the parameters to test, such as the batch size and the model max length
- define what to test them with - in this case, SFTTrainer's finetune_full_benchmark-v1.0.0 experiment
- define how to explore the parameter space - the sampling method
Info
This example assumes you have already followed the Measure throughput of finetuning locally example.
Prerequisites¶
- A remote shared context is available (see shared contexts for more information). Here we call it finetuning, but it can have any name.
- A remote RayCluster with a GPU worker with at least one NVIDIA-A100-SXM4-80GB GPU. The RayCluster should also include the NVIDIA development and runtime packages. We recommend deploying the RayCluster following our documentation. Ensure that the base virtual environment on your Ray GPU workers meets the requirements of fms-hf-tuning==3.0.0: a) Python 3.11 and b) torch==2.6.0 pre-installed. If you are using a RayCluster on Kubernetes, we recommend using the image quay.io/ado/ado:c6ba952ad79a2d86d1174fd9aaebddd8953c78cf-py311-cu121-ofed2410v1140.
- If you host your RayCluster on Kubernetes or OpenShift, make sure you're logged in to the Kubernetes or OpenShift cluster.
- Activate the finetuning shared context for the example:
ado context finetuning
Install and Configure the SFTTrainer actuator¶
Install the SFTTrainer actuator¶
pip install ado-sfttrainer
Info
The alternative below assumes you are in the root directory of the ado source repository.
If you prefer to install the SFTTrainer actuator from source instead, run:
pip install plugins/actuators/sfttrainer
After installation, executing
ado get actuators
should show an entry for SFTTrainer like the one below:
ACTUATOR ID
0 custom_experiments
1 mock
2 replay
3 SFTTrainer
Configure the SFTTrainer Actuator¶
SFTTrainer includes parameters that control its behavior. For example, it pushes any training metrics it collects, like system profiling metadata, to an AIM server by default. It also features parameters that define important paths, such as the location of the Hugging Face cache and the directory where the actuator expects to find files like the test Dataset.
In this section you will configure the actuator for experiments on your remote RayCluster.
If you are not using an AIM server, create the file actuator_configuration.yaml with the following contents:
actuatorIdentifier: SFTTrainer
parameters:
  hf_home: /hf-models-pvc/huggingface_home
  data_directory: /data/fms-hf-tuning/artificial-dataset/
If you are using an AIM server, create the file actuator_configuration.yaml with the following contents instead, filling in the address and port of your AIM server and AIM dashboard:
actuatorIdentifier: SFTTrainer
parameters:
  aim_db: aim://$the-aim-server-domain-or-ip:port
  aim_dashboard_url: https://$the-aim-dashboard-domain-or-ip:port
  hf_home: /hf-models-pvc/huggingface_home
  data_directory: /data/fms-hf-tuning/artificial-dataset/
Info
If you have deployed a custom RayCluster, make sure that the hf_home and data_directory parameters point to paths that can be created by your remote RayCluster workers. We recommend deploying a remote RayCluster following our instructions.
Next, create the actuatorconfiguration resource like so:
ado create actuatorconfiguration -f actuator_configuration.yaml
The command will print the ID of the resource. Make a note of it; you will need it in a later step.
See the full list of the actuator parameters you can set in the SFTTrainer reference docs.
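For convenience, you can keep that ID in a shell variable so it is easy to paste into operation.yaml later. This is only a shell convention, not something ado requires; the variable name and placeholder below are hypothetical:
# Replace the placeholder with the ID printed by "ado create actuatorconfiguration"
export ACTUATOR_CONFIGURATION_IDENTIFIER=<id-printed-above>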
Prepare the remote RayCluster¶
Info
This section assumes you have configured your RayCluster for use with SFTTrainer and that you have configured your SFTTrainer actuator with the values we provided above for the hf_home and data_directory parameters.
For RayClusters on Kubernetes/OpenShift - create a port-forward¶
Info
If your remote RayCluster is not hosted on Kubernetes or OpenShift then you can skip this step.
In a terminal, start a kubectl port-forward process to the service that connects to the head of your RayCluster. Keep this process running until your experiments finish.
For example, if the name of your RayCluster is ray-disorch, run:
kubectl port-forward svc/ray-disorch-head-svc 8265
Verify that the port-forward is active by visiting http://localhost:8265; you should see the landing page of the Ray web dashboard.
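If you prefer checking from the terminal instead of a browser, the snippet below queries the Ray dashboard's version endpoint; it assumes curl is available on your machine:
# Returns a small JSON document with version information when the port-forward is working
curl -s http://localhost:8265/api/version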
Prepare files for the Ray jobs you will run later¶
Create a directory called my-remote-measurements and cd into it. You will keep all the files for this example in there.
Just as you installed ado and SFTTrainer on your laptop, you need to ensure these Python packages are also available on your remote RayCluster.
You have two options for installing the required packages:
- Pre-install the packages in the virtual environment of your RayCluster before deployment.
- Use a Ray Runtime environment YAML, which instructs Ray to dynamically install Python packages during runtime.
In this section, we’ll focus on the second approach.
If you want Ray to install the packages from PyPI, create the ray_runtime_env.yaml file under the directory my-remote-measurements with the following contents:
pip:
  - sfttrainer
env_vars:
  AIM_UI_TELEMETRY_ENABLED: "0"
  # We set HOME to /tmp because "import aim.utils.tracking" tries to write under $HOME/.aim_profile.
  # However, the process lacks permissions to do so and that leads to an ImportError exception.
  HOME: "/tmp/"
  OMP_NUM_THREADS: "1"
  OPENBLAS_NUM_THREADS: "1"
  RAY_AIR_NEW_PERSISTENCE_MODE: "0"
  PYTHONUNBUFFERED: "x"
If your RayCluster doesn't already have ado installed in its virtual environment, include the ado-core package too.
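For example, the pip section would then look like this (a minimal sketch, assuming both packages are installable by pip under the names used elsewhere in this example):
pip:
  - sfttrainer
  - ado-core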
Alternatively, if you want Ray to install locally built wheels, build the Python wheel for the SFTTrainer actuator plugin. Briefly, if you are in the top level of the ado repository, execute:
python -m build -w plugins/actuators/sfttrainer
mv plugins/actuators/sfttrainer/dist/*.whl ${path to my-remote-measurements}
Then create a ray_runtime_env.yaml file under my-remote-measurements with the following contents:
pip:
  - ${RAY_RUNTIME_ENV_CREATE_WORKING_DIR}/sfttrainer-1.1.0.dev152+g23c7ba34e-py3-none-any.whl
env_vars:
  AIM_UI_TELEMETRY_ENABLED: "0"
  # We set HOME to /tmp because "import aim.utils.tracking" tries to write under $HOME/.aim_profile.
  # However, the process lacks permissions to do so and that leads to an ImportError exception.
  HOME: "/tmp/"
  OMP_NUM_THREADS: "1"
  OPENBLAS_NUM_THREADS: "1"
  RAY_AIR_NEW_PERSISTENCE_MODE: "0"
  PYTHONUNBUFFERED: "x"
If your RayCluster doesn't already have ado installed in its virtual environment, build the wheel for ado-core by repeating the above in the root directory of ado. Then add an entry under pip pointing to the resulting ado wheel file (see the sketch below).
Info
Your wheel filenames may vary.
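Putting it together, the pip section of ray_runtime_env.yaml would then look something like the following sketch; the wheel filenames shown are the ones from this example and will differ in your build:
pip:
  - ${RAY_RUNTIME_ENV_CREATE_WORKING_DIR}/ado_core-1.1.0.dev133+f4b639c1.dirty-py3-none-any.whl
  - ${RAY_RUNTIME_ENV_CREATE_WORKING_DIR}/sfttrainer-1.1.0.dev152+g23c7ba34e-py3-none-any.whl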
For convenience, you can run the script below from inside the my-remote-measurements directory. It builds the wheels of both ado and sfttrainer and then automatically generates the ray_runtime_env.yaml file under your working directory.
$path_to_ado_root/plugins/actuators/sfttrainer/examples/build_wheels.sh
You will use the files you created during this step in later steps when launching jobs on your remote RayCluster.
Create the test Dataset on the remote RayCluster¶
Use the .whl and ray_runtime_env.yaml files with ray job submit to launch a job on your remote RayCluster. This job will create the synthetic dataset and place it in the correct location under the directory specified by the data_directory parameter of the SFTTrainer actuator.
Info
You can find instructions for generating the .whl and ray_runtime_env.yaml files in the Prepare files for the Ray jobs you will run later section.
To submit the job to your remote RayCluster run the command:
ray job submit --address http://localhost:8265 --runtime-env ray_runtime_env.yaml \
--working-dir $PWD -v -- sfttrainer_generate_dataset_text \
-o /data/fms-hf-tuning/artificial-dataset/news-tokens-16384plus-entries-4096.jsonl
Reference docs on creating the datasets
Download model weights on the remote RayCluster¶
Next, submit a ray job that downloads the model weights for granite-3.1-2b to the appropriate path under the directory specified by the hf_home parameter of the SFTTrainer actuator.
First, save the following YAML to a file models.yaml inside your working directory (my-remote-measurements):
granite-3.1-2b:
  Vanilla: ibm-granite/granite-3.1-2b-base
To start the ray job, run:
ray job submit --address http://localhost:8265 --runtime-env ray_runtime_env.yaml \
--working-dir $PWD -v -- \
sfttrainer_download_hf_weights -i models.yaml -o /hf-models-pvc/huggingface_home
Reference docs on pre-fetching weights
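If you later want to pre-fetch weights for additional models, you can presumably list multiple entries in models.yaml following the same pattern. The sketch below is illustrative only: ibm-granite/granite-3.1-8b-base is a real Hugging Face repository, but check the SFTTrainer reference docs for the model_name values the experiments actually accept.
granite-3.1-2b:
  Vanilla: ibm-granite/granite-3.1-2b-base
granite-3.1-8b:
  Vanilla: ibm-granite/granite-3.1-8b-base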
Run the example¶
Define the finetuning workload configurations to test and how to test them¶
In this example, we create a discoveryspace that runs the finetune_full_benchmark-v1.0.0 experiment to finetune the granite-3.1-2b model using 1 GPU.
The entitySpace defined below includes five dimensions:
- model_name, number_gpus, and gpu_model each contain a single value.
- model_max_length and batch_size each contain two values.
The total number of entities in the entitySpace is the number of unique combinations of values across all dimensions. In this case, that is 1 × 1 × 1 × 2 × 2 = 4 entities.
You can find the complete list of the entity space properties in the documentation of the finetune_full_benchmark-v1.0.0 experiment.
- Create the file space.yaml with the contents:
experiments:
  - experimentIdentifier: finetune_full_benchmark-v1.0.0
    actuatorIdentifier: SFTTrainer
    parameterization:
      - property:
          identifier: fms_hf_tuning_version
        value: "3.0.0"
      - property:
          identifier: stop_after_seconds
        # Set training duration to at least 30 seconds.
        # For meaningful system metrics, we recommend a minimum of 300 seconds.
        value: 30
entitySpace:
  - identifier: "model_name"
    propertyDomain:
      values: ["granite-3.1-2b"]
  - identifier: "number_gpus"
    propertyDomain:
      values: [1]
  - identifier: "gpu_model"
    propertyDomain:
      values: ["NVIDIA-A100-SXM4-80GB"]
  - identifier: "model_max_length"
    propertyDomain:
      values: [512, 1024]
  - identifier: "batch_size"
    propertyDomain:
      values: [1, 2]
- Create the space:
  - If you have a samplestore ID, run:
    ado create space -f space.yaml --set "sampleStoreIdentifier=$SAMPLE_STORE_IDENTIFIER"
  - If you do not have a samplestore, run:
    ado create space -f space.yaml --new-sample-store
  This will print a discoveryspace ID (e.g., space-ea937f-831dba). Make a note of this ID; you'll need it in the next step.
Create a random walk operation to explore the space¶
- Create the file operation.yaml with the following contents:

  spaces:
    - The identifier of the DiscoverySpace resource
  actuatorConfigurationIdentifiers:
    - The identifier of the Actuator Configuration resource
  operation:
    module:
      operatorName: "random_walk"
      operationType: "search"
    parameters:
      numberEntities: all
      singleMeasurement: True
      batchSize: 1 # you may increase this number if you have more than 1 GPU
      samplerConfig:
        mode: sequential
        samplerType: generator

- Replace the placeholders with your discoveryspace ID and actuatorconfiguration ID and save the file as operation.yaml.
- Export the finetuning context so you can supply it to the remote operation:

  ado get context --output yaml finetuning > context.yaml

- For the next step your my-remote-measurements directory needs the following files, although the wheel filenames may differ:

  my-remote-measurements
  ├── ado_core-1.1.0.dev133+f4b639c1.dirty-py3-none-any.whl
  ├── context.yaml
  ├── operation.yaml
  ├── ray_runtime_env.yaml
  └── ado_sfttrainer-1.1.0.dev133+gf4b639c10.d20250812-py3-none-any.whl

  Info

  You can find instructions for generating the .whl and ray_runtime_env.yaml files in the Prepare files for the Ray jobs you will run later section.

- Create the operation on the remote RayCluster.

  Use the .whl and ray_runtime_env.yaml files to submit a job to your remote RayCluster which creates the operation that runs your finetuning measurements. Run the command:

  ray job submit --no-wait --address http://localhost:8265 --working-dir . \
    --runtime-env ray_runtime_env.yaml -v -- \
    ado -c context.yaml create operation -f operation.yaml

  The operation will execute the measurements (i.e. apply the experiment finetune_full_benchmark-v1.0.0 to the 4 entities) defined in your discoveryspace. One way to follow the job while it runs is sketched after this list.

  Info

  Each measurement finetunes the granite-3.1-2b model and takes about two minutes to complete. There are four measurements in total. It will also take a couple of minutes for Ray to create the Ray environment on the participating GPU worker nodes, so expect the operation to take O(10) minutes to complete.

  Reference docs for submitting ado operations to remote RayClusters.
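Because the job was submitted with --no-wait, ray job submit returns immediately and prints a submission ID (of the form raysubmit_XXXXXXXXXXXX). If you want to follow the operation while it runs, you can use the standard Ray job CLI; the submission ID below is a placeholder:
# Check the current state of the job (PENDING, RUNNING, SUCCEEDED, FAILED, ...)
ray job status raysubmit_XXXXXXXXXXXX --address http://localhost:8265
# Stream the job's logs until it finishes
ray job logs raysubmit_XXXXXXXXXXXX --address http://localhost:8265 --follow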
Examine the results of the exploration¶
After the operation completes, you can download the results of your measurements:
ado show entities --output-format csv --property-format=target space $yourDiscoverySpaceID
Info
Notice that because the context we are using refers to a remote project, we can access the data created by the operation on the remote RayCluster. Anyone who has access to the finetuning context can also download the results of your measurements!
The command will generate a CSV file. Open it to explore the data that your operation has collected!
It should look similar to this:
,identifier,generatorid,experiment_id,model_name,number_gpus,gpu_model,model_max_length,batch_size,gpu_compute_utilization_min,gpu_compute_utilization_avg,gpu_compute_utilization_max,gpu_memory_utilization_min,gpu_memory_utilization_avg,gpu_memory_utilization_max,gpu_memory_utilization_peak,gpu_power_watts_min,gpu_power_watts_avg,gpu_power_watts_max,gpu_power_percent_min,gpu_power_percent_avg,gpu_power_percent_max,cpu_compute_utilization,cpu_memory_utilization,train_runtime,train_samples_per_second,train_steps_per_second,dataset_tokens_per_second,dataset_tokens_per_second_per_gpu,is_valid
0,model_name.granite-3.1-2b-number_gpus.1-gpu_model.NVIDIA-A100-SXM4-80GB-model_max_length.512-batch_size.2,explicit_grid_sample_generator,SFTTrainer.finetune_full_benchmark-v1.0.0-fms_hf_tuning_version.3.0.0-stop_after_seconds.30,granite-3.1-2b,1,NVIDIA-A100-SXM4-80GB,512,2,40.0,40.0,40.0,25.49774175,25.49774175,25.49774175,31.498108,169.3855,169.3855,169.3855,42.346375,42.346375,42.346375,74.075,2.6414139999999997,31.3457,130.672,16.334,2744.108442306281,2744.108442306281,1
1,model_name.granite-3.1-2b-number_gpus.1-gpu_model.NVIDIA-A100-SXM4-80GB-model_max_length.512-batch_size.1,explicit_grid_sample_generator,SFTTrainer.finetune_full_benchmark-v1.0.0-fms_hf_tuning_version.3.0.0-stop_after_seconds.30,granite-3.1-2b,1,NVIDIA-A100-SXM4-80GB,512,1,27.25,27.25,27.25,25.71380625,25.71380625,25.71380625,31.786194,141.78924999999998,141.78924999999998,141.78924999999998,35.447312499999995,35.447312499999995,35.447312499999995,74.325,2.6420105,30.5903,133.899,33.475,1405.9358685596414,1405.9358685596414,1
2,model_name.granite-3.1-2b-number_gpus.1-gpu_model.NVIDIA-A100-SXM4-80GB-model_max_length.1024-batch_size.1,explicit_grid_sample_generator,SFTTrainer.finetune_full_benchmark-v1.0.0-fms_hf_tuning_version.3.0.0-stop_after_seconds.30,granite-3.1-2b,1,NVIDIA-A100-SXM4-80GB,1024,1,43.25,43.25,43.25,25.43853775,25.43853775,25.43853775,31.498108,181.79625,181.79625,181.79625,45.4490625,45.4490625,45.4490625,74.32499999999999,2.64201475,30.6802,133.506,33.377,2670.126009608803,2670.126009608803,1
3,model_name.granite-3.1-2b-number_gpus.1-gpu_model.NVIDIA-A100-SXM4-80GB-model_max_length.1024-batch_size.2,explicit_grid_sample_generator,SFTTrainer.finetune_full_benchmark-v1.0.0-fms_hf_tuning_version.3.0.0-stop_after_seconds.30,granite-3.1-2b,1,NVIDIA-A100-SXM4-80GB,1024,2,63.75,63.75,63.75,25.67718525,25.67718525,25.67718525,31.737366,238.53825,238.53825,238.53825,59.6345625,59.6345625,59.6345625,74.1,2.6399939999999997,30.1566,135.824,16.978,5161.324552502603,5161.324552502603,1
In the above CSV file you will find 1 column per:
- entity space property (input to the experiment), such as batch_size and model_max_length
- measured property (output of the experiment), such as dataset_tokens_per_second_per_gpu and gpu_memory_utilization_peak
For a complete list of the entity space properties check out the documentation for the finetune_full_benchmark-v1.0.0 experiment in the SFTTrainer docs. The complete list of measured properties is available there too.
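If you just want a quick terminal view of the experiment inputs and the headline throughput numbers, the sketch below uses standard Unix tools. It assumes you saved the CSV as results.csv (a placeholder name; use whatever filename your command produced) and that the column order matches the sample output above:
# Show model_name, number_gpus, model_max_length, batch_size,
# dataset_tokens_per_second, dataset_tokens_per_second_per_gpu and is_valid
cut -d, -f5,6,8,9,28,29,30 results.csv | column -s, -t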
Next steps¶
- 🔬️ Find out more about the SFTTrainer actuator
  The actuator supports several experiments, each with a set of configurable parameters.