
Measure throughput of finetuning on a Remote RayCluster

Note

This example illustrates:

  1. Setting up a remote RayCluster environment for running finetuning performance benchmarks with SFTTrainer

  2. Benchmarking a set of finetuning configurations using GPUs on a remote RayCluster

The scenario

When you run a finetuning workload, you can choose values for parameters like the model name, batch size, and number of GPUs. To understand how these choices affect performance, a common strategy is to measure changes in system behavior by exploring the workload parameter space.

This approach applies to many machine learning workloads where performance depends on configuration.

In this example, ado is used to explore LLM fine-tuning throughput across a fine-tuning workload parameter space on a remote RayCluster.

To explore this space, you will:

  • define the parameters to test - such as the batch size and the model max length
  • define what to test them with - in this case, SFTTrainer's finetune_full_benchmark-v1.0.0 experiment
  • define how to explore the parameter space - the sampling method; here, a random walk

Info

This example assumes you have already followed the Measure throughput of finetuning locally example.

Pre-requisites

  1. A remote shared context is available (see shared contexts for more information). Here we call it finetuning but it can have any name.

  2. A remote RayCluster with a GPU worker that has at least one NVIDIA-A100-SXM4-80GB GPU. The RayCluster should also contain the NVIDIA development and runtime packages. We recommend deploying the RayCluster following our docs.

  3. If you host your RayCluster on Kubernetes or OpenShift, make sure you're logged in to the Kubernetes or Openshift cluster.

  4. Activate the finetuning shared context for the example.

    ado context finetuning
    

Install and Configure the SFTTrainer actuator

Install the SFTTrainer actuator

Run pip install ado-sfttrainer to install the SFTTrainer actuator plugin from the wheel we publish to PyPI.

Info

We are currently in the process of open-sourcing ado, so the above wheel may not exist on PyPI yet. If that is the case when you try out this example, please follow the instructions under the Build the python wheel yourself tab instead.

Info

This step assumes you are in the root directory of the ado source repository.

If you haven't already installed the SFTTrainer actuator, run:

pip install plugins/actuators/sfttrainer

then executing

ado get actuators

should show an entry for SFTTrainer like the one below:

           ACTUATOR ID
0   custom_experiments
1                 mock
2               replay
3           SFTTrainer
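
If you are scripting the setup, you can also check for the entry non-interactively. This is a small optional sketch that simply filters the command's output with grep:

ado get actuators | grep SFTTrainer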

Configure the SFTTrainer Actuator

SFTTrainer includes parameters that control its behavior. For example, it pushes any training metrics it collects, like system profiling metadata, to an AIM server by default. It also features parameters that define important paths, such as the location of the Hugging Face cache and the directory where the actuator expects to find files like the test Dataset.

In this section you will configure the actuator for experiments on your remote RayCluster.

Create the file actuator_configuration.yaml with the following contents:

actuatorIdentifier: SFTTrainer
parameters:
  hf_home: /hf-models-pvc/huggingface_home
  data_directory: /data/fms-hf-tuning/artificial-dataset/

If you have an AIM server for tracking training metrics, create the file actuator_configuration.yaml with the following contents instead:

actuatorIdentifier: SFTTrainer
parameters:
  aim_db: aim://$the-aim-server-domain-or-ip:port
  aim_dashboard_url: https://$the-aim-dashboard-domain-or-ip:port
  hf_home: /hf-models-pvc/huggingface_home
  data_directory: /data/fms-hf-tuning/artificial-dataset/

Info

If you have deployed a custom RayCluster then make sure that the hf_home and data_directory parameters point to paths that can be created by your remote RayCluster workers. We recommend deploying a remote RayCluster following our instructions.

Next, create the actuatorconfiguration resource like so:

ado create actuatorconfiguration -f actuator_configuration.yaml

The command will print the ID of the resource. Make a note of it; you will need it in a later step.
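
Optionally, you can capture the ID in a shell variable so you can reference it later. This is a sketch that assumes the command prints only the ID; if the output includes extra text, copy the ID manually instead:

# Assumes the only output of the command is the actuatorconfiguration ID
ACTUATOR_CONFIGURATION_ID=$(ado create actuatorconfiguration -f actuator_configuration.yaml)
echo "${ACTUATOR_CONFIGURATION_ID}"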

See the full list of the actuator parameters you can set in the SFTTrainer reference docs.

Prepare the remote RayCluster

Info

This section assumes you have configured your RayCluster for use with SFTTrainer and that you have configured your SFTTrainer actuator with the values we provided above for the hf_home and data_directory parameters.

For RayClusters on Kubernetes/OpenShift - create a port-forward

Info

If your remote RayCluster is not hosted on Kubernetes or OpenShift then you can skip this step.

In a terminal, start a kubectl port-forward process to the service that connects to the head of your RayCluster. Keep this process running until your experiments finish.

For example, if the name of your RayCluster is ray-disorch, run:

kubectl port-forward svc/ray-disorch-head-svc 8265

Verify that the port forward is active by visiting http://localhost:8265; you should see the landing page of the Ray web dashboard.
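
If you prefer not to dedicate a terminal to the port-forward, you can run it in the background and check the endpoint from the command line. A minimal sketch, using the ray-disorch service name from the example above:

# Run the port-forward in the background; keep this shell open until your experiments finish
kubectl port-forward svc/ray-disorch-head-svc 8265 &

# The Ray dashboard should answer with HTTP 200
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8265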

Prepare files for the Ray jobs you will run later

Create a directory called my-remote-measurements and cd into it. You will keep all the files for this example in there.

Similar to how you installed ado and SFTTrainer on your laptop, it's important to ensure these Python packages are also available on your remote RayCluster.

You have two options for installing the required packages:

  1. Pre-install the packages in the virtual environment of your RayCluster before deployment.
  2. Use a Ray Runtime environment YAML, which instructs Ray to dynamically install Python packages during runtime.

In this section, we’ll focus on the second approach.

Info

We are currently in the process of open-sourcing ado, so the sfttrainer package referenced below may not exist on PyPI yet. If that is the case when you try out this example, please follow the instructions under the Build the python wheel yourself tab instead.

Create the ray_runtime_env.yaml file under the directory my-remote-measurements with the following contents:

pip:
  - sfttrainer
env_vars:
  AIM_UI_TELEMETRY_ENABLED: "0"
  # We set HOME to /tmp because "import aim.utils.tracking" tries to write under $HOME/.aim_profile.
  # However, the process lacks permissions to do so and that leads to an ImportError exception.
  HOME: "/tmp/"
  OMP_NUM_THREADS: "1"
  OPENBLAS_NUM_THREADS: "1"
  RAY_AIR_NEW_PERSISTENCE_MODE: "0"
  PYTHONUNBUFFERED: "x"

If your RayCluster doesn't already have ado installed in its virtual environment then include the ado-core package too.

Build the python wheel for the SFTTrainer actuator plugin yourself.

Briefly, if you are in the top level of the ado repository, execute:

python -m build -w plugins/actuators/sfttrainer
mv plugins/actuators/sfttrainer/dist/*.whl ${path to my-remote-measurements}
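
The python -m build invocation above requires the build package. If it is not already present in your local environment, install it first:

pip install build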

Then create a ray_runtime_env.yaml file under my-remote-measurements with the following contents:

pip:
  - ${RAY_RUNTIME_ENV_CREATE_WORKING_DIR}/sfttrainer-1.1.0.dev152+g23c7ba34e-py3-none-any.whl
env_vars:
  AIM_UI_TELEMETRY_ENABLED: "0"
  # We set HOME to /tmp because "import aim.utils.tracking" tries to write under $HOME/.aim_profile.
  # However, the process lacks permissions to do so and that leads to an ImportError exception.
  HOME: "/tmp/"
  OMP_NUM_THREADS: "1"
  OPENBLAS_NUM_THREADS: "1"
  RAY_AIR_NEW_PERSISTENCE_MODE: "0"
  PYTHONUNBUFFERED: "x"

If your RayCluster doesn't already have ado installed in its virtual environment then build the wheel for ado-core by repeating the above in the root directory of ado. Then add an entry under pip pointing to the resulting ado wheel file.

Info

Your wheel filenames may vary.

For convenience, you can run the script below from inside the my-remote-measurements directory. It builds the wheels for both ado and sfttrainer and automatically generates the ray_runtime_env.yaml file under your working directory.

$path_to_ado_root/plugins/actuators/sfttrainer/examples/build_wheels.sh

Reference docs on using ado with remote RayClusters.

You will use the files you created during this step in later steps when launching jobs on your remote RayCluster.

Create the test Dataset on the remote RayCluster

Use the .whl and ray_runtime_env.yaml files with ray job submit to launch a job on your remote RayCluster. This job will create the synthetic dataset and place it in the correct location under the directory specified by the data_directory parameter of the SFTTrainer actuator.

Info

You can find instructions for generating the .whl and ray_runtime_env.yaml files in the Prepare files for the Ray jobs you will run later section.

To submit the job to your remote RayCluster run the command:

ray job submit --address http://localhost:8265 --runtime-env ray_runtime_env.yaml \
--working-dir $PWD -v -- sfttrainer_generate_dataset_text \
-o /data/fms-hf-tuning/artificial-dataset/news-tokens-16384plus-entries-4096.jsonl

Reference docs on creating the datasets
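
Optionally, you can confirm the dataset file is in place by submitting a short follow-up job that lists the data directory. This is a minimal sketch; it assumes the data directory is mounted on the node that runs the job, and it does not need the runtime environment because it only runs ls:

ray job submit --address http://localhost:8265 -- \
ls -lh /data/fms-hf-tuning/artificial-dataset/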

Download model weights on the remote RayCluster

Next, submit a ray job that downloads the model weights for granite-3.1-2b to the appropriate path under the directory specified by the hf_home parameter of the SFTTrainer actuator.

First, save the following YAML to a file models.yaml inside your working directory (my-remote-measurements):

granite-3.1-2b:
  Vanilla: ibm-granite/granite-3.1-2b-base

To start the ray job run:

ray job submit --address http://localhost:8265 --runtime-env ray_runtime_env.yaml \
--working-dir $PWD -v -- \
sfttrainer_download_hf_weights -i models.yaml -o /hf-models-pvc/huggingface_home

Reference docs on pre-fetching weights

Run the example

Define the finetuning workload configurations to test and how to test them

In this example, we create a discoveryspace that runs the finetune_full_benchmark-v1.0.0 experiment to finetune the granite-3.1-2b model using 1 GPU.

The entitySpace defined below includes five dimensions:

  • model_name, number_gpus, and gpu_model each contain a single value.
  • model_max_length and batch_size each contain two values.

The total number of entities in the entitySpace is the number of unique combinations of values across all dimensions. In this case that is 1 × 1 × 1 × 2 × 2 = 4 entities.

You can find the complete list of the entity space properties in the documentation of the finetune_full_benchmark-v1.0.0 experiment.

  1. Create the file space.yaml with the contents:

    experiments:
    - experimentIdentifier: finetune_full_benchmark-v1.0.0
      actuatorIdentifier: SFTTrainer
      parameterization:
        - property:
            identifier: fms_hf_tuning_version
          value: "2.8.2"
        - property:
            identifier: stop_after_seconds
          # Set training duration to at least 30 seconds.
          # For meaningful system metrics, we recommend a minimum of 300 seconds.
          value: 30
    
    entitySpace:
    - identifier: "model_name"
      propertyDomain:
        values: [ "granite-3.1-2b" ]
    - identifier: "number_gpus"
      propertyDomain:
        values: [ 1 ]
    - identifier: "gpu_model"
      propertyDomain:
        values: ["NVIDIA-A100-SXM4-80GB"]
    - identifier: "model_max_length"
      propertyDomain:
        values: [ 512, 1024 ]
    - identifier: "batch_size"
      propertyDomain:
        values: [ 1, 2 ] 
    
  2. Create the space:

    • If you have a samplestore ID, run:

      ado create space -f space.yaml --set "sampleStoreIdentifier=$SAMPLE_STORE_IDENTIFIER"

    • If you do not have a samplestore, run:

      ado create space -f space.yaml --new-sample-store


    This will print a discoveryspace ID (e.g., space-ea937f-831dba). Make a note of this ID; you'll need it in the next step.
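
    For convenience, you can export the ID as a shell variable so you can reuse it in later commands. The value below is the example ID from above; substitute your own:

      export yourDiscoverySpaceID=space-ea937f-831dba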

Create a random walk operation to explore the space

  1. Create the file operation.yaml with the following contents:

    spaces:
    - The identifier of the DiscoverySpace resource
    actuatorConfigurationIdentifiers:
    - The identifier of the Actuator Configuration resource
    
    operation:
      module:
        moduleClass: "RandomWalk"
      parameters:
        numberEntities: all
        singleMeasurement: True
        mode: sequential
        samplerType: generator
        batchSize: 1 # you may increase this number if you have more than 1 GPU
    
  2. Replace the placeholders with your discoveryspace ID and actuatorconfiguration ID, then save the file.

  3. Export the finetuning context so you can supply it to the remote operation.

    ado get context --output yaml finetuning >context.yaml
    
  4. For the next step your my-remote-measurements directory needs to contain the following files (your wheel filenames may differ):

    my-remote-measurements
    ├── ado_core-1.1.0.dev133+f4b639c1.dirty-py3-none-any.whl
    ├── context.yaml
    ├── operation.yaml
    ├── ray_runtime_env.yaml
    └── ado_sfttrainer-1.1.0.dev133+gf4b639c10.d20250812-py3-none-any.whl
    

    Info

    You can find instructions for generating the .whl and ray_runtime_env.yaml files in the Prepare files for the Ray jobs you will run later section.

  5. Create the operation on the remote RayCluster

    Use the .whl and ray_runtime_env.yaml files to submit a job to your remote RayCluster which creates the operation that runs your finetuning measurements.

    Run the command:

    ray job submit --no-wait --address http://localhost:8265  --working-dir . \
    --runtime-env ray_runtime_env.yaml -v -- \
    ado -c context.yaml create operation -f operation.yaml
    

    The operation will execute the measurements (i.e. apply the experiment finetune_full_benchmark-v1.0.0 on the 4 entities) as defined in your discoveryspace.

    Info

    Each measurement finetunes the granite-3.1-2b model and takes about two minutes to complete; there are four measurements in total. Ray also needs a couple of minutes to create the runtime environment on the participating GPU worker nodes, so expect the operation to take O(10) minutes to complete.

    Reference docs for submitting ado operations to remote RayClusters.
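
    Because the job was submitted with --no-wait, ray job submit returns immediately and prints a job submission ID. Here is a minimal sketch of following the operation from your laptop with the standard Ray job CLI (raysubmit_XXXXXXXX is a placeholder for your submission ID):

      # List the jobs on the cluster together with their submission IDs and statuses
      ray job list --address http://localhost:8265

      # Stream the logs of your operation
      ray job logs raysubmit_XXXXXXXX --address http://localhost:8265 --follow

      # Check whether the operation has finished
      ray job status raysubmit_XXXXXXXX --address http://localhost:8265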

Examine the results of the exploration

After the operation completes, you can download the results of your measurements:

ado show entities --output-format csv --property-format=target space $yourDiscoverySpaceID

Info

Notice that because the context we are using refers to a remote project, we can access the data created by the operation on the remote RayCluster. Anyone who has access to the finetuning context can also download the results of your measurements!

The command will generate a CSV file. Open it to explore the data that your operation has collected!

It should look similar to this:

,identifier,generatorid,experiment_id,model_name,number_gpus,gpu_model,model_max_length,batch_size,gpu_compute_utilization_min,gpu_compute_utilization_avg,gpu_compute_utilization_max,gpu_memory_utilization_min,gpu_memory_utilization_avg,gpu_memory_utilization_max,gpu_memory_utilization_peak,gpu_power_watts_min,gpu_power_watts_avg,gpu_power_watts_max,gpu_power_percent_min,gpu_power_percent_avg,gpu_power_percent_max,cpu_compute_utilization,cpu_memory_utilization,train_runtime,train_samples_per_second,train_steps_per_second,dataset_tokens_per_second,dataset_tokens_per_second_per_gpu,is_valid
0,model_name.granite-3.1-2b-number_gpus.1-gpu_model.NVIDIA-A100-SXM4-80GB-model_max_length.512-batch_size.2,explicit_grid_sample_generator,SFTTrainer.finetune_full_benchmark-v1.0.0-fms_hf_tuning_version.2.8.2-stop_after_seconds.30,granite-3.1-2b,1,NVIDIA-A100-SXM4-80GB,512,2,40.0,40.0,40.0,25.49774175,25.49774175,25.49774175,31.498108,169.3855,169.3855,169.3855,42.346375,42.346375,42.346375,74.075,2.6414139999999997,31.3457,130.672,16.334,2744.108442306281,2744.108442306281,1
1,model_name.granite-3.1-2b-number_gpus.1-gpu_model.NVIDIA-A100-SXM4-80GB-model_max_length.512-batch_size.1,explicit_grid_sample_generator,SFTTrainer.finetune_full_benchmark-v1.0.0-fms_hf_tuning_version.2.8.2-stop_after_seconds.30,granite-3.1-2b,1,NVIDIA-A100-SXM4-80GB,512,1,27.25,27.25,27.25,25.71380625,25.71380625,25.71380625,31.786194,141.78924999999998,141.78924999999998,141.78924999999998,35.447312499999995,35.447312499999995,35.447312499999995,74.325,2.6420105,30.5903,133.899,33.475,1405.9358685596414,1405.9358685596414,1
2,model_name.granite-3.1-2b-number_gpus.1-gpu_model.NVIDIA-A100-SXM4-80GB-model_max_length.1024-batch_size.1,explicit_grid_sample_generator,SFTTrainer.finetune_full_benchmark-v1.0.0-fms_hf_tuning_version.2.8.2-stop_after_seconds.30,granite-3.1-2b,1,NVIDIA-A100-SXM4-80GB,1024,1,43.25,43.25,43.25,25.43853775,25.43853775,25.43853775,31.498108,181.79625,181.79625,181.79625,45.4490625,45.4490625,45.4490625,74.32499999999999,2.64201475,30.6802,133.506,33.377,2670.126009608803,2670.126009608803,1
3,model_name.granite-3.1-2b-number_gpus.1-gpu_model.NVIDIA-A100-SXM4-80GB-model_max_length.1024-batch_size.2,explicit_grid_sample_generator,SFTTrainer.finetune_full_benchmark-v1.0.0-fms_hf_tuning_version.2.8.2-stop_after_seconds.30,granite-3.1-2b,1,NVIDIA-A100-SXM4-80GB,1024,2,63.75,63.75,63.75,25.67718525,25.67718525,25.67718525,31.737366,238.53825,238.53825,238.53825,59.6345625,59.6345625,59.6345625,74.1,2.6399939999999997,30.1566,135.824,16.978,5161.324552502603,5161.324552502603,1

In the above CSV file you will find one column per:

  • entity space property (input to the experiment), such as batch_size and model_max_length
  • measured property (output of the experiment), such as dataset_tokens_per_second_per_gpu and gpu_memory_utilization_peak

For a complete list of the entity space properties check out the documentation for the finetune_full_benchmark-v1.0.0 experiment in the SFTTrainer docs. The complete list of measured properties is available there too.
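
Once you have the CSV, a quick way to compare configurations is to rank them by throughput. This is a minimal sketch, assuming the CSV was saved as results.csv (a hypothetical filename; use the file the command produced) and that pandas is installed in your local environment:

python3 - <<'EOF'
import pandas as pd  # assumes pandas is available locally

df = pd.read_csv("results.csv")  # hypothetical filename for the CSV produced above
columns = ["model_max_length", "batch_size", "dataset_tokens_per_second_per_gpu"]
# Rank the 4 configurations from highest to lowest per-GPU token throughput
print(df[columns].sort_values("dataset_tokens_per_second_per_gpu", ascending=False))
EOF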

Next steps: