SFTTrainer - measure fine-tuning performance
Overview¶
The SFTTrainer actuator provides a flexible and scalable interface for running supervised fine-tuning (SFT) experiments on large language and vision-language models. It supports a variety of fine-tuning strategies, including full fine-tuning, LoRA, GPTQ-LoRA, and prompt tuning, across both text-to-text and image-to-text datasets.
Designed for high-performance and distributed environments, SFTTrainer supports:
- Single-GPU, multi-GPU, and multi-node training
- Distributed Data Parallel (DDP) and Fully Sharded Data Parallel (FSDP) strategies
- RDMA over Converged Ethernet (RoCE) for optimized multi-node communication
- Ray-based task scheduling, enabling execution on both Kubernetes clusters and bare-metal infrastructure
Under the hood, this actuator wraps the fms-hf-tuning library, which itself builds on the SFTTrainer API from Hugging Face's TRL library. This layered design allows users to leverage the robustness of the Hugging Face ecosystem while benefiting from ado's orchestration and reproducibility features.
Available experiments¶
The SFTTrainer actuator includes a set of experiments that evaluate different fine-tuning strategies under controlled conditions. These experiments use artificial datasets to ensure reproducibility and comparability across runs. A full list of available experiments and their configurations is available in the README.md file of the actuator.
The most frequently used experiments are:
finetune_full_benchmark-v1.0.0¶
Performs full fine-tuning of all model parameters. This experiment is ideal for evaluating end-to-end training performance and resource utilization on large models.
Experiment documentation
An experiment instance:
- performs full fine-tuning
- note that even large-memory GPUs like the 80GB variant of the NVIDIA A100 need at least 2 GPUs to train models as big as 13B parameters
- the training data is artificial
- use_flash_attn is set to True
- packing is set to False
- torch_dtype is set to bfloat16 by default, can also be float16
- uses the FSDP distributed backend for multi-gpu runs by default, can also be DDP
- multi-gpu runs with FSDP and DDP backends use 1 process per GPU (via accelerate)
- runs 1 epoch by default, can also run a custom number of steps
- does not save checkpoints
- loads weights from a PVC
- requests 2 CPU cores per GPU device (with a minimum of 2 cores); the sketch after this list shows how these resource requests are derived
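The resource sizing mentioned above follows directly from the entity-space parameters (number_gpus, number_nodes, batch_size). The snippet below is a minimal illustrative sketch of that arithmetic, not the actuator's actual code; the function name and defaults are hypothetical.

# Minimal illustrative sketch of the resource arithmetic described above.
# Only the formulas mirror the documentation: GPUs are split evenly across nodes,
# each GPU gets 2 CPU cores (minimum 2), and the effective batch_size is split
# evenly across max(1, number_gpus) devices.
def plan_resources(number_gpus: int, number_nodes: int = 1, batch_size: int = 128) -> dict:
    gpus_per_node = max(1, number_gpus // number_nodes)
    cpu_cores_per_node = max(2, 2 * gpus_per_node)
    per_device_train_batch_size = max(1, batch_size // max(1, number_gpus))
    return {
        "gpus_per_node": gpus_per_node,
        "cpu_cores_per_node": cpu_cores_per_node,
        "per_device_train_batch_size": per_device_train_batch_size,
    }

# Example: 8 GPUs spread over 2 nodes with an effective batch size of 128
# -> 4 GPUs and 8 CPU cores per node, 16 samples per device per step.
print(plan_resources(number_gpus=8, number_nodes=2, batch_size=128))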
For FSDP runs we use the following accelerate_config.yml YAML file:
compute_environment: LOCAL_MACHINE
distributed_type: FSDP
downcast_bf16: 'no'
fsdp_config:
fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
fsdp_backward_prefetch: BACKWARD_PRE
fsdp_forward_prefetch: false
fsdp_offload_params: false
fsdp_sharding_strategy: ${fsdp_sharding_strategy}
fsdp_state_dict_type: ${fsdp_state_dict_type}
fsdp_cpu_ram_efficient_loading: true
fsdp_sync_module_states: true
fsdp_transformer_layer_cls_to_wrap: ${accelerate_config_fsdp_transformer_layer_cls_to_wrap}
machine_rank: {$THE MACHINE RANK - always 0 for single-node runs}
main_training_function: main
mixed_precision: ${accelerate_config_mixed_precision}
num_machines: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
main_process_port: {$SOME_PORT}
num_processes: {$NUM_GPUS}
For DDP runs we use this instead:
compute_environment: LOCAL_MACHINE
debug: False
downcast_bf16: no
distributed_type: MULTI_GPU
machine_rank: {$THE MACHINE RANK - always 0 for single-node runs}
main_training_function: main
mixed_precision: ${accelerate_config_mixed_precision}
num_machines: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
main_process_port: {$SOME_PORT}
num_processes: {$NUM_GPUS}
Commandline:
accelerate launch --config_file ${PATH_ACCELERATE_CONFIG} --num_processes ${NUMBER_GPUS} \
${PATH_TO_OUR_WRAPPER_OF_FMS_HF_TUNING_SFT_TRAINER} --model_name_or_path ${MODEL} \
--torch_dtype bfloat16 --use_flash_attn True --training_data_path ${DATASET_PATH} \
--response_template "\n### Response:" --dataset_text_field output --log_level debug \
--num_train_epochs 1 --per_device_train_batch_size ${BATCH_SIZE/NUM_GPUS} \
--max_seq_length ${MODEL_MAX_LENGTH} --eval_strategy no --output_dir ${RANDOM_DIR} \
--gradient_accumulation_steps ${GRADIENT_ACCUMULATION_STEPS} --save_strategy no \
--learning_rate 1e-05 --weight_decay 0.0 --warmup_ratio 0.03 --lr_scheduler_type cosine \
--logging_steps 1 --include_tokens_per_second True --gradient_checkpointing True \
--packing False --peft_method none --optim ${OPTIM} --bf16 ${BF16} \
--gradient_checkpointing_kwargs='{"use_reentrant": ${GRADIENT_CHECKPOINTING_USE_REENTRANT}}' \
--fast_moe ${FAST_MOE}
Note: --fast_moe is only supported for fms-hf-tuning v2.4.0+
We use a thin wrapper of sft_trainer.py which injects a custom Callback that exports the metrics collected by AIM. You can repeat our experiments by pointing the above command line to sft_trainer.py from the fms-hf-tuning package. A sketch of such a callback follows.
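Conceptually, the injected callback looks like the sketch below. This is an illustrative example built on the standard Hugging Face transformers TrainerCallback hooks and the Aim client; it is not the actuator's actual wrapper code, and the class name is hypothetical.

# Illustrative sketch of a metrics-exporting callback in the spirit of the wrapper
# described above (hypothetical class name, not the actuator's implementation).
import aim
from transformers import TrainerCallback

class ExportMetricsCallback(TrainerCallback):
    def __init__(self, experiment_name: str):
        # One Aim run per training job; metrics are grouped under the experiment name.
        self.run = aim.Run(experiment=experiment_name)

    def on_log(self, args, state, control, logs=None, **kwargs):
        # The Trainer invokes on_log every `logging_steps`; forward scalar metrics
        # (loss, tokens per second, learning rate, ...) to the external store.
        for name, value in (logs or {}).items():
            if isinstance(value, (int, float)):
                self.run.track(value, name=name, step=state.global_step)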
Versioning:
- Actuator version: 2.1.0
- fms-hf-tuning versions:
- 2.8.2
- 2.7.1
- 2.6.0
- 2.5.0
- 2.4.0
- 2.3.1
- 2.2.1
- 2.1.2 (default)
- 2.1.1
- 2.1.0
- 2.0.1
Full Finetuning Requirements¶
- The PVC hf-models-pvc mounted under /hf-models-pvc
  - should contain the models:
    - LLaMa/models/hf/13B/
    - LLaMa/models/hf/7B/
    - LLaMa/models/hf/llama2-70b/
    - LLaMa/models/hf/llama3-70b/
    - LLaMa/models/hf/llama3-8b/
    - LLaMa/models/hf/llama3.1-405b/
    - LLaMa/models/hf/llama3.1-70b/
    - LLaMa/models/hf/llama3.1-8b/
    - Mixtral-8x7B-Instruct-v0.1/
    - allam-1-13b-instruct-20240607/
    - granite-13b-base-v2/step_300000_ckpt/
    - granite-20b-code-base-v2/step_280000_ckpt/
    - granite-34b-code-base/
    - granite-8b-code-base/
    - granite-8b-japanese-base-v1-llama/
    - mistralai-mistral-7b-v0.1/
    - mistral-large/fp16_240620
- The PVC ray-disorch-storage mounted under /data with the synthetic datasets of the SFTTrainer actuator
Full Finetuning Entity space¶
Required:
- model_name: Supported models:
["granite-3b-1.5", "hf-tiny-model-private/tiny-random-BloomForCausalLM", "llama-7b", "granite-13b-v2", "llama-13b", "granite-20b-v2", "granite-7b-base", "granite-8b-japanese", "granite-8b-code-base", "granite-34b-code-base", "mistral-7b-v0.1", "llama3-8b", "llama3-70b", "mixtral-8x7b-instruct-v0.1", "llama2-70b", "llama3.1-8b", "llama3.1-70b", "llama3.1-405b", "granite-3b-code-base-128k", "granite-8b-code-base-128k", "allam-1-13b", "granite-3-8b", "granite-3.1-2b", "granite-3.1-8b-instruct", "mistral-123b-v2", "granite-3.1-3b-a800m-instruct", "granite-vision-3.2-2b", "smollm2-135m", "llava-v1.6-mistral-7b"]
- model_max_length: Maximum sequence length. Sequences will be right padded (and possibly truncated)
- number_gpus: The effective number of GPUs (to be evenly distributed to number_nodes machines)
- batch_size: the effective batch_size (will be evenly distributed to max(1, number_gpus) devices)
- gpu_model: The value of the kubernetes node label nvidia.com/gpu.prod, for example NVIDIA-A100-80GB-PCIe, NVIDIA-A100-SXM4-80GB, NVIDIA-H100-PCIe
Optional:
- dataset_id: Default is news-tokens-16384plus-entries-4096. Available options are:
  - news-chars-512-entries-4096: 4096 entries with samples of 512 + 127 (prompt) + 512 characters
  - news-chars-1024-entries-4096: 4096 entries with samples of 1024 + 127 (prompt) + 1024 characters
  - news-chars-2048-entries-4096: 4096 entries with samples of 2048 + 127 (prompt) + 2048 characters
  - news-tokens-16384plus-entries-4096: 4096 entries, each entry has at least 16384 tokens when tokenized with any of the granite-13b-v2, llama-13b-v2, llama-7b, or granite-20b-v2 tokenizers
  - vision-384x384-16384plus-entries-4096: A vision dataset containing 4096 entries. Each entry includes at least 16384 tokens when tokenized with granite-vision-3.2-2b, and consists of repeated copies of a single image with dimensions 384×384.
  - vision-384x768-16384plus-entries-4096: Similar to the above, this dataset also contains 4096 entries with a minimum of 16384 tokens per entry (tokenized using granite-vision-3.2-2b). Each entry uses repeated copies of a single image sized 384×768.
- gradient_checkpointing: Default is True. If True, use gradient checkpointing to save memory (i.e. higher batch sizes) at the expense of a slower backward pass
- gradient_accumulation_steps: Default is 4. Number of update steps to accumulate before performing a backward/update pass. Only takes effect when gradient_checkpointing is True
- torch_dtype: Default is bfloat16. One of bfloat16, float32, float16
- max_steps: Default is -1. The number of optimization steps to perform. Set to -1 to respect num_train_epochs instead.
- num_train_epochs: Default is 1.0. How many epochs to run. Ignored if max_steps is greater than 0.
- stop_after_seconds: Default is -1.0. If set, the optimizer will be asked to stop after the specified time elapses. The check is performed after the end of each training step (see the callback sketch after this list).
- distributed_backend: Default is FSDP for multi-gpu measurements, None (i.e. Data Parallel (DP)) for single-gpu measurements. Which pytorch backend to use when training with multiple GPU devices.
- number_nodes: Default is 1. If set, the actuator distributes tasks on multiple nodes. Each node will use number_gpus/number_nodes GPUs, with 1 process for each GPU it uses.
- fms_hf_tuning_version: Default is 2.1.2. Which version of fms-hf-tuning to use. Available options are: 2.8.2, 2.7.1, 2.6.0, 2.5.0, 2.4.0, 2.3.1, 2.2.1, 2.1.2, 2.1.0, 2.0.1
- enable_roce: Default is False. This setting is only in effect for multi-node runs. It controls whether RDMA over Converged Ethernet (RoCE) is switched on or not.
- fast_moe: Default is 0. Configures the amount of expert parallel sharding. number_gpus must be divisible by it.
- fast_kernels: Default is None. Switches on fast kernels; the value is a list with strings of boolean values for [fast_loss, fast_rms_layernorm, fast_rope_embeddings]
- optim: Default is adamw_torch. The optimizer to use. Available options are adamw_hf, adamw_torch, adamw_torch_fused, adamw_torch_xla, adamw_torch_npu_fused, adamw_apex_fused, adafactor, adamw_anyprecision, adamw_torch_4bit, ademamix, sgd, adagrad, adamw_bnb_8bit, adamw_8bit, ademamix_8bit, lion_8bit, lion_32bit, paged_adamw_32bit, paged_adamw_8bit, paged_ademamix_32bit, paged_ademamix_8bit, paged_lion_32bit, paged_lion_8bit, rmsprop, rmsprop_bnb, rmsprop_bnb_8bit, rmsprop_bnb_32bit, galore_adamw, galore_adamw_8bit, galore_adafactor, galore_adamw_layerwise, galore_adamw_8bit_layerwise, galore_adafactor_layerwise, lomo, adalomo, grokadamw, schedule_free_adamw, schedule_free_sgd
- bf16: Default is False. Whether to use bf16 (mixed) precision instead of 32-bit. Requires Ampere or higher NVIDIA architecture, or using CPU (use_cpu), or Ascend NPU. This is an experimental API and it may change. Can be True, False.
- gradient_checkpointing_use_reentrant: Default is False. Specifies whether to use the activation checkpoint variant that requires reentrant autograd. This parameter should be passed explicitly. Torch version 2.5 will raise an exception if use_reentrant is not passed. If use_reentrant=False, checkpoint will use an implementation that does not require reentrant autograd, which allows checkpoint to support additional functionality, such as working as expected with torch.autograd.grad and support for keyword arguments input into the checkpointed function. Can be True, False.
- fsdp_sharding_strategy: Default is FULL_SHARD. [1] FULL_SHARD (shards optimizer states, gradients and parameters), [2] SHARD_GRAD_OP (shards optimizer states and gradients), [3] NO_SHARD (DDP), [4] HYBRID_SHARD (shards optimizer states, gradients and parameters within each node while each node has a full copy), [5] HYBRID_SHARD_ZERO2 (shards optimizer states and gradients within each node while each node has a full copy). For more information, please refer to the official PyTorch docs.
- fsdp_state_dict_type: Default is FULL_STATE_DICT. [1] FULL_STATE_DICT, [2] LOCAL_STATE_DICT, [3] SHARDED_STATE_DICT
- fsdp_use_orig_params: Default is True. If True, allows non-uniform requires_grad during init, which means support for interspersed frozen and trainable parameters. (Useful only when the use_fsdp flag is passed.)
- accelerate_config_mixed_precision: Default is no. Whether to use mixed precision training or not. Choose from no, fp16, bf16 or fp8. fp8 requires the installation of transformers-engine.
- accelerate_config_fsdp_transformer_layer_cls_to_wrap: Default is None. List of transformer layer class names (case-sensitive) to wrap, e.g. GraniteDecoderLayer, LlamaDecoderLayer, MistralDecoderLayer, BertLayer, GPTJBlock, T5Block ... (useful only when using FSDP)
- dataset_text_field: Default is None. Training dataset text field containing a single sequence. Either dataset_text_field or data_formatter_template needs to be supplied. For vision language model tuning, pass the column name of the text data.
- dataset_image_field: Default is None. For vision language model tuning, pass the column name of the image data in the dataset.
- remove_unused_columns: Default is True. Remove columns not required by the model when using an nlp.Dataset.
- dataset_kwargs_skip_prepare_dataset: Default is False. When True, configures trl to skip preparing the dataset.
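The stop_after_seconds option described above can be pictured as a small time-limit callback. The sketch below uses the standard Hugging Face TrainerCallback API and is illustrative only, not the actuator's actual implementation; the class name is hypothetical.

# Illustrative sketch of the stop_after_seconds behaviour: the elapsed-time check
# happens at the end of every training step, as described above.
import time
from transformers import TrainerCallback

class StopAfterSecondsCallback(TrainerCallback):
    def __init__(self, stop_after_seconds: float):
        self.stop_after_seconds = stop_after_seconds
        self.start_time = None

    def on_train_begin(self, args, state, control, **kwargs):
        self.start_time = time.monotonic()

    def on_step_end(self, args, state, control, **kwargs):
        elapsed = time.monotonic() - self.start_time
        if self.stop_after_seconds > 0 and elapsed >= self.stop_after_seconds:
            control.should_training_stop = True  # ask the Trainer to stop gracefully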
Info
Because running accelerate with a single GPU is unsupported, when setting number_gpus to 1 this experiment actually runs the tuning.sft_trainer script directly (i.e. a DataParallel (DP) run); the sketch below illustrates this selection.
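The launch command is therefore selected roughly as in the sketch below. This is illustrative only; the helper name is hypothetical and the trainer arguments are abbreviated.

# Rough sketch of the launch-mode selection described in the note above: single-GPU
# runs invoke the sft_trainer script directly, multi-GPU runs go through accelerate.
def build_launch_command(number_gpus: int, accelerate_config: str,
                         trainer_script: str, trainer_args: list) -> list:
    if number_gpus <= 1:
        # DataParallel (DP) run: accelerate is not used at all
        return ["python", trainer_script, *trainer_args]
    return [
        "accelerate", "launch",
        "--config_file", accelerate_config,
        "--num_processes", str(number_gpus),
        trainer_script, *trainer_args,
    ]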
finetune_full_stability-v1.0.0¶
Runs full fine-tuning five times and reports the proportion of tasks that fail due to GPU memory limits, fail due to unknown errors, or complete successfully. This experiment is useful for testing model stability under different configurations.
Experiment documentation
An experiment instance:
- performs full fine-tuning 5 times and reports the fraction of tasks that ran out of GPU memory, exhibited some unknown error, or completed successfully
- note that even large-memory GPUs like the 80GB variant of the NVIDIA A100 need at least 2 GPUs to train models as big as 13B parameters
- the training data is artificial
- use_flash_attn is set to True
- packing is set to False
- torch_dtype is set to bfloat16
- uses the FSDP distributed backend
- runs 5 optimization steps
- does not save checkpoints
- loads weights from a PVC
- requests 2 CPU cores per GPU device (with a minimum of 2 cores)
We use the following accelerate_config.yml YAML file for all models:
compute_environment: LOCAL_MACHINE
debug: False
distributed_type: FSDP
downcast_bf16: 'no'
fsdp_config:
fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
fsdp_backward_prefetch: BACKWARD_PRE
fsdp_forward_prefetch: false
fsdp_offload_params: false
fsdp_sharding_strategy: 1
fsdp_state_dict_type: FULL_STATE_DICT
fsdp_cpu_ram_efficient_loading: true
fsdp_sync_module_states: true
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
main_process_port: {$SOME_PORT}
num_processes: {$NUM_GPUS}
Commandline:
accelerate launch --config_file ${PATH_ACCELERATE_CONFIG} --num_processes ${NUMBER_GPUS} \
${PATH_TO_OUR_WRAPPER_OF_FMS_HF_TUNING_SFT_TRAINER} --model_name_or_path ${MODEL} \
--torch_dtype bfloat16 --use_flash_attn True --training_data_path ${DATASET_PATH} \
--response_template "\n### Response:" --dataset_text_field output --log_level debug \
--max_steps -1 --per_device_train_batch_size ${BATCH_SIZE/NUM_GPUS} \
--max_seq_length ${MODEL_MAX_LENGTH} --eval_strategy no --output_dir ${RANDOM_DIR} \
--gradient_accumulation_steps ${GRADIENT_ACCUMULATION_STEPS} --save_strategy no \
--learning_rate 1e-05 --weight_decay 0.0 --warmup_ratio 0.03 --lr_scheduler_type cosine \
--logging_steps 1 --include_tokens_per_second True --gradient_checkpointing True \
--packing False --peft_method none --optim ${OPTIM} --bf16 ${BF16} \
--gradient_checkpointing_kwargs='{"use_reentrant": ${GRADIENT_CHECKPOINTING_USE_REENTRANT}}' \
--fast_moe ${FAST_MOE}
Note: --fast_moe is only supported for fms-hf-tuning v2.4.0+
We use a thin wrapper of sft_trainer.py which injects a custom Callback that exports the metrics collected by AIM. You can repeat our experiments by pointing the above command line to sft_trainer.py from the fms-hf-tuning package.
Versioning:
- Actuator version: 2.1.0
- fms-hf-tuning versions:
- 2.8.2
- 2.7.1
- 2.6.0
- 2.5.0
- 2.4.0
- 2.3.1
- 2.2.1
- 2.1.2 (default)
- 2.1.1
- 2.1.0
- 2.0.1
Full Finetuning (Stability) Requirements¶
- The PVC hf-models-pvc mounted under /hf-models-pvc
  - should contain the models:
    - LLaMa/models/hf/13B/
    - LLaMa/models/hf/7B/
    - LLaMa/models/hf/llama2-70b/
    - LLaMa/models/hf/llama3-70b/
    - LLaMa/models/hf/llama3-8b/
    - LLaMa/models/hf/llama3.1-405b/
    - LLaMa/models/hf/llama3.1-70b/
    - LLaMa/models/hf/llama3.1-8b/
    - Mixtral-8x7B-Instruct-v0.1/
    - allam-1-13b-instruct-20240607/
    - granite-13b-base-v2/step_300000_ckpt/
    - granite-20b-code-base-v2/step_280000_ckpt/
    - granite-34b-code-base/
    - granite-8b-code-base/
    - granite-8b-japanese-base-v1-llama/
    - mistralai-mistral-7b-v0.1/
    - mistral-large/fp16_240620
- The PVC ray-disorch-storage mounted under /data with the synthetic datasets of the SFTTrainer actuator
Full Finetuning (Stability) Entity space¶
Required:
- model_name: Supported models:
["granite-3b-1.5", "hf-tiny-model-private/tiny-random-BloomForCausalLM", "llama-7b", "granite-13b-v2", "llama-13b", "granite-20b-v2", "granite-7b-base", "granite-8b-japanese", "granite-8b-code-base", "granite-34b-code-base", "mistral-7b-v0.1", "llama3-8b", "llama3-70b", "mixtral-8x7b-instruct-v0.1", "llama2-70b", "llama3.1-8b", "llama3.1-70b", "llama3.1-405b", "granite-3b-code-base-128k", "granite-8b-code-base-128k", "allam-1-13b", "granite-3-8b", "granite-3.1-2b", "granite-3.1-8b-instruct", "mistral-123b-v2", "granite-3.1-3b-a800m-instruct", "granite-vision-3.2-2b", "smollm2-135m", "llava-v1.6-mistral-7b"]
- model_max_length: Maximum sequence length. Sequences will be right padded (and possibly truncated)
- number_gpus: The effective number of GPUs (to be evenly distributed to number_nodes machines)
- batch_size: the effective batch_size (will be evenly distributed to max(1, number_gpus) devices)
- gpu_model: The value of the kubernetes node label nvidia.com/gpu.prod, for example NVIDIA-A100-80GB-PCIe, NVIDIA-A100-SXM4-80GB, NVIDIA-H100-PCIe
Optional:
- dataset_id: Default is news-tokens-16384plus-entries-4096. Available options are:
  - news-chars-512-entries-4096: 4096 entries with samples of 512 + 127 (prompt) + 512 characters
  - news-chars-1024-entries-4096: 4096 entries with samples of 1024 + 127 (prompt) + 1024 characters
  - news-chars-2048-entries-4096: 4096 entries with samples of 2048 + 127 (prompt) + 2048 characters
  - news-tokens-16384plus-entries-4096: 4096 entries, each entry has at least 16384 tokens when tokenized with any of the granite-13b-v2, llama-13b-v2, llama-7b, or granite-20b-v2 tokenizers
  - vision-384x384-16384plus-entries-4096: A vision dataset containing 4096 entries. Each entry includes at least 16384 tokens when tokenized with granite-vision-3.2-2b, and consists of repeated copies of a single image with dimensions 384×384.
  - vision-384x768-16384plus-entries-4096: Similar to the above, this dataset also contains 4096 entries with a minimum of 16384 tokens per entry (tokenized using granite-vision-3.2-2b). Each entry uses repeated copies of a single image sized 384×768.
- gradient_checkpointing: Default is True. If True, use gradient checkpointing to save memory (i.e. higher batch sizes) at the expense of a slower backward pass
- gradient_accumulation_steps: Default is 4. Number of update steps to accumulate before performing a backward/update pass. Only takes effect when gradient_checkpointing is True
- torch_dtype: Default is bfloat16. One of bfloat16, float32, float16
- stop_after_seconds: Default is -1.0. If set, the optimizer will be asked to stop after the specified time elapses. The check is performed after the end of each training step.
- distributed_backend: Default is FSDP for multi-gpu measurements, None (i.e. Data Parallel (DP)) for single-gpu measurements. Which pytorch backend to use when training with multiple GPU devices.
- number_nodes: Default is 1. If set, the actuator distributes tasks on multiple nodes. Each node will use number_gpus/number_nodes GPUs, with 1 process for each GPU it uses.
- fms_hf_tuning_version: Default is 2.1.2. Which version of fms-hf-tuning to use. Available options are: 2.8.2, 2.7.1, 2.6.0, 2.5.0, 2.4.0, 2.3.1, 2.2.1, 2.1.2, 2.1.0, 2.0.1
- enable_roce: Default is False. This setting is only in effect for multi-node runs. It controls whether RDMA over Converged Ethernet (RoCE) is switched on or not.
- fast_moe: Default is 0. Configures the amount of expert parallel sharding. number_gpus must be divisible by it.
- fast_kernels: Default is None. Switches on fast kernels; the value is a list with strings of boolean values for [fast_loss, fast_rms_layernorm, fast_rope_embeddings]
- optim: Default is adamw_torch. The optimizer to use. Available options are adamw_hf, adamw_torch, adamw_torch_fused, adamw_torch_xla, adamw_torch_npu_fused, adamw_apex_fused, adafactor, adamw_anyprecision, adamw_torch_4bit, ademamix, sgd, adagrad, adamw_bnb_8bit, adamw_8bit, ademamix_8bit, lion_8bit, lion_32bit, paged_adamw_32bit, paged_adamw_8bit, paged_ademamix_32bit, paged_ademamix_8bit, paged_lion_32bit, paged_lion_8bit, rmsprop, rmsprop_bnb, rmsprop_bnb_8bit, rmsprop_bnb_32bit, galore_adamw, galore_adamw_8bit, galore_adafactor, galore_adamw_layerwise, galore_adamw_8bit_layerwise, galore_adafactor_layerwise, lomo, adalomo, grokadamw, schedule_free_adamw, schedule_free_sgd
- bf16: Default is False. Whether to use bf16 (mixed) precision instead of 32-bit. Requires Ampere or higher NVIDIA architecture, or using CPU (use_cpu), or Ascend NPU. This is an experimental API and it may change. Can be True, False.
- gradient_checkpointing_use_reentrant: Default is False. Specifies whether to use the activation checkpoint variant that requires reentrant autograd. This parameter should be passed explicitly. Torch version 2.5 will raise an exception if use_reentrant is not passed. If use_reentrant=False, checkpoint will use an implementation that does not require reentrant autograd, which allows checkpoint to support additional functionality, such as working as expected with torch.autograd.grad and support for keyword arguments input into the checkpointed function. Can be True, False.
- fsdp_sharding_strategy: Default is FULL_SHARD. [1] FULL_SHARD (shards optimizer states, gradients and parameters), [2] SHARD_GRAD_OP (shards optimizer states and gradients), [3] NO_SHARD (DDP), [4] HYBRID_SHARD (shards optimizer states, gradients and parameters within each node while each node has a full copy), [5] HYBRID_SHARD_ZERO2 (shards optimizer states and gradients within each node while each node has a full copy). For more information, please refer to the official PyTorch docs.
- fsdp_state_dict_type: Default is FULL_STATE_DICT. [1] FULL_STATE_DICT, [2] LOCAL_STATE_DICT, [3] SHARDED_STATE_DICT
- fsdp_use_orig_params: Default is True. If True, allows non-uniform requires_grad during init, which means support for interspersed frozen and trainable parameters. (Useful only when the use_fsdp flag is passed.)
- accelerate_config_mixed_precision: Default is no. Whether to use mixed precision training or not. Choose from no, fp16, bf16 or fp8. fp8 requires the installation of transformers-engine.
- accelerate_config_fsdp_transformer_layer_cls_to_wrap: Default is None. List of transformer layer class names (case-sensitive) to wrap, e.g. GraniteDecoderLayer, LlamaDecoderLayer, MistralDecoderLayer, BertLayer, GPTJBlock, T5Block ... (useful only when using FSDP)
- dataset_text_field: Default is None. Training dataset text field containing a single sequence. Either dataset_text_field or data_formatter_template needs to be supplied. For vision language model tuning, pass the column name of the text data.
- dataset_image_field: Default is None. For vision language model tuning, pass the column name of the image data in the dataset.
- remove_unused_columns: Default is True. Remove columns not required by the model when using an nlp.Dataset.
- dataset_kwargs_skip_prepare_dataset: Default is False. When True, configures trl to skip preparing the dataset.
Full Finetuning (Stability) Measured properties¶
- f_gpu_oom: fraction of tasks that ran out of GPU memory
- f_other_error: fraction of tasks that ran into an unknown error
- f_no_error: fraction of tasks that completed successfully
- is_valid: whether this collection of tasks is a valid point to investigate (the sketch below shows how the fractions are computed)
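As a concrete illustration, the measured fractions are simply the outcome counts of the five repeated tasks divided by five. The sketch below is illustrative; the function name and outcome labels are hypothetical.

# Illustrative sketch: turning the outcomes of the 5 repeated fine-tuning tasks into
# the measured fractions reported by this experiment (outcome labels are hypothetical).
from collections import Counter

def stability_fractions(outcomes: list) -> dict:
    counts = Counter(outcomes)
    n = len(outcomes)
    return {
        "f_gpu_oom": counts["gpu_oom"] / n,
        "f_other_error": counts["other_error"] / n,
        "f_no_error": counts["ok"] / n,
    }

# Example: 3 successes, 1 out-of-memory failure, 1 unknown error
print(stability_fractions(["ok", "gpu_oom", "ok", "other_error", "ok"]))
# -> {'f_gpu_oom': 0.2, 'f_other_error': 0.2, 'f_no_error': 0.6}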
Info
Because running accelerate with a single GPU is unsupported, when setting number_gpus to 1 this experiment actually runs the tuning.sft_trainer script directly (i.e. a DataParallel (DP) run).
finetune_lora_benchmark-v1.0.0¶
Executes LoRA-based fine-tuning, a parameter-efficient method that adapts only a small subset of model weights. This benchmark is useful for scenarios where compute or memory resources are limited, while still enabling meaningful adaptation.
Experiment documentation
An experiment instance:
- performs LoRA fine-tuning
- the training data is artificial
- use_flash_attn is set to True
- packing is set to False
- torch_dtype is set to bfloat16 by default, can also be float16
- uses the FSDP distributed backend for multi-gpu runs by default, can also be DDP
- multi-gpu runs with FSDP and DDP backends use 1 process per GPU (via accelerate)
- runs 1 epoch by default, can also run a custom number of steps
- does not save checkpoints
- loads weights from a PVC
- requests 2 CPU cores per GPU device (with a minimum of 2 cores)
For FSDP runs we use the following accelerate_config.yml YAML file:
compute_environment: LOCAL_MACHINE
distributed_type: FSDP
downcast_bf16: 'no'
fsdp_config:
fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
fsdp_backward_prefetch: BACKWARD_PRE
fsdp_forward_prefetch: false
fsdp_offload_params: false
fsdp_sharding_strategy: ${fsdp_sharding_strategy}
fsdp_state_dict_type: ${fsdp_state_dict_type}
fsdp_cpu_ram_efficient_loading: true
fsdp_sync_module_states: true
fsdp_transformer_layer_cls_to_wrap: ${accelerate_config_fsdp_transformer_layer_cls_to_wrap}
machine_rank: {$THE MACHINE RANK - always 0 for single-node runs}
main_training_function: main
mixed_precision: ${accelerate_config_mixed_precision}
num_machines: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
main_process_port: {$SOME_PORT}
num_processes: {$NUM_GPUS}
For DDP runs we use this instead:
compute_environment: LOCAL_MACHINE
debug: False
downcast_bf16: no
distributed_type: MULTI_GPU
machine_rank: {$THE MACHINE RANK - always 0 for single-node runs}
main_training_function: main
mixed_precision: ${accelerate_config_mixed_precision}
num_machines: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
main_process_port: {$SOME_PORT}
num_processes: {$NUM_GPUS}
Commandline:
accelerate launch --config_file ${PATH_ACCELERATE_CONFIG} --num_processes ${NUMBER_GPUS} \
${PATH_TO_OUR_WRAPPER_OF_FMS_HF_TUNING_SFT_TRAINER} --model_name_or_path ${MODEL} \
--torch_dtype bfloat16 --use_flash_attn True --training_data_path ${DATASET_PATH} \
--response_template "\n### Response:" --dataset_text_field output --log_level debug \
--num_train_epochs 1 --per_device_train_batch_size ${BATCH_SIZE/NUM_GPUS} \
--max_seq_length ${MODEL_MAX_LENGTH} --eval_strategy no --output_dir ${RANDOM_DIR} \
--gradient_accumulation_steps ${GRADIENT_ACCUMULATION_STEPS} --save_strategy no \
--learning_rate 1e-05 --weight_decay 0.0 --warmup_ratio 0.03 --lr_scheduler_type cosine \
--logging_steps 1 --include_tokens_per_second True --gradient_checkpointing True \
--packing False --peft_method lora --target_modules ${SPACE SEPARATED LAYER NAMES} \
--optim ${OPTIM} --bf16 ${BF16} \
--gradient_checkpointing_kwargs='{"use_reentrant": ${GRADIENT_CHECKPOINTING_USE_REENTRANT}}' \
--fast_moe ${FAST_MOE}
Note: --fast_moe is only supported for fms-hf-tuning v2.4.0+
We use a thin wrapper of sft_trainer.py which injects a custom Callback that exports the metrics collected by AIM. You can repeat our experiments by pointing the above command line to sft_trainer.py from the fms-hf-tuning package.
Versioning:
- Actuator version: 2.1.0
- fms-hf-tuning versions:
- 2.8.2
- 2.7.1
- 2.6.0
- 2.5.0
- 2.4.0
- 2.3.1
- 2.2.1
- 2.1.2 (default)
- 2.1.1
- 2.1.0
- 2.0.1
LoRA Requirements¶
- The PVC hf-models-pvc mounted under /hf-models-pvc
  - should contain the models:
    - LLaMa/models/hf/13B/
    - LLaMa/models/hf/7B/
    - LLaMa/models/hf/llama2-70b/
    - LLaMa/models/hf/llama3-70b/
    - LLaMa/models/hf/llama3-8b/
    - LLaMa/models/hf/llama3.1-405b/
    - LLaMa/models/hf/llama3.1-70b/
    - LLaMa/models/hf/llama3.1-8b/
    - Mixtral-8x7B-Instruct-v0.1/
    - allam-1-13b-instruct-20240607/
    - granite-13b-base-v2/step_300000_ckpt/
    - granite-20b-code-base-v2/step_280000_ckpt/
    - granite-34b-code-base/
    - granite-8b-code-base/
    - granite-8b-japanese-base-v1-llama/
    - mistralai-mistral-7b-v0.1/
    - mistral-large/fp16_240620
- The PVC ray-disorch-storage mounted under /data with the synthetic datasets of the SFTTrainer actuator
LoRA Entity space¶
Required:
- model_name: Supported models:
["granite-3b-1.5", "hf-tiny-model-private/tiny-random-BloomForCausalLM", "llama-7b", "granite-13b-v2", "llama-13b", "granite-20b-v2", "granite-7b-base", "granite-8b-japanese", "granite-8b-code-base", "granite-34b-code-base", "mistral-7b-v0.1", "llama3-8b", "llama3-70b", "mixtral-8x7b-instruct-v0.1", "llama2-70b", "llama3.1-8b", "llama3.1-70b", "llama3.1-405b", "granite-3b-code-base-128k", "granite-8b-code-base-128k", "allam-1-13b", "granite-3-8b", "granite-3.1-2b", "granite-3.1-8b-instruct", "mistral-123b-v2", "granite-3.1-3b-a800m-instruct", "granite-vision-3.2-2b", "smollm2-135m", "llava-v1.6-mistral-7b"]
- model_max_length: Maximum sequence length. Sequences will be right padded (and possibly truncated)
- number_gpus: The effective number of GPUs (to be evenly distributed to number_nodes machines)
- batch_size: the effective batch_size (will be evenly distributed to max(1, number_gpus) devices)
- gpu_model: The value of the kubernetes node label nvidia.com/gpu.prod, for example NVIDIA-A100-80GB-PCIe, NVIDIA-A100-SXM4-80GB, NVIDIA-H100-PCIe
Optional:
- dataset_id: Default is news-tokens-16384plus-entries-4096. Available options are:
  - news-chars-512-entries-4096: 4096 entries with samples of 512 + 127 (prompt) + 512 characters
  - news-chars-1024-entries-4096: 4096 entries with samples of 1024 + 127 (prompt) + 1024 characters
  - news-chars-2048-entries-4096: 4096 entries with samples of 2048 + 127 (prompt) + 2048 characters
  - news-tokens-16384plus-entries-4096: 4096 entries, each entry has at least 16384 tokens when tokenized with any of the granite-13b-v2, llama-13b-v2, llama-7b, or granite-20b-v2 tokenizers
  - vision-384x384-16384plus-entries-4096: A vision dataset containing 4096 entries. Each entry includes at least 16384 tokens when tokenized with granite-vision-3.2-2b, and consists of repeated copies of a single image with dimensions 384×384.
  - vision-384x768-16384plus-entries-4096: Similar to the above, this dataset also contains 4096 entries with a minimum of 16384 tokens per entry (tokenized using granite-vision-3.2-2b). Each entry uses repeated copies of a single image sized 384×768.
- gradient_checkpointing: Default is True. If True, use gradient checkpointing to save memory (i.e. higher batch sizes) at the expense of a slower backward pass
- gradient_accumulation_steps: Default is 4. Number of update steps to accumulate before performing a backward/update pass. Only takes effect when gradient_checkpointing is True
- torch_dtype: Default is bfloat16. One of bfloat16, float32, float16
- max_steps: Default is -1. The number of optimization steps to perform. Set to -1 to respect num_train_epochs instead.
- num_train_epochs: Default is 1.0. How many epochs to run. Ignored if max_steps is greater than 0.
- stop_after_seconds: Default is -1.0. If set, the optimizer will be asked to stop after the specified time elapses. The check is performed after the end of each training step.
- distributed_backend: Default is FSDP for multi-gpu measurements, None (i.e. Data Parallel (DP)) for single-gpu measurements. Which pytorch backend to use when training with multiple GPU devices.
- number_nodes: Default is 1. If set, the actuator distributes tasks on multiple nodes. Each node will use number_gpus/number_nodes GPUs, with 1 process for each GPU it uses.
- fms_hf_tuning_version: Default is 2.1.2. Which version of fms-hf-tuning to use. Available options are: 2.8.2, 2.7.1, 2.6.0, 2.5.0, 2.4.0, 2.3.1, 2.2.1, 2.1.2, 2.1.0, 2.0.1
- enable_roce: Default is False. This setting is only in effect for multi-node runs. It controls whether RDMA over Converged Ethernet (RoCE) is switched on or not.
- fast_moe: Default is 0. Configures the amount of expert parallel sharding. number_gpus must be divisible by it.
- fast_kernels: Default is None. Switches on fast kernels; the value is a list with strings of boolean values for [fast_loss, fast_rms_layernorm, fast_rope_embeddings]
- r: Default is 4. The LoRA rank
- lora_alpha: Default is 16. Scales the learning weights.
- optim: Default is adamw_torch. The optimizer to use. Available options are adamw_hf, adamw_torch, adamw_torch_fused, adamw_torch_xla, adamw_torch_npu_fused, adamw_apex_fused, adafactor, adamw_anyprecision, adamw_torch_4bit, ademamix, sgd, adagrad, adamw_bnb_8bit, adamw_8bit, ademamix_8bit, lion_8bit, lion_32bit, paged_adamw_32bit, paged_adamw_8bit, paged_ademamix_32bit, paged_ademamix_8bit, paged_lion_32bit, paged_lion_8bit, rmsprop, rmsprop_bnb, rmsprop_bnb_8bit, rmsprop_bnb_32bit, galore_adamw, galore_adamw_8bit, galore_adafactor, galore_adamw_layerwise, galore_adamw_8bit_layerwise, galore_adafactor_layerwise, lomo, adalomo, grokadamw, schedule_free_adamw, schedule_free_sgd
- bf16: Default is False. Whether to use bf16 (mixed) precision instead of 32-bit. Requires Ampere or higher NVIDIA architecture, or using CPU (use_cpu), or Ascend NPU. This is an experimental API and it may change. Can be True, False.
- gradient_checkpointing_use_reentrant: Default is False. Specifies whether to use the activation checkpoint variant that requires reentrant autograd. This parameter should be passed explicitly. Torch version 2.5 will raise an exception if use_reentrant is not passed. If use_reentrant=False, checkpoint will use an implementation that does not require reentrant autograd, which allows checkpoint to support additional functionality, such as working as expected with torch.autograd.grad and support for keyword arguments input into the checkpointed function. Can be True, False.
- fsdp_sharding_strategy: Default is FULL_SHARD. [1] FULL_SHARD (shards optimizer states, gradients and parameters), [2] SHARD_GRAD_OP (shards optimizer states and gradients), [3] NO_SHARD (DDP), [4] HYBRID_SHARD (shards optimizer states, gradients and parameters within each node while each node has a full copy), [5] HYBRID_SHARD_ZERO2 (shards optimizer states and gradients within each node while each node has a full copy). For more information, please refer to the official PyTorch docs.
- fsdp_state_dict_type: Default is FULL_STATE_DICT. [1] FULL_STATE_DICT, [2] LOCAL_STATE_DICT, [3] SHARDED_STATE_DICT
- fsdp_use_orig_params: Default is True. If True, allows non-uniform requires_grad during init, which means support for interspersed frozen and trainable parameters. (Useful only when the use_fsdp flag is passed.)
- accelerate_config_mixed_precision: Default is no. Whether to use mixed precision training or not. Choose from no, fp16, bf16 or fp8. fp8 requires the installation of transformers-engine.
- accelerate_config_fsdp_transformer_layer_cls_to_wrap: Default is None. List of transformer layer class names (case-sensitive) to wrap, e.g. GraniteDecoderLayer, LlamaDecoderLayer, MistralDecoderLayer, BertLayer, GPTJBlock, T5Block ... (useful only when using FSDP)
- dataset_text_field: Default is None. Training dataset text field containing a single sequence. Either dataset_text_field or data_formatter_template needs to be supplied. For vision language model tuning, pass the column name of the text data.
- dataset_image_field: Default is None. For vision language model tuning, pass the column name of the image data in the dataset.
- remove_unused_columns: Default is True. Remove columns not required by the model when using an nlp.Dataset.
- dataset_kwargs_skip_prepare_dataset: Default is False. When True, configures trl to skip preparing the dataset.
Hardcoded:
Sets the --target_modules layer names based on the model_name (see the sketch after this list):
- smollm2-135m: ["q_proj", "v_proj"]
- granite-3.1-3b-a800m-instruct: ["q_proj", "v_proj"]
- granite-vision-3.2-2b: ["q_proj", "v_proj"]
- granite-3b-code-base-128k: ["q_proj", "v_proj"]
- granite-7b-base: ["q_proj", "v_proj"]
- granite-8b-code-base-128k: ["q_proj", "v_proj"]
- granite-8b-code-base: ["q_proj", "v_proj"]
- granite-8b-japanese: ["q_proj", "v_proj"]
- granite-13b-v2: ["c_attn", "c_proj"]
- granite-20b-v2: ["c_attn", "c_proj"]
- granite-34b-code-base: ["c_attn", "c_proj"]
- llama-7b: ["q_proj", "k_proj"]
- llama-13b: ["q_proj", "k_proj"]
- llama2-70b: ["q_proj", "v_proj"]
- llama3-8b: ["q_proj", "k_proj"]
- llama3-70b: ["q_proj", "v_proj"]
- llama3.1-8b: ["q_proj", "v_proj"]
- llama3.1-70b: ["q_proj", "v_proj"]
- llama3.1-405b: ["q_proj", "v_proj"]
- allam-1-13b: ["q_proj", "v_proj"]
- hf-tiny-model-private/tiny-random-BloomForCausalLM: ["dense_h_to_4h", "dense_4h_to_4h"]
- mistral-7b-v0.1: ["q_proj", "v_proj"]
- mistral-123b-v2: ["q_proj", "v_proj"]
- mixtral-8x7b-instruct-v0.1: ["q_proj", "v_proj"]
- granite-3-8b: ["q_proj", "v_proj"]
- granite-3.1-2b: ["q_proj", "v_proj"]
- granite-3.1-8b-instruct: ["q_proj", "v_proj"]
- llava-v1.6-mistral-7b: ["q_proj", "v_proj"]
Info
Because running accelerate with a single GPU is unsupported, when setting number_gpus to 1 this experiment actually runs the tuning.sft_trainer script directly (i.e. a DataParallel (DP) run).
finetune_pt_benchmark-v1.0.0¶
Runs prompt-tuning, a lightweight fine-tuning strategy that prepends trainable prompts to the input. Similar to LoRA, this benchmark is useful for compute or memory constrained environments.
Experiment documentation
An experiment instance:
- performs prompt-tuning
- the training data is artificial
- use_flash_attn is set to True
- packing is set to False
- torch_dtype is set to bfloat16 by default, can also be float16
- uses the FSDP distributed backend for multi-gpu runs by default, can also be DDP
- multi-gpu runs with FSDP and DDP backends use 1 process per GPU (via accelerate)
- runs 1 epoch by default, can also run a custom number of steps
- does not save checkpoints
- loads weights from a PVC
- requests 2 CPU cores per GPU device (with a minimum of 2 cores)
For FSDP runs we use the following accelerate_config.yml YAML file:
compute_environment: LOCAL_MACHINE
distributed_type: FSDP
downcast_bf16: 'no'
fsdp_config:
fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
fsdp_backward_prefetch: BACKWARD_PRE
fsdp_forward_prefetch: false
fsdp_offload_params: false
fsdp_sharding_strategy: ${fsdp_sharding_strategy}
fsdp_state_dict_type: ${fsdp_state_dict_type}
fsdp_cpu_ram_efficient_loading: true
fsdp_sync_module_states: true
fsdp_transformer_layer_cls_to_wrap: ${accelerate_config_fsdp_transformer_layer_cls_to_wrap}
machine_rank: {$THE MACHINE RANK - always 0 for single-node runs}
main_training_function: main
mixed_precision: ${accelerate_config_mixed_precision}
num_machines: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
main_process_port: {$SOME_PORT}
num_processes: {$NUM_GPUS}
For DDP runs we use this instead:
compute_environment: LOCAL_MACHINE
debug: False
downcast_bf16: no
distributed_type: MULTI_GPU
machine_rank: {$THE MACHINE RANK - always 0 for single-node runs}
main_training_function: main
mixed_precision: ${accelerate_config_mixed_precision}
num_machines: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
main_process_port: {$SOME_PORT}
num_processes: {$NUM_GPUS}
Commandline:
accelerate launch --config_file ${PATH_ACCELERATE_CONFIG} --num_processes ${NUMBER_GPUS} \
${PATH_TO_OUR_WRAPPER_OF_FMS_HF_TUNING_SFT_TRAINER} --model_name_or_path ${MODEL} \
--torch_dtype bfloat16 --use_flash_attn True --training_data_path ${DATASET_PATH} \
--response_template "\n### Response:" --dataset_text_field output --log_level debug \
--num_train_epochs 1 --per_device_train_batch_size ${BATCH_SIZE/NUM_GPUS} \
--max_seq_length ${MODEL_MAX_LENGTH} --eval_strategy no --output_dir ${RANDOM_DIR} \
--gradient_accumulation_steps ${GRADIENT_ACCUMULATION_STEPS} --save_strategy no \
--learning_rate 1e-05 --weight_decay 0.0 --warmup_ratio 0.03 --lr_scheduler_type cosine \
--logging_steps 1 --include_tokens_per_second True --gradient_checkpointing True \
--packing False --peft_method none \
--fast_moe ${FAST_MOE}
Note: --fast_moe is only supported for fms-hf-tuning v2.4.0+
We use a thin wrapper of sft_trainer.py which injects a custom Callback that exports the metrics collected by AIM. You can repeat our experiments by pointing the above command line to sft_trainer.py from the fms-hf-tuning package.
Versioning:
- Actuator version: 2.1.0
- fms-hf-tuning versions:
- 2.8.2
- 2.7.1
- 2.6.0
- 2.5.0
- 2.4.0
- 2.3.1
- 2.2.1
- 2.1.2 (default)
- 2.1.1
- 2.1.0
- 2.0.1
Prompt Tuning Requirements¶
- The PVC hf-models-pvc mounted under /hf-models-pvc
  - should contain the models:
    - LLaMa/models/hf/13B/
    - LLaMa/models/hf/7B/
    - LLaMa/models/hf/llama2-70b/
    - LLaMa/models/hf/llama3-70b/
    - LLaMa/models/hf/llama3-8b/
    - LLaMa/models/hf/llama3.1-405b/
    - LLaMa/models/hf/llama3.1-70b/
    - LLaMa/models/hf/llama3.1-8b/
    - Mixtral-8x7B-Instruct-v0.1/
    - allam-1-13b-instruct-20240607/
    - granite-13b-base-v2/step_300000_ckpt/
    - granite-20b-code-base-v2/step_280000_ckpt/
    - granite-34b-code-base/
    - granite-8b-code-base/
    - granite-8b-japanese-base-v1-llama/
    - mistralai-mistral-7b-v0.1/
    - mistral-large/fp16_240620
- The PVC ray-disorch-storage mounted under /data with the synthetic datasets of the SFTTrainer actuator
Prompt Tuning Entity space¶
Required:
- model_name: Supported models:
["granite-3b-1.5", "hf-tiny-model-private/tiny-random-BloomForCausalLM", "llama-7b", "granite-13b-v2", "llama-13b", "granite-20b-v2", "granite-7b-base", "granite-8b-japanese", "granite-8b-code-base", "granite-34b-code-base", "mistral-7b-v0.1", "llama3-8b", "llama3-70b", "mixtral-8x7b-instruct-v0.1", "llama2-70b", "llama3.1-8b", "llama3.1-70b", "llama3.1-405b", "granite-3b-code-base-128k", "granite-8b-code-base-128k", "allam-1-13b", "granite-3-8b", "granite-3.1-2b", "granite-3.1-8b-instruct", "mistral-123b-v2", "granite-3.1-3b-a800m-instruct", "granite-vision-3.2-2b", "smollm2-135m", "llava-v1.6-mistral-7b"]
- model_max_length: Maximum sequence length. Sequences will be right padded (and possibly truncated)
- number_gpus: The effective number of GPUs (to be evenly distributed to number_nodes machines)
- batch_size: the effective batch_size (will be evenly distributed to max(1, number_gpus) devices)
- gpu_model: The value of the kubernetes node label nvidia.com/gpu.prod, for example NVIDIA-A100-80GB-PCIe, NVIDIA-A100-SXM4-80GB, NVIDIA-H100-PCIe
Optional:
- dataset_id: Default is news-tokens-16384plus-entries-4096. Available options are:
  - news-chars-512-entries-4096: 4096 entries with samples of 512 + 127 (prompt) + 512 characters
  - news-chars-1024-entries-4096: 4096 entries with samples of 1024 + 127 (prompt) + 1024 characters
  - news-chars-2048-entries-4096: 4096 entries with samples of 2048 + 127 (prompt) + 2048 characters
  - news-tokens-16384plus-entries-4096: 4096 entries, each entry has at least 16384 tokens when tokenized with any of the granite-13b-v2, llama-13b-v2, llama-7b, or granite-20b-v2 tokenizers
  - vision-384x384-16384plus-entries-4096: A vision dataset containing 4096 entries. Each entry includes at least 16384 tokens when tokenized with granite-vision-3.2-2b, and consists of repeated copies of a single image with dimensions 384×384.
  - vision-384x768-16384plus-entries-4096: Similar to the above, this dataset also contains 4096 entries with a minimum of 16384 tokens per entry (tokenized using granite-vision-3.2-2b). Each entry uses repeated copies of a single image sized 384×768.
- gradient_checkpointing: Default is True. If True, use gradient checkpointing to save memory (i.e. higher batch sizes) at the expense of a slower backward pass
- gradient_accumulation_steps: Default is 4. Number of update steps to accumulate before performing a backward/update pass. Only takes effect when gradient_checkpointing is True
- torch_dtype: Default is bfloat16. One of bfloat16, float32, float16
- max_steps: Default is -1. The number of optimization steps to perform. Set to -1 to respect num_train_epochs instead.
- num_train_epochs: Default is 1.0. How many epochs to run. Ignored if max_steps is greater than 0.
- stop_after_seconds: Default is -1.0. If set, the optimizer will be asked to stop after the specified time elapses. The check is performed after the end of each training step.
- distributed_backend: Default is FSDP for multi-gpu measurements, None (i.e. Data Parallel (DP)) for single-gpu measurements. Which pytorch backend to use when training with multiple GPU devices.
- number_nodes: Default is 1. If set, the actuator distributes tasks on multiple nodes. Each node will use number_gpus/number_nodes GPUs, with 1 process for each GPU it uses.
- fms_hf_tuning_version: Default is 2.1.2. Which version of fms-hf-tuning to use. Available options are: 2.8.2, 2.7.1, 2.6.0, 2.5.0, 2.4.0, 2.3.1, 2.2.1, 2.1.2, 2.1.0, 2.0.1
- enable_roce: Default is False. This setting is only in effect for multi-node runs. It controls whether RDMA over Converged Ethernet (RoCE) is switched on or not.
- fast_moe: Default is 0. Configures the amount of expert parallel sharding. number_gpus must be divisible by it.
- fast_kernels: Default is None. Switches on fast kernels; the value is a list with strings of boolean values for [fast_loss, fast_rms_layernorm, fast_rope_embeddings]
- optim: Default is adamw_torch. The optimizer to use. Available options are adamw_hf, adamw_torch, adamw_torch_fused, adamw_torch_xla, adamw_torch_npu_fused, adamw_apex_fused, adafactor, adamw_anyprecision, adamw_torch_4bit, ademamix, sgd, adagrad, adamw_bnb_8bit, adamw_8bit, ademamix_8bit, lion_8bit, lion_32bit, paged_adamw_32bit, paged_adamw_8bit, paged_ademamix_32bit, paged_ademamix_8bit, paged_lion_32bit, paged_lion_8bit, rmsprop, rmsprop_bnb, rmsprop_bnb_8bit, rmsprop_bnb_32bit, galore_adamw, galore_adamw_8bit, galore_adafactor, galore_adamw_layerwise, galore_adamw_8bit_layerwise, galore_adafactor_layerwise, lomo, adalomo, grokadamw, schedule_free_adamw, schedule_free_sgd
- bf16: Default is False. Whether to use bf16 (mixed) precision instead of 32-bit. Requires Ampere or higher NVIDIA architecture, or using CPU (use_cpu), or Ascend NPU. This is an experimental API and it may change. Can be True, False.
- gradient_checkpointing_use_reentrant: Default is False. Specifies whether to use the activation checkpoint variant that requires reentrant autograd. This parameter should be passed explicitly. Torch version 2.5 will raise an exception if use_reentrant is not passed. If use_reentrant=False, checkpoint will use an implementation that does not require reentrant autograd, which allows checkpoint to support additional functionality, such as working as expected with torch.autograd.grad and support for keyword arguments input into the checkpointed function. Can be True, False.
- fsdp_sharding_strategy: Default is FULL_SHARD. [1] FULL_SHARD (shards optimizer states, gradients and parameters), [2] SHARD_GRAD_OP (shards optimizer states and gradients), [3] NO_SHARD (DDP), [4] HYBRID_SHARD (shards optimizer states, gradients and parameters within each node while each node has a full copy), [5] HYBRID_SHARD_ZERO2 (shards optimizer states and gradients within each node while each node has a full copy). For more information, please refer to the official PyTorch docs.
- fsdp_state_dict_type: Default is FULL_STATE_DICT. [1] FULL_STATE_DICT, [2] LOCAL_STATE_DICT, [3] SHARDED_STATE_DICT
- fsdp_use_orig_params: Default is True. If True, allows non-uniform requires_grad during init, which means support for interspersed frozen and trainable parameters. (Useful only when the use_fsdp flag is passed.)
- accelerate_config_mixed_precision: Default is no. Whether to use mixed precision training or not. Choose from no, fp16, bf16 or fp8. fp8 requires the installation of transformers-engine.
- accelerate_config_fsdp_transformer_layer_cls_to_wrap: Default is None. List of transformer layer class names (case-sensitive) to wrap, e.g. GraniteDecoderLayer, LlamaDecoderLayer, MistralDecoderLayer, BertLayer, GPTJBlock, T5Block ... (useful only when using FSDP)
- dataset_text_field: Default is None. Training dataset text field containing a single sequence. Either dataset_text_field or data_formatter_template needs to be supplied. For vision language model tuning, pass the column name of the text data.
- dataset_image_field: Default is None. For vision language model tuning, pass the column name of the image data in the dataset.
- remove_unused_columns: Default is True. Remove columns not required by the model when using an nlp.Dataset.
- dataset_kwargs_skip_prepare_dataset: Default is False. When True, configures trl to skip preparing the dataset.
Info
Because running accelerate with a single GPU is unsupported, when setting number_gpus to 1 this experiment actually runs the tuning.sft_trainer script directly (i.e. a DataParallel (DP) run).
finetune_gtpq-lora_benchmark-v1.0.0¶
Combines LoRA with GPTQ quantization to enable fine-tuning on quantized models. This benchmark is tailored for scenarios where model size and inference efficiency are critical, and it leverages fused kernels and quantized weights for performance.
Experiment documentation
An experiment instance:
- performs LoRA fine-tuning
- the training data is artificial
- use_flash_attn is set to True
- packing is set to False
- torch_dtype is set to float16, cannot be a different value
- uses the FSDP distributed backend for multi-gpu runs by default, can also be DDP
- multi-gpu runs with FSDP and DDP backends use 1 process per GPU (via accelerate)
- runs 1 epoch by default, can also run a custom number of steps
- does not save checkpoints
- loads weights from a PVC
- requests 2 CPU cores per GPU device (with a minimum of 2 cores)
- uses fms-acceleration plugins to perform GPTQ LoRA. Specifically:
  - auto_gptq is set to triton_v2
  - fast_kernels is set to True True True
  - fused_lora is set to auto_gptq True
  - torch_dtype is set to float16
- loads GPTQ compatible pre-quantized weights from a PVC
For FSDP runs we use the following accelerate_config.yml YAML file:
compute_environment: LOCAL_MACHINE
distributed_type: FSDP
downcast_bf16: 'no'
fsdp_config:
fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
fsdp_backward_prefetch: BACKWARD_PRE
fsdp_forward_prefetch: false
fsdp_offload_params: false
fsdp_sharding_strategy: ${fsdp_sharding_strategy}
fsdp_state_dict_type: ${fsdp_state_dict_type}
fsdp_cpu_ram_efficient_loading: true
fsdp_sync_module_states: true
fsdp_transformer_layer_cls_to_wrap: ${accelerate_config_fsdp_transformer_layer_cls_to_wrap}
machine_rank: {$THE MACHINE RANK - always 0 for single-node runs}
main_training_function: main
mixed_precision: ${accelerate_config_mixed_precision}
num_machines: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
main_process_port: {$SOME_PORT}
num_processes: {$NUM_GPUS}
For DDP runs we use this instead:
compute_environment: LOCAL_MACHINE
debug: False
downcast_bf16: no
distributed_type: MULTI_GPU
machine_rank: {$THE MACHINE RANK - always 0 for single-node runs}
main_training_function: main
mixed_precision: ${accelerate_config_mixed_precision}
num_machines: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
main_process_port: {$SOME_PORT}
num_processes: {$NUM_GPUS}
Commandline:
accelerate launch --config_file ${PATH_ACCELERATE_CONFIG} --num_processes ${NUMBER_GPUS} \
${PATH_TO_OUR_WRAPPER_OF_FMS_HF_TUNING_SFT_TRAINER} --model_name_or_path ${MODEL} \
--torch_dtype float16 --use_flash_attn True --training_data_path ${DATASET_PATH} \
--response_template "\n### Response:" --dataset_text_field output --log_level debug \
--num_train_epochs 1 --per_device_train_batch_size ${BATCH_SIZE/NUM_GPUS} \
--max_seq_length ${MODEL_MAX_LENGTH} --eval_strategy no --output_dir ${RANDOM_DIR} \
--gradient_accumulation_steps ${GRADIENT_ACCUMULATION_STEPS} --save_strategy no \
--learning_rate 1e-05 --weight_decay 0.0 --warmup_ratio 0.03 --lr_scheduler_type cosine \
--logging_steps 1 --include_tokens_per_second True --gradient_checkpointing True \
--packing False --peft_method lora --target_modules ${SPACE SEPARATED LAYER NAMES} \
--fp16 true --fast_kernels true true true --fused_lora auto_gptq true --auto_gptq triton_v2 \
--optim ${OPTIM} --bf16 ${BF16} \
--gradient_checkpointing_kwargs='{"use_reentrant": ${GRADIENT_CHECKPOINTING_USE_REENTRANT}}' \
--fast_moe ${FAST_MOE}
Note: --fast_moe is only supported for fms-hf-tuning v2.4.0+
We use a thin wrapper of sft_trainer.py which injects a custom Callback that exports the metrics collected by AIM. You can repeat our experiments by pointing the above command line to sft_trainer.py from the fms-hf-tuning package.
Versioning:
- Actuator version: 2.1.0
- fms-hf-tuning versions:
- 2.8.2
- 2.7.1
- 2.6.0
- 2.5.0
- 2.4.0
- 2.3.1
- 2.2.1
- 2.1.2 (default)
- 2.1.1
- 2.1.0
- 2.0.1
GPTQ LoRA Requirements¶
- The PVC hf-models-pvc mounted under /hf-models-pvc
  - should contain the models:
    - LLaMa/models/hf/7B-gptq/
    - LLaMa/models/hf/llama3-70b-gptq/
    - LLaMa/models/hf/llama3.1-405b-gptq/
    - granite-20b-code-base-v2/step_280000_ckpt-gptq/
    - granite-34b-gptq/
    - granite-7b-base-gtpq/
    - granite-8b-code-instruct-gptq/
    - mistral-7B-v0.3-gptq/
    - mixtral_8x7b_instruct_v0.1_gptq/
- The PVC ray-disorch-storage mounted under /data with the synthetic datasets of the SFTTrainer actuator
GPTQ LoRA Entity space¶
Required:
- model_name: Supported models:
["llama-7b", "granite-20b-v2", "granite-7b-base", "granite-8b-code-instruct", "granite-34b-code-base", "mistral-7b-v0.1", "llama3-70b", "mixtral-8x7b-instruct-v0.1", "llama3.1-405b"]
- model_max_length: Maximum sequence length. Sequences will be right padded (and possibly truncated)
- number_gpus: The effective number of GPUs (to be evenly distributed to number_nodes machines)
- batch_size: the effective batch_size (will be evenly distributed to max(1, number_gpus) devices)
- gpu_model: The value of the kubernetes node label nvidia.com/gpu.prod, for example NVIDIA-A100-80GB-PCIe, NVIDIA-A100-SXM4-80GB, NVIDIA-H100-PCIe
Optional:
- dataset_id: Default is
news-tokens-16384plus-entries-4096
. Available options are: news-chars-512-entries-4096
: 4096 entries with samples of 512 + 127 (prompt) + 512 charactersnews-chars-1024-entries-4096
: 4096 entries with samples of 1024 + 127 (prompt) + 1024 charactersnews-chars-2048-entries-4096
: 4096 entries with samples of 2048 + 127 (prompt) + 2048 charactersnews-tokens-16384plus-entries-4096
: 4096 entries, each entry has least 16384 tokens when tokenized with any of the granite-13b-v2, llama-13b-v2, llama-7b, or granite-20b-v2 tokenizersvision-384x384-16384plus-entries-4096
: A vision dataset containing 4096 entries. Each entry includes at least 16384 tokens when tokenized withgranite-vision-3.2-2b
, and consists of repeated copies of a single image with dimensions 384×384.vision-384x768-16384plus-entries-4096
: Similar to the above, this dataset also contains 4096 entries with a minimum of 16384 tokens per entry (tokenized usinggranite-vision-3.2-2b
). Each entry uses repeated copies of a single image sized 384×768.- gradient_checkpointing: Default is
True
. IfTrue
, use gradient checkpointing to save memory (i.e. higher batchsizes) at the expense of slower backward pass - gradient_accumulation_steps: Default is 4. Number of update steps to accumulate before performing a backward/update pass. Only takes effect when gradient_checkpointing is True
- torch_dtype: Default is
float16
. One offloat16
- max_steps: Default is
-1
. The number of optimization steps to perform. Set to -1 to respect num_train_epochs instead. - num_train_epochs: Default is
1.0
. How many epochs to run. Ignored if max_steps is greater than 0. - stop_after_seconds: Default is
-1.0
. If set, the optimizer will be asked to stop after the specified time elapses. The check is performed after the end of each training step. - distributed_backend: Default is
FSDP
for multi-gpu measurements, None
(i.e. Data Parallel (DP)) for single-gpu measurements. Which pytorch backend to use when training with multiple GPU devices. - number_nodes: Default is
1
. If set, actuator distributes tasks on multiple nodes. Each Node will use number_gpus/number_nodes GPUs. Each Node will use 1 process for each GPU it uses - fms_hf_tuning_version: Default is
2.1.2
. Which version of fms-hf-tuning to use. Available options are:2.8.2
,2.7.1
,2.6.0
,2.5.0
,2.4.0
,2.3.1
,2.2.1
,2.1.2
,2.1.0
,2.0.1
- enable_roce: Default is
False
. This setting is only in effect for multi-node runs. It controls whether RDMA over Converged Ethernet (RoCE) is switched on or not. - fast_moe: Default is
0
. Configures the amount of expert parallel sharding. number_gpus must be divisible by it - fast_kernels: Default is
None
. Switches on fast kernels; the value is a list with strings of boolean values for [fast_loss, fast_rms_layernorm, fast_rope_embeddings]
- r: Default is
4
. The LoRA rank - lora_alpha: Default is
16
. The scaling factor for the LoRA weights. - optim: Default is
adamw_torch
. The optimizer to use. Available options are adamw_hf
,adamw_torch
,adamw_torch_fused
,adamw_torch_xla
,adamw_torch_npu_fused
,adamw_apex_fused
,adafactor
,adamw_anyprecision
,adamw_torch_4bit
,ademamix
,sgd
,adagrad
,adamw_bnb_8bit
,adamw_8bit
,ademamix_8bit
,lion_8bit
,lion_32bit
,paged_adamw_32bit
,paged_adamw_8bit
,paged_ademamix_32bit
,paged_ademamix_8bit
,paged_lion_32bit
,paged_lion_8bit
,rmsprop
,rmsprop_bnb
,rmsprop_bnb_8bit
,rmsprop_bnb_32bit
,galore_adamw
,galore_adamw_8bit
,galore_adafactor
,galore_adamw_layerwise
,galore_adamw_8bit_layerwise
,galore_adafactor_layerwise
,lomo
,adalomo
,grokadamw
,schedule_free_adamw
,schedule_free_sgd
- bf16: Default is False. Whether to use bf16 (mixed) precision instead of 32-bit. Requires Ampere or higher NVIDIA architecture, or using CPU (use_cpu) or Ascend NPU. This is an experimental API and it may change. Can be True, False.
- gradient_checkpointing_use_reentrant: Default is
False
Specify whether to use the activation checkpoint variant that requires reentrant autograd. This parameter should be passed explicitly. Torch version 2.5 will raise an exception if use_reentrant is not passed. If use_reentrant=False, checkpoint will use an implementation that does not require reentrant autograd. This allows checkpoint to support additional functionality, such as working as expected with torch.autograd.grad and support for keyword arguments input into the checkpointed function. Can beTrue
,False
. - fsdp_sharding_strategy: Default is
FULL_SHARD
. [1] FULL_SHARD (shards optimizer states, gradients and parameters), [2] SHARD_GRAD_OP (shards optimizer states and gradients), [3] NO_SHARD (DDP), [4] HYBRID_SHARD (shards optimizer states, gradients and parameters within each node while each node has a full copy), [5] HYBRID_SHARD_ZERO2 (shards optimizer states and gradients within each node while each node has a full copy). For more information, please refer to the official PyTorch docs. - fsdp_state_dict_type: Default is
FULL_STATE_DICT
. [1] FULL_STATE_DICT, [2] LOCAL_STATE_DICT, [3] SHARDED_STATE_DICT - fsdp_use_orig_params: Default is
True
. If True, allows non-uniform requires_grad
during init, which means support for interspersed frozen and trainable parameters. (useful only when use_fsdp
flag is passed). - accelerate_config_mixed_precision: Default is
no
. Whether to use mixed precision training or not. Choose from no, fp16, bf16 or fp8. fp8
requires the installation of transformers-engine. - accelerate_config_fsdp_transformer_layer_cls_to_wrap: Default is None. List of transformer layer class names (case-sensitive) to wrap, e.g.,
GraniteDecoderLayer
,LlamaDecoderLayer
,MistralDecoderLayer
,BertLayer
,GPTJBlock
,T5Block
... (useful only when using FSDP)
- dataset_text_field: Default is None. Training dataset text field containing a single sequence. Either dataset_text_field or data_formatter_template needs to be supplied. For vision language model tuning, pass the column name of the text data.
- dataset_image_field: Default is None. For running vision language model tuning pass the column name of the image data in the dataset.
- remove_unused_columns: Default is True. Remove columns not required by the model when using an nlp.Dataset.
- dataset_kwargs_skip_prepare_dataset: Default is False. When True, configures trl to skip preparing the dataset
Hardcoded:
Sets the --target_modules
layer names based on the model_name
:
- granite-8b-code-instruct: ["q_proj", "v_proj"]
- granite-7b-base: ["q_proj", "v_proj"]
- granite-20b-v2: ["c_attn", "c_proj"]
- granite-34b-code-base: ["c_attn", "c_proj"]
- llama-7b: ["q_proj", "k_proj"]
- llama3-70b: ["q_proj", "v_proj"]
- mistral-7b-v0.1: ["q_proj", "v_proj"]
- mixtral-8x7b-instruct-v0.1: ["q_proj", "v_proj"]
- llama3.1-405b: ["q_proj", "v_proj"]
- allam-1-13b: ["q_proj", "v_proj"]
Info
Because running accelerate with a single GPU is unsupported, when setting number_gpus to 1 this experiment actually runs the tuning.sft_trainer script directly (i.e. a DataParallel (DP) run).
Actuator Parameters¶
This section describes the fields you may optionally configure in your actuatorconfiguration
resource for the SFTTrainer
actuator.
Example Actuator Configuration YAML¶
actuatorIdentifier: SFTTrainer
parameters:
match_exact_dependencies: true
output_dir: "output"
data_directory: "/data/fms-hf-tuning/artificial-dataset/"
hf_home: "/hf-models-pvc/huggingface_home"
model_map:
granite-3.1-2b:
Vanilla: "ibm-granite/granite-3.1-2b-base"
num_tokens_cache_directory: "cache"
Configuration Fields¶
match_exact_dependencies
(bool, default: true
)¶
- Description: If
true
, the measurement runs in a virtual environment that exactly matches the Python packages of the selectedfms-hf-tuning
version. This enables all optional features likefast_kernels
,fast_moe
, andflash_attn
. - Set to
false
if running on devices with limited support (e.g., MacBooks or ARM CPUs), to avoid incompatible packages and features that depend on using NVIDIA GPUs.
output_dir
(str, default: "output"
)¶
- Description: Directory prefix where the fine-tuned model weights will be saved.
data_directory
(str, default: "/data/fms-hf-tuning/artificial-dataset/"
)¶
- Description: Path to the directory containing the dataset files used for fine-tuning.
aim_db
(str, default: None)¶
- Description: Endpoint of the AIM server used to log training metrics. When set to None the measurement will use a temporary AIM repository that will be garbage collected after the termination of the measurement.
aim_dashboard_url
(str or null, optional)¶
- Description: URL of the AIM dashboard. If set, this will be included in the metadata of the measurement results.
- Example:
"http://aim-dashboard.example.com"
hf_home
(str, default: "/hf-models-pvc/huggingface_home"
)¶
- Description: Directory where Hugging Face stores authentication tokens and model cache.
model_map
(dict, optional)¶
- Description: Maps model identifiers to their corresponding Hugging Face model ids and absolute paths. The contents of this dictionary will override the defaults that ship with the Actuator.
- Example:
model_map:
  granite-3.1-2b:
    Vanilla: "ibm-granite/granite-3.1-2b-base"
num_tokens_cache_directory
(str or null, default: "cache"
)¶
- Description: Directory used to cache token counts for datasets. This avoids recomputing token counts, which can be time-consuming. Relative paths are resolved under
@data_directory
. - Set to
null
to disable caching.
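For instance, a sketch of an actuatorconfiguration that disables the token-count cache and points metrics at an external AIM deployment could look like the following; the AIM endpoint and dashboard URL are placeholders, not values shipped with the actuator:
actuatorIdentifier: SFTTrainer
parameters:
  # Placeholder AIM remote-tracking endpoint and dashboard URL
  aim_db: "aim://aim-server.example.com:53800"
  aim_dashboard_url: "http://aim-dashboard.example.com"
  # Disable the token-count cache
  num_tokens_cache_directory: null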
Measured Properties¶
Each experiment collects detailed runtime and system-level metrics using AIM. The AIM metrics are aggregated into the following before being stored in ado's database:
GPU Metrics¶
- Compute Utilization:
min
,avg
,max
(%) - Memory Utilization:
min
,avg
,max
,peak
(%) - Power Usage:
min
,avg
,max
(Watts and %)
CPU Metrics¶
- Compute Utilization: Average CPU usage per core (%)
- Memory Utilization: Average memory usage of the training process (%)
Training Performance¶
- train_runtime: Duration in seconds from the start of the first training step to the end of the last training step.
- train_samples_per_second: May be inaccurate, as HuggingFace estimates it with a heuristic.
- train_steps_per_second: May be inaccurate, as HuggingFace estimates it with a heuristic.
- train_tokens_per_second: May be inaccurate, as HuggingFace estimates it with a heuristic.
- train_tokens_per_gpu_per_second: May be inaccurate, as HuggingFace estimates it with a heuristic.
- dataset_tokens_per_second: The actuator computes this accurately.
- dataset_tokens_per_second_per_gpu: The actuator computes this accurately.
Info
We report all system metrics as min/avg/max over the duration of the run. GPU metrics are collected per device; CPU metrics are collected for the training process. Token throughput accounts for padding and sequence truncation.
Validation¶
Each experiment includes a computed is_valid
flag that indicates whether the run was structurally and functionally valid. A run is marked invalid if any of the following conditions are met:
Configuration Errors¶
- batch_size is not evenly divisible by number_gpus
- number_gpus is not evenly divisible by number_nodes
- number_nodes is less than 1
- batch_size is less than 1
- gpu_model is missing or empty when number_gpus > 0
Incompatible Mixture of Experts (MoE) Settings¶
- fast_moe is set but number_gpus is not divisible by it
- fast_moe is set but the model’s num_local_experts is not divisible by fast_moe
Runtime Failures¶
- The run raises a torch.cuda.OutOfMemoryError (considered invalid due to GPU memory exhaustion)
- The run raises a RuntimeError: CUDA error: an illegal memory access was encountered exception (considered invalid due to GPU memory exhaustion)
- The run raises other exceptions (e.g., RuntimeError with NCCL Error) - these are marked as failed and do not record any metrics
Note: Failed runs are not persisted into ado's database. Restarting an operation will cause ado to retry them.
This validation logic ensures that only meaningful and resource-compatible runs are included in the information we store in ado's database.
Configure your RayCluster¶
Running SFTTrainer experiments requires a RayCluster whose GPU workers expose the custom resources described below, plus access to the synthetic datasets and cached model weights the experiments expect.
Use the information below to deploy your RayCluster.
Annotating GPU workers with custom resources¶
The SFTTrainer
actuator leverages Ray's custom resource scheduling to efficiently allocate GPU-powered tasks to workers equipped with the appropriate hardware. It uses the following custom resources:
Custom Resource Types¶
-
full-worker
Some Ray tasks require exclusive access to an entire node. These tasks request thefull-worker
resource. GPU workers that occupy a full node should have exactly onefull-worker
custom resource. -
${GPU_MODEL}
This custom resource key corresponds to the specific GPU model available on the node, with the value indicating the number of devices. Supported GPU models include:NVIDIA-A100-SXM4-80GB
NVIDIA-A100-80GB-PCIe
NVIDIA-H100-80GB-HBM3
NVIDIA-H100-PCIe
Tesla-V100-PCIE-16GB
Tesla-T4
L40S
-
RoCE
Tasks that utilize RDMA over Converged Ethernet (RoCE) request theRoCE
resource. For guidance on configuring RoCE in your RayCluster, refer to the instructions linked at the bottom of this page.
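As a rough preview (a complete worker manifest appears in the RoCE section below), the rayStartParams of a GPU worker group could advertise these custom resources as follows; the GPU model and counts are illustrative:
workerGroupSpecs:
  - groupName: eight-A100-80G-gpu-WG
    rayStartParams:
      num-gpus: "8"
      # Advertise 8 A100 devices, one full-worker token and one RoCE slot
      # (the value is a JSON string, escaped for YAML)
      resources: "\"{\\\"NVIDIA-A100-SXM4-80GB\\\": 8, \\\"full-worker\\\": 1, \\\"RoCE\\\": 1}\""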
Creating the datasets¶
The SFTTrainer actuator supports both text-to-text and image-to-text tuning experiments. Installing the actuator provides access to 2 command-line utilities for generating synthetic datasets.
By default, the actuator expects the dataset files under /data/fms-hf-tuning/artificial-dataset/
You can override this path by setting the data_directory
parameter via an ActuatorConfiguration resource and referencing it in the Operations you create. We include a link to the relevant documentation at the bottom of this page.
Dataset for text-to-text tasks¶
For text-to-text tasks, create a dataset file with the name news-tokens-16384plus-entries-4096.jsonl
.
Use the following command:
sfttrainer_generate_dataset_text -o /data/fms-hf-tuning/artificial-dataset/news-tokens-16384plus-entries-4096.jsonl
If you are working with a remote RayCluster, run this as a remote Ray job using a Ray runtime environment that contains the python package for the SFTTrainer actuator. At the bottom of this page you will find a link to our documentation on submitting remote Ray jobs that use the code of Actuators.
Note
If your RayCluster worker nodes already have the SFTTrainer wheel installed, you can skip building the wheel and using a Ray runtime environment. Go directly to the ray job submit step, and change the command line so that it does not use the ray_runtime_env.yaml file (see the sketch below).
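For instance, assuming the wheel is already installed on the workers, the dataset-generation job shown further down could be submitted without a runtime environment; the address and output path are the same placeholders used below:
ray job submit --address http://localhost:8265 --working-dir $PWD -v -- \
    sfttrainer_generate_dataset_text \
    -o /data/fms-hf-tuning/artificial-dataset/news-tokens-16384plus-entries-4096.jsonl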
For example, build the wheel file for SFTTrainer and create the following ray_runtime_env.yaml
:
pip:
- ${RAY_RUNTIME_ENV_CREATE_WORKING_DIR}/sfttrainer-0.9.4.dev84+g1ab8f43d-py3-none-any.whl
env_vars:
PYTHONUNBUFFERED: "x"
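If you still need to produce the wheel referenced above, a standard PEP 517 build from a checkout of the SFTTrainer actuator source should generate it under dist/; the exact project location and resulting file name depend on your checkout and are assumptions here:
# Run inside the directory that contains the actuator's pyproject.toml (assumed layout)
pip install build
python -m build --wheel
# The wheel is written to dist/, e.g. dist/sfttrainer-<version>-py3-none-any.whl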
Note
Your wheel file will have a different name so update the ray_runtime_env.yaml
file accordingly. Make sure you keep the ${RAY_RUNTIME_ENV_CREATE_WORKING_DIR}/
prefix.
Then start a Ray job that executes sfttrainer_generate_dataset_text, points to your remote RayCluster, and references your ray_runtime_env.yaml file. For example, if your RayCluster is listening on http://localhost:8265, run the following command in the same directory as your ray_runtime_env.yaml file:
Info
If you are using a remote RayCluster on Kubernetes remember to start a port-forward to the RayCluster head node.
ray job submit --address http://localhost:8265 --runtime-env ray_runtime_env.yaml --working-dir $PWD -v -- \
sfttrainer_generate_dataset_text \
-o /data/fms-hf-tuning/artificial-dataset/news-tokens-16384plus-entries-4096.jsonl
Dataset for image-to-text tasks¶
SFTTrainer supports 2 datasets for image-to-text tasks:
vision-384x384-16384plus-entries-4096.parquet
vision-384x768-16384plus-entries-4096.parquet
To create the dataset files use the same ray_runtime_env.yaml
file as above but this time start 2 Ray Jobs:
ray job submit --address http://localhost:8265 --runtime-env ray_runtime_env.yaml --working-dir $PWD -v -- \
sfttrainer_generate_dataset_vision --image-width 384 --image-height 384 \
-o /data/fms-hf-tuning/artificial-dataset/vision-384x384-16384plus-entries-4096.parquet
ray job submit --address http://localhost:8265 --runtime-env ray_runtime_env.yaml --working-dir $PWD -v -- \
sfttrainer_generate_dataset_vision --image-width 384 --image-height 768 \
-o /data/fms-hf-tuning/artificial-dataset/vision-384x768-16384plus-entries-4096.parquet
Model Weights¶
The actuator supports model weights from both the HuggingFace repository and local directories. You can find the full list of supported models in the models.yaml file.
Note
The actuator attempts to cache Hugging Face model weights the first time it runs an operation that references them. To avoid race conditions when running multiple experiments with the same weights, we recommend pre-fetching the weights in advance.
Identify the models you want to cache and then create a models.yaml
file structured as a double-nested dictionary.
- The outer dictionary keys are the names of the models.
- Each inner dictionary maps model weight types to their corresponding Hugging Face identifiers.
Supported model weight types include:
Vanilla
QPTQ-Quantized
Here’s a simple example that caches the HuggingFaceTB/SmolLM2-135M
model weights from HuggingFace:
smollm2-135m:
Vanilla: HuggingFaceTB/SmolLM2-135M
Next, choose a directory to use as your HuggingFace home. By default, SFTTrainer uses /hf-models-pvc/huggingface_home
. To override this, set the hf_home
parameter in your ActuatorConfiguration resource just like you did for overriding the location of dataset files.
For example, to cache the model weights under /my/hf_home/
use the following command:
sfttrainer_download_hf_weights -i models.yaml -o /my/hf_home
If you are working with a remote RayCluster then submit a Ray job similar to the above section for generating datasets:
Info
If you are using a remote RayCluster on Kubernetes remember to start a port-forward to the RayCluster head node.
ray job submit --address http://localhost:8265 --runtime-env ray_runtime_env.yaml --working-dir $PWD -v -- \
sfttrainer_download_hf_weights -i models.yaml -o /my/hf_home
Configure your RayCluster for RDMA over Converged Ethernet (RoCE)¶
RoCE enables high-throughput, low-latency communication between GPU nodes distributed on multiple nodes by bypassing the kernel and reducing CPU overhead. This is especially beneficial for multi-node AI workloads that rely on fast inter-GPU communication, such as distributed training with NVIDIA NCCL.
To enable RoCE in a RayCluster on Kubernetes, you need to:
- Build a GPU worker image with the necessary OFED and NCCL libraries.
- Configure the RayCluster custom resource to:
- Set environment variables for NCCL that switch on the RoCE feature.
- Mount the NCCL topology file to ensure optimal GPU-to-GPU communication paths are used during collective operations.
- Ensure the Kubernetes nodes and network are RoCE-capable and properly configured.
The prerequisites for enabling RoCE in the GPU workers of a RayCluster on Kubernetes are:
Prerequisites¶
- The system administrator has configured the GPU nodes and network infrastructure to support RoCE, including BIOS, firmware, switch settings, and lossless Ethernet features.
- The NCCL topology file is provided by the system administrator to optimize GPU communication paths.
- The Kubernetes administrator has granted the RayCluster service account appropriate RBAC permissions and PodSecurity settings necessary for RoCE. In this example we will:
- Run containers as
root
. - Use the
IPC_LOCK
capability to lock memory.
- The device plugin for RoCE-capable NICs (e.g., NVIDIA Network Operator or custom RDMA plugin) is installed and configured on the cluster.
- The GPU worker has the required drivers and libraries. In this example, we will deploy Ray on a Kubernetes cluster. Thus, our image will contain:
- OFED modules
- the NVIDIA and NCCL runtime binaries
- The system administrator has shared the number of GPUs and RoCE-capable NICs available per node to guide resource requests and topology mapping.
- The Kubernetes administrator has explained how to:
- Request RoCE devices (e.g.,
nvidia.com/roce_gdr: 2
) - Enable pods to access the RDMA-enabled network zones
- Schedule GPU workers on the correct nodes, e.g. via labels, taints, affinity rules, etc.
Install the required libraries and drivers¶
This example walks you through deploying a RayCluster on Kubernetes, including building a custom image for the GPU worker nodes. We’ll use the mirror.gcr.io/rayproject/ray:latest-py310-cu121
base image, which includes both Ray and the necessary NVIDIA libraries.
ARG base_image=mirror.gcr.io/rayproject/ray:latest-py310-cu121
FROM $base_image
USER 0
ENV MOFED_VER=24.10-1.1.4.0
ENV OS_VER=ubuntu22.04
ENV PLATFORM=x86_64
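# Install the Mellanox OFED user-space libraries required for RDMA/RoCE (kernel drivers are provided by the host)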
RUN mkdir app && \
cd app && \
wget -q http://content.mellanox.com/ofed/MLNX_OFED-${MOFED_VER}/MLNX_OFED_LINUX-${MOFED_VER}-${OS_VER}-${PLATFORM}.tgz && \
tar -xvzf MLNX_OFED_LINUX-${MOFED_VER}-${OS_VER}-${PLATFORM}.tgz && \
cd MLNX_OFED_LINUX-${MOFED_VER}-${OS_VER}-${PLATFORM} && \
./mlnxofedinstall --user-space-only --without-fw-update --without-ucx-cuda --all --force --distro $OS_VER && \
cd .. && \
rm -rf MLNX* && \
apt-get -y clean && \
rm -rf /var/lib/apt/lists/*
Note
Mellanox OFED is now in long-term support and will reach end-of-life in Q4 2027. NVIDIA has replaced it with DOCA-OFED, which will receive all future updates and features. This example currently uses MLNX_OFED, but we’ll update it with DOCA-OFED installation steps in a future revision.
Collect all necessary information¶
Identify RoCE-Capable Network Devices¶
To determine which network interfaces support RoCE v2, run the show_gids
command on a GPU node. Look for entries where the VER
column is v2
, which indicates RoCE v2 support.
For example, given the following output:
DEV PORT INDEX GID IPv4 VER DEV
--- ---- ----- --- ------------ --- ---
mlx5_0 1 0 fe80:0000:0000:0000:0000:60ff:fe68:d096 v1 net1-0
mlx5_0 1 1 fe80:0000:0000:0000:0000:60ff:fe68:d096 v2 net1-0
...
mlx5_3 1 1 fe80:0000:0000:0000:0000:5fff:fe68:d09a v2 net1-1
You should select the devices with v2
under the VER
column. In this case, the RoCE-capable devices are:
mlx5_0_1
mlx5_3_1
You will use these device names to set the NCCL_IB_HCA
environment variable in your Ray GPU worker pods. For the above example you will set NCCL_IB_HCA="=mlx5_0,mlx5_3"
You also need to configure NCCL_IB_GID_INDEX
. Select the GID index such that it maps to a v2 entry across all nodes to ensure consistent behavior. For the above example you will set NCCL_IB_GID_INDEX=1
Putting it all together¶
In this section we will use the information we gathered above to define a Ray GPU worker with support for RoCE.
Summary of Steps¶
- Enable memory locking in containers
- Request the
IPC_LOCK
capability in your container’s security context. - Use a
ServiceAccount
(e.g.gdr
) that grants permission to requestIPC_LOCK
. - To allow unlimited memory locking:
- Option A: Run the container as root (in the example we assume that the
roce
service account has adequate RBAC to request this). - Option B: Configure the node with
ulimit -l unlimited
(not available on Vela).
- Request the
- Attach and request RoCE-capable NICs
- On our cluster:
- We add the annotation:
k8s.v1.cni.cncf.io/networks: multi-nic-network
- Request RoCE devices:
nvidia.com/roce_gdr: 2
- We add the annotation:
- On our cluster:
- Set NCCL environment variables
- Configure variables like
NCCL_IB_HCA
,NCCL_IB_GID_INDEX
, and others to enable RoCE and optimize performance.
- Configure variables like
- Mount the NCCL topology file
- Mount the
topology-roce
ConfigMap at/var/run/nvidia-topologyd
.
- Mount the
# ... trimmed ...
workerGroupSpecs:
- rayStartParams:
block: "true"
num-gpus: "8"
# VV: We'll use the RoCE custom resource to ensure the jobs land on a properly configured node for RoCE
# we support running up to 1 RoCE measurement. Similarly, we have a custom resource called
# "full-worker" for reserving the entire GPU worker if necessary.
resources: "\"{\\\"NVIDIA-A100-SXM4-80GB\\\": 8, \\\"full-worker\\\": 1, \\\"RoCE\\\": 1}\""
# Here, we configure an eight-GPU A100 worker group that can have up to 4 replicas
replicas: 4
minReplicas: 4
maxReplicas: 4
numOfHosts: 1
groupName: eight-A100-80G-gpu-WG
template:
metadata:
annotations:
# We use this annotation on our cluster to get access to the appropriate network zone
k8s.v1.cni.cncf.io/networks: multi-nic-network
labels:
helm.sh/chart: ray-cluster-1.1.0
app.kubernetes.io/instance: ray-disorch
# ServiceAccount gives your pod adequate RBAC to request the IPC_LOCK capability and run as root
serviceAccountName: roce
# RoCE requires root privileges.
# An alternative to using a root account is to request the capability CAP_SYS_RESOURCE and
# run `ulimit -l unlimited` before starting up the Ray worker
securityContext:
fsGroup: 0
runAsGroup: 0
runAsNonRoot: false
runAsUser: 0
volumes:
- name: topology-volume
configMap:
name: topology-roce
- name: dshm
emptyDir:
medium: Memory
# Add your remaining PVCs here e.g. a GPFS volume for storing the
# HF_HOME path that you will use with "accelerate launch" etc
containers:
- name: pytorch
image: $YOUR_ROCE_ENABLED_IMAGE_HERE
imagePullPolicy: Always
securityContext:
allowPrivilegeEscalation: false
capabilities:
add:
- IPC_LOCK # for RoCE to work
env:
# To enable RoCE
- name: NCCL_IB_HCA
value: =mlx5_0,mlx5_3
- name: NCCL_IB_GID_INDEX
value: "1"
- name: NCCL_IB_DISABLE
value: "0" # Set this to "1" to disable RoCE
# To visually verify that RoCE is On based on the logs that NCCL prints
- name: NCCL_DEBUG
value: INFO
- name: NCCL_DEBUG_SUBSYS
value: "INIT,BOOTSTRAP,ENV"
# Remaining NCCL environment variables we use on our cluster
- name: NCCL_IB_QPS_PER_CONNECTION
value: "8"
- name: NCCL_IB_SPLIT_DATA_ON_QPS
value: "0"
- name: NCCL_IB_PCI_RELAXED_ORDERING
value: "1"
- name: NCCL_ALGO
value: Ring
- name: NCCL_IGNORE_CPU_AFFINITY
value: "1"
- name: NCCL_SOCKET_NTHREADS
value: "2"
- name: NCCL_CROSS_NIC
value: "0"
- name: OMP_NUM_THREADS
value: "16"
volumeMounts:
- name: topology-volume
mountPath: /var/run/nvidia-topologyd
# Your other volumemounts here
- name: dshm
mountPath: "/dev/shm"
resources:
# Here we are requesting an entire node
requests:
cpu: 60
nvidia.com/gpu: 8
memory: 720Gi
nvidia.com/roce_gdr: 2
limits:
cpu: 60
nvidia.com/gpu: 8
memory: 720Gi
nvidia.com/roce_gdr: 2
Note
We recommend enabling RoCE only for GPU workers that occupy an entire Kubernetes node. This ensures that multi-node jobs are using separate Kubernetes nodes, allowing RoCE to be effectively utilized.
Verify you're using RoCE:¶
Remember, RoCE only applies to multi-node jobs. To verify that it’s working, run a multi-node NCCL job and inspect the logs. If your GPU workers are properly configured for RoCE, you should see output similar to the snippet below. We’ve annotated the important lines with <--
and added comments to highlight what to look for.
[0] NCCL INFO NCCL_IB_DISABLE set by environment to 0. <-- double check that this is set to 0
[4] NCCL INFO NCCL_IB_HCA set to mlx5_0,mlx5_3 <-- This does not confirm that you are using RoCE
[3] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_3:1/RoCE [RO]; OOB net1-0:1.2.3.21<0> <-- Name of the NICs and /RoCE
[3] NCCL INFO Using non-device net plugin version 0
[3] NCCL INFO Using network IB <-- Uses the IB network
NCCL falls back to "Socket" network when RoCE is unavailable. In this scenario your log output will be similar to the snippet below.
[1] NCCL INFO NCCL_IB_DISABLE set by environment to 0. <-- if this is set to 1, you will NOT use RoCE even
if it is properly configured
[1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to net1-0,net1-1
[1] NCCL INFO NCCL_IB_HCA set to mlx5_0,mlx5_3
[1] NCCL INFO NET/IB : No device found. # <-- No InfiniBand network found
[1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to net1-0,net1-1
[1] NCCL INFO NET/Socket : Using [0]net1-0:1.2.3.30<0> [1]net1-1:1.2.4.30<0> # <-- No mention of /RoCE or the NICs
[1] NCCL INFO Using non-device net plugin version 0
[2] NCCL INFO Using network Socket # <-- Switches to TCP
Note
You might see warnings indicating that NCCL failed to load certain .so files. These messages are harmless and unrelated to RoCE configuration. You can ignore them safely.
Next steps:¶
- ⚙️ Customize Actuators using ActuatorConfiguration resources: learn how to use ActuatorConfiguration resources to customize the SFTTrainer Operations.
- 🖥️ Ready to try it out? The SFTTrainer actuator can run experiments locally as well; follow the example below to get started.
- Take it to the next level: do you have a RayCluster with GPUs in it?