Exploring vLLM deployment configurations¶
Note
This example illustrates using the vllm-performance actuator to discover how best to deploy vLLM for a given use-case.
Important
Prerequisites
- Access to a k8s namespace where you can deploy vLLM
The scenario¶
When deploying vLLM, you need to choose values for parameters like GPU type, batch size, and memory limits. These choices directly affect performance, cost, and scalability. To find the best configuration for your workload, whether you're optimizing for latency, throughput, or cost, you need to explore the deployment parameter space.
In this example, we will:

- define a space of vLLM deployment configurations to test with the `vllm_performance` actuator's `performance-testing-full` experiment. This experiment can create and characterize a vLLM deployment on Kubernetes
- use the `random_walk` operator to explore the space
Install the actuator¶
Execute:
pip install -e plugins/actuators/vllm_performance
in the root of the ado source repository. You can clone the repository with
git clone https://github.com/IBM/ado.git
Verify the installation with:
ado get actuators --details
The actuator vllm_performance will appear in the list of available actuators.
Create an actuator configuration¶
The vllm-performance actuator needs some information about the target cluster to deploy on. This is provided via an actuatorconfiguration.
First execute,
# Generate the template file
ado template actuatorconfiguration --actuator-identifier vllm_performance
This will create a template configuration file.
Edit the file and set correct values for the following fields:
hf_token: <your HuggingFace access token>
namespace: vllm-testing # OpenShift namespace you have write access to
node_selector: '{"kubernetes.io/hostname":"<host-with-gpu>"}' # JSON string selecting a node that has a GPU
Then save this configuration as an actuatorconfiguration resource:
ado create actuatorconfiguration -f $CONFIG_FILE
Record the identifier of the created actuatorconfiguration as it will be used later.
Tip
You can create multiple actuator configurations corresponding to different clusters/target environments. You choose which one to use when you launch an operation requiring the actuator.
Define the configurations to test¶
When exploring vLLM deployments there are two sets of parameters that can be varied:
- the deployment creation parameters (number of GPUs, memory allocated, etc.)
- the benchmark test parameters (requests per second to send, tokens per request, etc.)
In this case we define a space where we look at the impact of a few vLLM deployment parameters, including `max_num_seq` and `max_batch_tokens`, for a scenario where requests arrive at between 1 and 10 per second with sizes of around 2000 tokens.
sampleStoreIdentifier: 2963a5
entitySpace:
- identifier: model
propertyDomain:
values:
- ibm-granite/granite-3.3-8b-instruct
- identifier: image
propertyDomain:
values:
- quay.io/dataprep1/data-prep-kit/vllm_image:0.1
- identifier: "number_input_tokens"
propertyDomain:
values: [1024, 2048, 4096]
- identifier: "request_rate"
propertyDomain:
domainRange: [1,10]
interval: 1
- identifier: n_cpus
propertyDomain:
domainRange: [2,16]
interval: 2
- identifier: memory
propertyDomain:
values: ["128Gi", "256Gi"]
- identifier: "max_batch_tokens"
propertyDomain:
values: [1024, 2048, 4096, 8192, 16384, 32768]
- identifier: "max_num_seq"
propertyDomain:
values: [16,32,64]
- identifier: "n_gpus"
propertyDomain:
values: [1]
- identifier: "gpu_type"
propertyDomain:
values: ["NVIDIA-A100-80GB-PCIe"]
experiments:
- actuatorIdentifier: vllm_performance
experimentIdentifier: performance-testing-full
metadata:
description: A space of vllm deployment configurations
name: vllm_deployments
Save the above as vllm_discoveryspace.yaml. Then, if you have an existing samplestore, run
ado create space -f vllm_discoveryspace.yaml --set sampleStoreIdentifier=$SAMPLE_STORE_ID
otherwise create a new one:
ado create space -f vllm_discoveryspace.yaml --new-sample-store
Record the identifier of the created discoveryspace as it will be used in next section.
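Before launching an exploration it can help to estimate how many points the space contains. The following is a minimal Python sketch, assuming the `domainRange` bounds are inclusive and sampled at the given `interval` (check the ado documentation for the exact range semantics):

```python
from math import prod

# Sizes of the discrete dimensions taken from the entity space above
discrete_sizes = {
    "model": 1,
    "image": 1,
    "number_input_tokens": 3,
    "memory": 2,
    "max_batch_tokens": 6,
    "max_num_seq": 3,
    "n_gpus": 1,
    "gpu_type": 1,
}

# Ranged dimensions, assuming domainRange bounds are inclusive
request_rate_values = len(range(1, 10 + 1, 1))  # 10 values
n_cpus_values = len(range(2, 16 + 1, 2))        # 8 values

total = prod(discrete_sizes.values()) * request_rate_values * n_cpus_values
print(total)  # 8640 candidate configurations under these assumptions
```

Even at this modest size, measuring every point takes time, which is why the grouped sampling strategy in the next section matters.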
Explore the space with random_walk¶
Next we'll scan this space sequentially using a grouped sampler to increase efficiency. The grouped sampler ensures we explore all the different benchmark configurations for a given vLLM deployment before creating a new deployment, minimising the number of deployment creations.
metadata:
name: randomwalk-grouped-vllm-performance-full
spaces:
- space-230d24-03b22d
actuatorConfigurationIdentifiers:
- actuatorconfiguration-vllm_performance-09fcdf30
operation:
module:
moduleClass: RandomWalk
parameters:
numberEntities: all
batchSize: 1
samplerConfig:
mode: 'sequentialgrouped'
samplerType: 'generator'
grouping: # A unique combination of these properties corresponds to a new vLLM deployment
- model
- image
- memory
- max_batch_tokens
- max_num_seq
- n_gpus
- gpu_type
- n_cpus
Save the above as random_walk.yaml. Then execute the operation:
ado create operation -f random_walk.yaml --set "spaces[0]=$DISCOVERY_SPACE_ID" --set "actuatorConfigurationIdentifiers[0]=$ACTUATOR_CONFIGURATION_IDENTIFIER"
where $DISCOVERY_SPACE_ID is the identifier of the discoveryspace you created in the previous step, and $ACTUATOR_CONFIGURATION_IDENTIFIER is the identifier of the actuatorconfiguration created earlier.
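The effect of the grouping can be illustrated in plain Python: sort the candidate entities by the deployment-defining properties so that each unique combination, i.e. each vLLM deployment, is created exactly once. This is a toy sketch of the idea, not the operator's actual implementation:

```python
from itertools import groupby

# Toy entities: two deployment parameters and one benchmark parameter.
# In the real space the grouping keys are model, image, memory, etc.
entities = [
    {"max_num_seq": s, "memory": m, "request_rate": r}
    for r in (1, 2)
    for s in (16, 32)
    for m in ("128Gi", "256Gi")
]

group_keys = ("max_num_seq", "memory")  # deployment-defining properties


def deployment_key(entity):
    return tuple(entity[k] for k in group_keys)


deployments = 0
for key, members in groupby(sorted(entities, key=deployment_key), key=deployment_key):
    deployments += 1  # create one vLLM deployment per unique key combination
    for entity in members:
        pass  # run each benchmark configuration against the live deployment

print(deployments)  # 4 deployments instead of one teardown/create per entity (8)
```

With a non-grouped sampler the same walk could create and tear down a deployment for every sampled entity, which dominates the wall-clock time.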
Monitor the operation¶
While the operation is running you can watch the deployment:
# In a separate terminal
oc get deployments --watch -n vllm-testing
You can see the measurement requests as the operation runs by executing (in another terminal):
ado show requests operation $OPERATION_ID
and the results (this outputs the entities in sampled order):
ado show entities operation $OPERATION_ID
The $OPERATION_ID is output just before the sampling starts. Assuming no other operation has been started since, it will also be the last id output by
ado get operations
Check final results¶
When the output indicates that the experiment has finished, you can inspect the results of all operations run so far on the space with:
ado show entities space $DISCOVERY_SPACE_ID --output-format csv
Note
At any time after an operation, $OPERATION_ID, is finished you can run ado show entities operation $OPERATION_ID to see the sampling time-series of that operation.
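Once exported, the CSV can be analysed with standard tools. The following is a small sketch using only the Python standard library; the `throughput` column and the values shown are illustrative assumptions, so inspect the CSV header for the property names your experiment actually measured:

```python
import csv
import io

# Stand-in for the file produced by:
#   ado show entities space $DISCOVERY_SPACE_ID --output-format csv
csv_text = """max_batch_tokens,max_num_seq,throughput
1024,16,1100.5
8192,32,1850.2
32768,64,1620.9
"""

rows = list(csv.DictReader(io.StringIO(csv_text)))

# Pick the configuration with the highest measured throughput
# (the column name is an assumption, not guaranteed by the experiment)
best = max(rows, key=lambda row: float(row["throughput"]))
print(best["max_batch_tokens"], best["max_num_seq"])  # 8192 32
```

The same pattern extends to filtering by cost or latency columns, depending on which objective you are optimizing for.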
Next steps¶
- Try varying `max_batch_tokens` or `gpu_memory_utilization` to explore the impact on throughput.
- Try creating a different `actuatorconfiguration` with more `max_environments` and running the random walk with a non-grouped sampler.
- Replace the model with a different HF checkpoint to compare performance.
- Use RayTune (see the vLLM endpoint performance example) to optimise the hyperparameters of the benchmark.