Exploring vLLM deployment configurations¶
The scenario¶
In this example, the vllm_performance actuator is used to evaluate different vLLM server deployment configurations on Kubernetes/OpenShift.
When deploying vLLM, you must choose values for parameters like GPU type, batch size, and memory limits. These choices directly affect performance, cost, and scalability. To find the best configuration for your workload, whether you are optimizing for latency, throughput, or cost, you need to explore the deployment parameter space. In this example:
- We will define a space of vLLM deployment configurations to test with the vllm_performance actuator's performance_testing_full experiment. This experiment can create and characterize a vLLM deployment on Kubernetes.
- We will use the random_walk operator to explore the space.
Prerequisites¶
- Be logged-in to your Kubernetes/OpenShift cluster
- Have access to a namespace where you can create vLLM deployments
- Install the following Python packages locally:
pip install ado-vllm-performance
TL;DR¶
Get the files vllm_deployment_space.yaml, vllm_actuator_configuration.yaml and random_walk_operation_grouped.yaml from our repository.
You must edit vllm_actuator_configuration.yaml with your details. In particular the following two fields are important:
hf_token: <your HuggingFace access token> # Required to access gated models
namespace: vllm-testing # you MUST set this to a namespace where you can create vLLM deployments
Then, in a directory with these files, execute:
# Define the configurations to explore
ado create space -f vllm_deployment_space.yaml

# Create a configuration for the actuator - normally just once as it can be reused
ado create actuatorconfiguration -f vllm_actuator_configuration.yaml

# Explore!
ado create operation -f random_walk_operation_grouped.yaml --use-latest space --use-latest actuatorconfiguration
See the vllm_performance actuator documentation for more configuration options.
Verify the installation¶
Verify the installation with:
ado get actuators --details
The actuator vllm_performance should appear in the list of available actuators if installation completed successfully.
Create an actuator configuration¶
The vllm_performance actuator needs some information about the target cluster it will deploy to. This is provided via an actuatorconfiguration.
First execute:
ado template actuatorconfiguration --actuator-identifier vllm_performance -o vllm_actuator_configuration.yaml
This will create a file called vllm_actuator_configuration.yaml.
Edit the file and set correct values for at least the namespace field. Also consider whether you need to supply a value for hf_token:
hf_token: <your HuggingFace access token> # Required to access gated models
namespace: vllm-testing # you MUST set this to a namespace where you can create vLLM deployments
Then save this configuration as an actuatorconfiguration resource:
ado create actuatorconfiguration -f vllm_actuator_configuration.yaml
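If you script your setup, the two required fields can be filled in non-interactively. The sketch below is an assumption-laden illustration, not part of ado: the heredoc stands in for the template generated by the `ado template` command, and the token and namespace values are placeholders you would replace with your own.

```shell
# Sketch: fill in the two required fields non-interactively (GNU sed).
# HF_TOKEN and NAMESPACE below are placeholders, not real values.
HF_TOKEN="hf_xxx_replace_me"   # your HuggingFace access token
NAMESPACE="vllm-testing"       # a namespace where you can create vLLM deployments

# Stand-in for the file written by `ado template actuatorconfiguration ...`
cat > vllm_actuator_configuration.yaml <<'EOF'
hf_token: <your HuggingFace access token>
namespace: <namespace>
EOF

# Replace the placeholder lines in place
sed -i "s|^hf_token:.*|hf_token: ${HF_TOKEN}|" vllm_actuator_configuration.yaml
sed -i "s|^namespace:.*|namespace: ${NAMESPACE}|" vllm_actuator_configuration.yaml
```

After this, the file is ready to be registered with `ado create actuatorconfiguration`.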
Tip
You can create multiple actuator configurations corresponding to different target environments. You choose the one to use when you launch an operation requiring the actuator.
Define the configurations to test¶
When exploring vLLM deployments there are two sets of parameters that can be changed:
- the deployment creation parameters (number of GPUs, memory allocated, etc.)
- the benchmark test parameters (requests per second to send, tokens per request, etc.)
In this case we define a space that explores the impact of a few vLLM deployment parameters, including max_num_seq and max_batch_tokens, for a scenario where requests arrive at a rate of 1 to 10 per second with sizes of around 2000 tokens.
entitySpace:
  - identifier: model
    propertyDomain:
      values:
        - ibm-granite/granite-3.3-8b-instruct
  - identifier: image
    propertyDomain:
      values:
        - quay.io/dataprep1/data-prep-kit/vllm_image:0.1
  - identifier: "number_input_tokens"
    propertyDomain:
      values: [1024, 2048, 4096]
  - identifier: "request_rate"
    propertyDomain:
      domainRange: [1, 10]
      interval: 1
  - identifier: n_cpus
    propertyDomain:
      domainRange: [2, 16]
      interval: 2
  - identifier: memory
    propertyDomain:
      values: ["128Gi", "256Gi"]
  - identifier: "max_batch_tokens"
    propertyDomain:
      values: [8192, 16384, 32768]
  - identifier: "max_num_seq"
    propertyDomain:
      values: [32, 64]
  - identifier: "n_gpus"
    propertyDomain:
      values: [1]
  - identifier: "gpu_type"
    propertyDomain:
      values: ["NVIDIA-A100-80GB-PCIe"]
experiments:
  - actuatorIdentifier: vllm_performance
    experimentIdentifier: test-deployment-v1
metadata:
  description: A space of vllm deployment configurations
  name: vllm_deployments
Save the above as vllm_deployment_space.yaml. Then run:
ado create space -f vllm_deployment_space.yaml
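Before launching an exploration it is worth estimating how large this space is. A back-of-the-envelope count, assuming domainRange is inclusive of both endpoints (so request_rate has 10 values and n_cpus has 8):

```shell
# Rough size of the space defined above, taking the value counts from the YAML.
# Assumption: domainRange includes both endpoints, so
#   request_rate [1,10] step 1  -> 10 values
#   n_cpus       [2,16] step 2  ->  8 values
benchmarks=$(( 3 * 10 ))          # number_input_tokens x request_rate
deployments=$(( 8 * 2 * 3 * 2 ))  # n_cpus x memory x max_batch_tokens x max_num_seq
echo "$deployments deployments x $benchmarks benchmark points = $(( deployments * benchmarks )) entities"
```

Under that assumption the space contains 2880 entities spread over 96 distinct deployment configurations, which is why the grouped sampling strategy below matters.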
Explore the space with random_walk¶
Next, we'll scan this space sequentially using a grouped sampler to increase efficiency. The grouped sampler ensures we explore all the different benchmark configurations for a given vLLM deployment before creating a new deployment, minimizing the number of deployment creations.
Save the following as random_walk_operation_grouped.yaml:
metadata:
  name: randomwalk-grouped-vllm-performance-full
spaces:
  - <Will be set via ado>
actuatorConfigurationIdentifiers:
  - <Will be set via ado>
operation:
  module:
    moduleClass: RandomWalk
  parameters:
    numberEntities: all
    batchSize: 1
    samplerConfig:
      mode: 'sequentialgrouped'
      samplerType: 'generator'
      grouping: # A unique combination of these properties is a new vLLM deployment
        - model
        - image
        - memory
        - max_batch_tokens
        - max_num_seq
        - n_gpus
        - gpu_type
        - n_cpus
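The ordering that grouped sampling produces can be sketched with a toy loop (this is an illustration, not ado code): every benchmark configuration for one deployment runs back-to-back before the next deployment is created.

```shell
# Toy illustration of grouped ordering: 2 hypothetical deployments (A, B)
# and 3 request rates. All rates for deployment A run before B is created,
# so each deployment is created exactly once.
for deployment in A B; do
  for request_rate in 1 2 3; do
    echo "deployment=$deployment request_rate=$request_rate"
  done
done
```

With a non-grouped random sampler, by contrast, consecutive samples can land in different deployment groups, forcing repeated deployment creation and teardown.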
Then, start the operation with:
ado create operation -f random_walk_operation_grouped.yaml \
--use-latest space --use-latest actuatorconfiguration
As the operation runs, a table of results is updated live in the terminal.
Monitor the operation¶
While the operation is running, you can monitor the deployments it creates:
# In a separate terminal
oc get deployments --watch -n vllm-testing
You can also get the results table by executing (in another terminal):
ado show entities operation --use-latest
Check final results¶
When the output indicates that the operation has finished, you can inspect the results of all operations run so far on the space with:
ado show entities space --output-format csv --use-latest
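The CSV export can be post-processed with standard tools. The sketch below uses a hand-written stand-in for the exported file, and the column names in it are hypothetical; check the header of your actual export before adapting it.

```shell
# Sketch: find the best-performing configurations in an exported CSV.
# results.csv here is a stand-in for the output of
#   ado show entities space --output-format csv --use-latest > results.csv
# and the column names (max_num_seq, max_batch_tokens, throughput) are
# hypothetical placeholders for the real header.
printf 'max_num_seq,max_batch_tokens,throughput\n32,8192,95.1\n64,16384,131.7\n32,16384,110.4\n' > results.csv

head -n 1 results.csv                                   # keep the header
tail -n +2 results.csv | sort -t, -k3 -nr | head -n 3   # top rows by column 3
```

This prints the header followed by the rows sorted by the third column in descending order.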
Note
At any time after an operation $OPERATION_ID has finished, you can run ado show entities operation $OPERATION_ID to see the sampling time series of that operation.
Next steps¶
- Try varying max_batch_tokens or gpu_memory_utilization to explore the impact on throughput.
- Try creating a different actuatorconfiguration with more max_environments and running the random walk with a non-grouped sampler.
- Replace the model with a different HF checkpoint to compare performance.
- Use RayTune (see the vLLM endpoint performance example) to optimise the hyper-parameters of the benchmark.
- Run the exploration on the OpenShift/Kubernetes cluster you create the deployments on, so you don't have to keep your laptop open.
- Check the vllm_performance actuator documentation.