Exploring vLLM deployment configurations

The scenario

In this example, the vllm_performance actuator is used to evaluate different vLLM server deployment configurations on Kubernetes/OpenShift.

When deploying vLLM, you must choose values for parameters like GPU type, batch size, and memory limits. These choices directly affect performance, cost, and scalability. To find the best configuration for your workload, whether you are optimizing for latency, throughput, or cost, you need to explore the deployment parameter space. In this example:

  • Define a space of vLLM deployment configurations to test with the vllm_performance actuator's performance_testing_full experiment
    • This experiment can create and characterize a vLLM deployment on Kubernetes
  • Use the random_walk operator to explore the space

Prerequisites

  • Be logged-in to your Kubernetes/OpenShift cluster
  • Have access to a namespace where you can create vLLM deployments
  • Install the following Python packages locally:
pip install ado-vllm-performance

TL;DR

Get the files vllm_deployment_space.yaml, vllm_actuator_configuration.yaml and random_walk_operation_grouped.yaml from our repository.

You must edit vllm_actuator_configuration.yaml with your details. In particular the following two fields are important:

hf_token: <your HuggingFace access token> # Required to access gated models
namespace: vllm-testing # you MUST set this to a namespace where you can create vLLM deployments

Then, in a directory with these files, execute:

# Define the configurations to explore
ado create space -f vllm_deployment_space.yaml
# Create a configuration for the actuator - normally just once as it can be reused
ado create actuatorconfiguration -f vllm_actuator_configuration.yaml
# Explore!
ado create operation -f random_walk_operation_grouped.yaml --use-latest space --use-latest actuatorconfiguration

See configuring the vllm_performance actuator for more configuration options.

Verify the installation

Verify the installation with:

ado get actuators --details 

The actuator vllm_performance should appear in the list of available actuators if installation completed successfully.

Create an actuator configuration

The vllm_performance actuator needs some information about the target cluster to deploy to. This is provided via an actuatorconfiguration resource.

First execute:

ado template actuatorconfiguration --actuator-identifier vllm_performance -o vllm_actuator_configuration.yaml

This creates a file called vllm_actuator_configuration.yaml.

Edit the file and set correct values for at least the namespace field. Also consider whether you need to supply a value for hf_token:

hf_token: <your HuggingFace access token> # Required to access gated models
namespace: vllm-testing # you MUST set this to a namespace where you can create vLLM deployments

Then save this configuration as an actuatorconfiguration resource:

ado create actuatorconfiguration -f vllm_actuator_configuration.yaml

Tip

You can create multiple actuator configurations corresponding to different target environments. You choose the one to use when you launch an operation requiring the actuator.

Define the configurations to test

When exploring vLLM deployments there are two sets of parameters that can be changed:

  • the deployment creation parameters (number of GPUs, memory allocated, etc.)
  • the benchmark test parameters (requests per second to send, tokens per request, etc.)

In this case we define a space where we look at the impact of a few vLLM deployment parameters, including max_num_seq and max_batch_tokens, for a scenario where requests arrive between 1 and 10 per second with sizes around 2000 tokens.

entitySpace:
  - identifier: model
    propertyDomain:
      values:
        - ibm-granite/granite-3.3-8b-instruct
  - identifier: image
    propertyDomain:
      values:
        - quay.io/dataprep1/data-prep-kit/vllm_image:0.1
  - identifier: "number_input_tokens"
    propertyDomain:
      values: [1024, 2048, 4096]
  - identifier: "request_rate"
    propertyDomain:
      domainRange: [1,10]
      interval: 1
  - identifier: n_cpus
    propertyDomain:
      domainRange: [2,16]
      interval: 2
  - identifier: memory
    propertyDomain:
      values: ["128Gi", "256Gi"]
  - identifier: "max_batch_tokens"
    propertyDomain:
      values: [8192, 16384, 32768]
  - identifier: "max_num_seq"
    propertyDomain:
      values: [32,64]
  - identifier: "n_gpus"
    propertyDomain:
      values: [1]
  - identifier: "gpu_type"
    propertyDomain:
      values: ["NVIDIA-A100-80GB-PCIe"]
experiments:
  - actuatorIdentifier: vllm_performance
    experimentIdentifier: test-deployment-v1
metadata:
  description: A space of vllm deployment configurations
  name: vllm_deployments

Save the above as vllm_deployment_space.yaml. Then run:

ado create space -f vllm_deployment_space.yaml
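
Before launching an exploration, it can help to know how many configurations the space contains. The sketch below recomputes the size of the space defined above in plain Python, assuming the domainRange endpoints are inclusive (so request_rate yields 10 values and n_cpus yields 8); check the ado documentation if your results suggest otherwise.

```python
# Cardinality of each dimension in vllm_deployment_space.yaml.
# Assumption: domainRange endpoints are inclusive.
space = {
    "model": ["ibm-granite/granite-3.3-8b-instruct"],
    "image": ["quay.io/dataprep1/data-prep-kit/vllm_image:0.1"],
    "number_input_tokens": [1024, 2048, 4096],
    "request_rate": list(range(1, 11)),       # domainRange [1,10], interval 1
    "n_cpus": list(range(2, 17, 2)),          # domainRange [2,16], interval 2
    "memory": ["128Gi", "256Gi"],
    "max_batch_tokens": [8192, 16384, 32768],
    "max_num_seq": [32, 64],
    "n_gpus": [1],
    "gpu_type": ["NVIDIA-A100-80GB-PCIe"],
}

total = 1
for values in space.values():
    total *= len(values)
print(f"total configurations: {total}")
```

Under these assumptions the space contains 2880 points, which is worth knowing before asking the random walk to sample numberEntities: all.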

Explore the space with random_walk

Next, we'll scan this space sequentially using a grouped sampler to increase efficiency. The grouped sampler ensures we explore all the different benchmark configurations for a given vLLM deployment before creating a new deployment - minimizing the number of deployment creations.

Save the following as random_walk_operation_grouped.yaml:

metadata:
  name: randomwalk-grouped-vllm-performance-full
spaces:
  - <Will be set via ado>
actuatorConfigurationIdentifiers:
  - <Will be set via ado>
operation:
  module:
    moduleClass: RandomWalk
  parameters:
    numberEntities: all
    batchSize: 1
    samplerConfig:
      mode: 'sequentialgrouped'
      samplerType: 'generator'
      grouping: # A unique combination of these properties is a new vLLM deployment
        - model
        - image
        - memory
        - max_batch_tokens
        - max_num_seq
        - n_gpus
        - gpu_type
        - n_cpus
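
To see why grouping saves work, you can count how many distinct deployments the grouping keys produce versus how many benchmark runs each deployment serves. This small sketch uses the dimension cardinalities from vllm_deployment_space.yaml (assuming inclusive domainRange endpoints):

```python
# Number of values per dimension, from vllm_deployment_space.yaml
sizes = {
    "model": 1, "image": 1, "number_input_tokens": 3, "request_rate": 10,
    "n_cpus": 8, "memory": 2, "max_batch_tokens": 3, "max_num_seq": 2,
    "n_gpus": 1, "gpu_type": 1,
}
# Grouping keys from random_walk_operation_grouped.yaml: each unique
# combination of these corresponds to one vLLM deployment
grouping = ["model", "image", "memory", "max_batch_tokens",
            "max_num_seq", "n_gpus", "gpu_type", "n_cpus"]

deployments = 1
for key in grouping:
    deployments *= sizes[key]

benchmarks_per_deployment = 1
for key in sizes:
    if key not in grouping:
        benchmarks_per_deployment *= sizes[key]

print(deployments, benchmarks_per_deployment)
```

With this grouping, the sampler creates each of the 96 deployments once and runs all 30 benchmark variations (number_input_tokens x request_rate) against it, instead of potentially recreating a deployment for every sampled point.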

Then, start the operation with:

ado create operation -f random_walk_operation_grouped.yaml \
           --use-latest space --use-latest actuatorconfiguration

As the operation runs, a table of results is updated live in the terminal.

Monitor the operation

While the operation is running you can monitor the deployment:

# In a separate terminal, using the namespace from your actuator configuration
oc get deployments --watch -n vllm-testing

You can also get the results table by executing (in another terminal):

ado show entities operation --use-latest

Check final results

When the output indicates that the experiment has finished, you can inspect the results of all operations run so far on the space with:

ado show entities space --output-format csv --use-latest
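
If you redirect the CSV output to a file, you can analyze it with standard tools. The sketch below is a minimal example of picking the best-performing configuration from such an export; the column names here are assumptions for illustration, since the real columns depend on the properties measured by the test-deployment-v1 experiment.

```python
import csv
import io

# Hypothetical excerpt of the CSV produced by:
#   ado show entities space --output-format csv --use-latest > results.csv
# Real column names depend on the experiment's measured properties.
sample = """max_num_seq,max_batch_tokens,request_rate,throughput
32,8192,1,410.2
64,16384,4,655.9
32,32768,8,580.1
"""

rows = list(csv.DictReader(io.StringIO(sample)))
best = max(rows, key=lambda row: float(row["throughput"]))
print(best["max_num_seq"], best["max_batch_tokens"])
```

For a real export, replace the inline sample with open("results.csv") and the column names with those in your table.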

Note

At any time after an operation $OPERATION_ID has finished, you can run ado show entities operation $OPERATION_ID to see the sampling time series of that operation.

Next steps

  • Try varying max_batch_tokens or gpu_memory_utilization to explore the impact on throughput.
  • Try creating a different actuatorconfiguration with a higher max_environments value and running the random walk with a non-grouped sampler.
  • Replace the model with a different HF checkpoint to compare performance.
  • Use RayTune (see the vLLM endpoint performance example) to optimize the hyperparameters of the benchmark.
  • Run the exploration on the OpenShift/Kubernetes cluster you create the deployments on, so you don't have to keep your laptop open.
  • Check the vllm_performance actuator documentation.