Testing the throughput of an inference endpoint
The scenario
In this example, the vllm_performance actuator is used to find the highest request rate (requests per second) a server can handle while its throughput stays stable at its maximum.
A model deployed for inference has a maximum stable throughput, measured as the number of requests it can serve per second. Sending requests faster than this will often cause throughput to drop. Knowing where this maximum lies lets you maintain it reliably, e.g. by limiting the maximum number of concurrent requests.
To explore this space, you will:
- define an endpoint, model and range of requests per second to test
- use an optimizer to efficiently find the maximum requests per second
Prerequisites
- An endpoint serving an LLM via an OpenAI API-compatible API
- Install the following Python packages:
pip install hyperopt
pip install ado-ray-tune
pip install ado-vllm-performance
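Before running any experiments it can help to confirm that the endpoint really does speak the OpenAI API. The sketch below is one way to do that from Python. It assumes the server exposes the standard `GET /v1/models` route at the same base URL you will later put in the endpoint field. The helper names (`models_url`, `parse_model_ids`, `served_models`) are hypothetical and are not part of ado.

```python
import json
import urllib.request


def models_url(endpoint: str) -> str:
    """Build the model-listing URL for an OpenAI-compatible server."""
    return endpoint.rstrip("/") + "/v1/models"


def parse_model_ids(payload: str) -> list[str]:
    """Extract the model ids from a /v1/models JSON response body."""
    return [entry["id"] for entry in json.loads(payload).get("data", [])]


def served_models(endpoint: str, timeout: float = 5.0) -> list[str]:
    """Query the endpoint and return the ids of the models it serves."""
    with urllib.request.urlopen(models_url(endpoint), timeout=timeout) as resp:
        return parse_model_ids(resp.read().decode("utf-8"))


# Example (needs a running server, so it is left commented out):
# print(served_models("http://localhost:8000"))
```

If the model you intend to test appears in the returned list, its id is most likely the value to use for the model field of the space.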
TL;DR
Get the files vllm_request_rate_space.yaml and operation_hyperopt.yaml from our repository.
- vllm_request_rate_space.yaml: this file defines the endpoint, model, and request range to explore.
  - You must edit the model and endpoint fields in this file to match your own.
- operation_hyperopt.yaml: this file contains the optimization parameters. You do not need to edit it.
Then, in a directory with these files, execute:
: # Define the set of request rates to explore
ado create space -f vllm_request_rate_space.yaml
: # Run the optimization!
ado create operation -f operation_hyperopt.yaml --use-latest space
Verify the installation
Execute:
ado get actuators --details
If the prerequisites (see above) have been installed correctly, the actuator vllm_performance will appear in the list of available actuators.
Define the request rates to test
The file vllm_request_rate_space.yaml defines a space with all request rates from 10 to 100 for an endpoint serving gpt-oss-20b. Its contents are:
# Example discovery space for vLLM performance
entitySpace:
  - identifier: model
    propertyDomain:
      values:
        - openai/gpt-oss-20b
  - identifier: endpoint
    propertyDomain:
      values:
        - http://localhost:8000
  - identifier: request_rate
    propertyDomain:
      domainRange: [10, 100]
      interval: 1
experiments:
  - actuatorIdentifier: vllm_performance
    experimentIdentifier: test-endpoint-v1
Create the space with:
ado create space -f vllm_request_rate_space.yaml
Note
More complex discovery spaces can be created, for example by also including the number of input tokens as a dimension. See Next Steps.
Use hyperopt to find the best input request rate
Hyperopt uses Tree-structured Parzen Estimators (TPE), a Bayesian approach that is expected to work well for discrete dimensions and noisy metrics. Both apply here: request_rate is discrete and request_throughput is noisy.
The file operation_hyperopt.yaml defines an optimization that will look for points (in this case request rates) that result in a request_throughput within the top 25% of measured values, as set by gamma. The file's contents are:
spaces:
  - <will be filled by ado>
operation:
  module:
    operatorName: "ray_tune"
    operationType: "search"
  parameters:
    tuneConfig:
      metric: "request_throughput" # The metric to optimize
      mode: 'max'
      num_samples: 16
      search_alg:
        name: hyperopt
        n_initial_points: 8 # Number of points to sample before optimizing
        gamma: 0.25 # The top gamma fraction of measured values are considered "good"
Create the operation with:
ado create operation -f operation_hyperopt.yaml --use-latest space
Results will appear as they are measured.
Note
Hyperopt samples with replacement, so you may see the same point sampled more than once. The likelihood of this increases as the number of points in the space decreases.
Monitor the optimization
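To get a feel for how quickly repeats become likely, the sketch below computes the probability of at least one repeated point when sampling uniformly with replacement. This is only a rough guide: uniform sampling is an assumption that roughly matches the initial random points, and once TPE starts concentrating on a region repeats become more likely than this estimate suggests. The space sizes used are illustrative.

```python
def prob_duplicate(num_points: int, num_samples: int) -> float:
    """Probability that uniform sampling with replacement repeats a point."""
    if num_samples > num_points:
        return 1.0  # pigeonhole: a repeat is guaranteed
    p_all_distinct = 1.0
    for i in range(num_samples):
        # The (i+1)-th draw must avoid the i points already drawn.
        p_all_distinct *= (num_points - i) / num_points
    return 1.0 - p_all_distinct


# Compare a few illustrative space sizes for num_samples = 16.
for size in (1000, 100, 20):
    print(f"{size:>5} points, 16 samples: {prob_duplicate(size, 16):.2f}")
```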
You can see the measurement requests as the operation runs by executing (in another terminal):
ado show requests operation --use-latest
and the results (this outputs the entities in sampled order):
ado show entities operation --use-latest
Instead of --use-latest you can also supply the operation id directly.
Check final results
When the output indicates that the experiment has finished, you can inspect the results of all operations run so far on the space with:
ado show entities space --output-format csv --use-latest
Note
At any time after an operation, $OPERATION_ID, has finished, you can run ado show entities operation $OPERATION_ID to see the sampling time-series of that operation.
Some notes on hyperopt and TPE
As the search proceeds, you should observe that hyperopt begins to prefer sampling points in the region where the maximum throughput is stable, even if it has seen better values in "unstable" regions.
Important
Do not just take the best point found by hyperopt. Look at where it was focusing its attention, i.e. which region it sampled most often in the later part of the search.
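One simple way to see where the search concentrated is to bin the request rates in the order they were sampled and count only those chosen after the initial random points. The sketch below assumes you have already extracted the sampled request rates into a Python list (for example from the entities output). The function name, bin width, and example values are hypothetical and do not come from a real run.

```python
from collections import Counter


def attention_histogram(sampled_rates, n_initial_points, bin_width=10):
    """Count model-guided samples per request-rate bin.

    The first n_initial_points samples are drawn before optimizing,
    so they are skipped when looking for the region of focus.
    """
    guided = sampled_rates[n_initial_points:]
    counts = Counter((rate // bin_width) * bin_width for rate in guided)
    return dict(sorted(counts.items()))


# Illustrative values only, NOT real measurements.
rates_in_sample_order = [12, 87, 45, 63, 30, 95, 71, 22,
                         41, 44, 38, 47, 42, 36, 43, 40]
print(attention_histogram(rates_in_sample_order, n_initial_points=8))
```

A bin that collects most of the guided samples is a candidate for the stable region, even if the single best measurement lies elsewhere.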
TPE builds models of where the "good" and "bad" regions of the discovery space are, i.e. P(x|good) and P(x|bad), where x is an input point. It then chooses new points to test by maximizing P(x|good)/P(x|bad).
This makes TPE robust to noise in request_throughput: it is not trying to find where the maximum is, but which request rates are most likely to give high throughput (defined above as throughput in the top 25% of measured values). This also makes it robust to outliers.
Issues may arise if the optimal region is not sampled in the initial points and is disjoint from other regions with "good" performance. As the search runs, it will be directed towards where it has already seen good values, and the best region is unlikely to be visited.
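The idea can be illustrated with a toy version of the selection step. This is a minimal sketch and not hyperopt's implementation: hyperopt uses its own adaptive Parzen estimators and its own rule for splitting observations. Here the "good" set is simply the top gamma fraction of observations, the densities are unnormalized Gaussian kernel sums, and `bandwidth`, `prior`, and the example observations are all hypothetical.

```python
import math


def tpe_suggest(observations, candidates, gamma=0.25, bandwidth=5.0, prior=1.0):
    """Return the candidate maximizing P(x|good) / P(x|bad) (toy version).

    observations: list of (x, score) pairs, where a higher score is better.
    """
    ranked = sorted(observations, key=lambda obs: obs[1], reverse=True)
    n_good = max(1, math.ceil(gamma * len(ranked)))
    good = [x for x, _ in ranked[:n_good]]
    bad = [x for x, _ in ranked[n_good:]]

    def density(x, points):
        # Gaussian kernel sum plus a flat prior so the ratio is always defined.
        kernel = sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points)
        return (kernel + prior) / (len(points) + prior)

    return max(candidates, key=lambda x: density(x, good) / density(x, bad))


# Illustrative (request_rate, request_throughput) pairs, NOT real measurements.
# 90 is a lone high outlier; 40-44 is a cluster of consistently good points.
observations = [(40, 9.0), (42, 9.1), (44, 8.9), (90, 9.5),
                (10, 3.0), (15, 4.0), (20, 5.0), (25, 6.0),
                (55, 7.0), (60, 6.5), (65, 6.0), (70, 5.5),
                (75, 5.0), (80, 4.5), (85, 4.2), (95, 3.5)]
print(tpe_suggest(observations, candidates=range(10, 101)))
```

Although the single best score is at 90, that point is surrounded by poor results, so the ratio there is low. The next suggestion falls inside the cluster of good points, which is the behaviour described above.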
Tip
The number of samples hyperopt uses for its first guess of the "good" region is n_initial_points * gamma, which is 8 * 0.25 = 2 in the configuration above. The other 6 initial samples are used for the "bad" region.
Next steps
- Use ado describe experiment vllm_performance_endpoint to see what other parameters can be explored
- Try varying burstiness or number_input_tokens, or adding them as dimensions of the entityspace, to explore their impact on throughput
- Try varying the num_samples, gamma and n_initial_points parameters of hyperopt
  - You can keep running the optimization on the same discoveryspace. The previous runs will not influence new runs, but their results will be reused, speeding experimentation up
- Measure the performance of vLLM deployment configurations
- Check the vllm_performance actuator documentation