Testing the throughput of an inference endpoint¶
Note
This example illustrates using the vllm_performance actuator to test the throughput of an OpenAI API-compatible inference endpoint
Important
Prerequisites
- An endpoint serving an LLM in an OpenAI API-compatible format
- The ray_tune operator with hyperopt installed:
pip install ado-ray-tune
pip install hyperopt
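If you do not already have such an endpoint, one way to create one, assuming you have vLLM installed, is to serve the model locally (the model name and port below are illustrative and match the discoveryspace defined later):
vllm serve openai/gpt-oss-20b --port 8000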
The scenario¶
A model deployed for inference has a maximum stable throughput in terms of the requests it can serve per second. Sending requests at a higher rate than this maximum will often cause throughput to drop. It is therefore useful to know what this maximum is, so that peak throughput can be reliably maintained, e.g. by limiting the maximum number of concurrent requests.
In this example, the vllm_performance actuator is used to find the maximum requests per second a server can handle while maintaining stable maximum throughput.
To explore this space, you will:
- define an endpoint, model and range of requests per second to test
- use an optimizer to efficiently find the maximum requests per second
Install the actuator¶
Execute:
pip install -e plugins/actuators/vllm_performance
in the root of the ado source repository. You can clone the repository with:
git clone https://github.com/IBM/ado.git
Verify the installation with:
ado get actuators --details
The actuator vllm_performance will appear in the list of available actuators.
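You can also sanity-check that the endpoint to be tested is reachable. For an OpenAI API-compatible server, the model list route should respond; for example:
curl http://localhost:8000/v1/models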
Define the request rates to test¶
This discoveryspace includes all request rates from 10 to 100 for an endpoint serving gpt-oss-20b:
# Example discovery space for vLLM performance
sampleStoreIdentifier: <sample_store_id>
entitySpace:
- identifier: model
  propertyDomain:
    values:
    - openai/gpt-oss-20b
- identifier: endpoint
  propertyDomain:
    values:
    - http://localhost:8000
- identifier: request_rate
  propertyDomain:
    domainRange: [10, 100]
    interval: 1
experiments:
- actuatorIdentifier: vllm_performance
  experimentIdentifier: performance-testing-endpoint
Save the above as vllm_discoveryspace.yaml. Then, if you have an existing samplestore, run:
ado create space -f vllm_discoveryspace.yaml --set sampleStoreIdentifier=$SAMPLE_STORE_ID
otherwise create a new one:
ado create space -f vllm_discoveryspace.yaml --new-sample-store
Record the identifier of the created discoveryspace as it will be used in the next section.
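For convenience in the commands that follow, you can keep the identifier in an environment variable (the value below is a placeholder):
export DISCOVERY_SPACE_ID=<discovery_space_id>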
Note
More complex discoveryspaces can be created, for example also including the number of input tokens. See Next Steps.
Use hyperopt to find the best input request rate¶
Hyperopt uses the Tree-structured Parzen Estimator (TPE), a Bayesian approach that is expected to work well for discrete dimensions and noisy metrics, both of which we have here: request_rate is discrete and request_throughput is noisy.
The following operation will look for points (in this case request_rate values) that lead to a request_throughput in the top 25%, as set by gamma below:
spaces:
- space-ccf2bf-a50274 # substitute with your space id or override when running
operation:
  module:
    operatorName: "ray_tune"
    operationType: "search"
  parameters:
    tuneConfig:
      metric: "request_throughput" # the metric to optimize
      mode: "max"
      num_samples: 16
      search_alg:
        name: hyperopt
        n_initial_points: 8 # number of points to sample before optimizing
        gamma: 0.25 # the top gamma fraction of measured values are considered "good"
Save the above as hyperopt.yaml. Then create the operation:
ado create operation -f hyperopt.yaml --set "spaces[0]=$DISCOVERY_SPACE_ID"
where $DISCOVERY_SPACE_ID is the identifier of the discoveryspace you created in the previous step.
Note
Hyperopt samples with replacement, so you may see the same point sampled twice. The likelihood of this increases as the number of points in the space decreases.
Monitor the optimization¶
You can see the measurement requests as the operation runs by executing (in another terminal):
ado show requests operation $OPERATION_ID
and the results (this outputs the entities in sampled order):
ado show entities operation $OPERATION_ID
While the operation is running, $OPERATION_ID is the identifier that was output just before the sampling started. Assuming no other operation has been started since, it will also be the last id output by:
ado get operations
Check final results¶
When the output indicates that the experiment has finished, you can inspect the results of all operations run so far on the space with:
ado show entities space $DISCOVERY_SPACE_ID --output-format csv
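To keep a copy of the results for offline analysis, you can redirect the CSV output to a file, for example:
ado show entities space $DISCOVERY_SPACE_ID --output-format csv > results.csv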
Note
At any time after an operation $OPERATION_ID has finished, you can run ado show entities operation $OPERATION_ID to see the sampling time-series of that operation.
Some notes on hyperopt and TPE¶
What you should observe is that, as the search proceeds, hyperopt will start to prefer sampling points in the region with the stable maximum, even if it has seen better individual values in "unstable" regions.
Important
Do not just take the best point found by hyperopt; look at where it was focusing its attention.
TPE builds models of where the "good" and "bad" regions of the discovery space are, i.e. P(x|good) and P(x|bad), where x is an input point. It then chooses new points to test by maximizing the ratio P(x|good)/P(x|bad).
This makes TPE robust to noise in request_throughput: it is not trying to find where the maximum is, but rather the request_rates that are most likely to give high throughput (defined above as throughput in the top 25%). This also makes it robust to outliers.
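To make this selection rule concrete, here is a toy sketch in Python. It is not hyperopt's implementation: the observations are synthetic, and the peak location, noise level, density estimate and bandwidth are all made up for illustration.
# Toy sketch of the TPE selection rule (illustration only, synthetic data)
import numpy as np

rng = np.random.default_rng(0)

# Pretend these request rates have already been measured...
observed_x = rng.uniform(10, 100, size=16)
# ...with noisy throughputs that peak around a rate of 40 (made-up shape)
observed_y = 50.0 - np.abs(observed_x - 40.0) + rng.normal(0.0, 3.0, size=16)

# The top gamma fraction of observations are labelled "good", the rest "bad"
gamma = 0.25
threshold = np.quantile(observed_y, 1.0 - gamma)
good = observed_x[observed_y >= threshold]
bad = observed_x[observed_y < threshold]

# Crude Gaussian-kernel density estimates standing in for P(x|good), P(x|bad)
def density(samples, x, bandwidth=5.0):
    return np.mean(np.exp(-0.5 * ((x[:, None] - samples[None, :]) / bandwidth) ** 2), axis=1)

candidates = np.arange(10.0, 101.0)
ratio = density(good, candidates) / (density(bad, candidates) + 1e-12)

# TPE proposes the candidate that maximizes P(x|good) / P(x|bad)
print("next request_rate to try:", candidates[np.argmax(ratio)])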
Problems can arise if the best region is not sampled in the initial points and this region is disjoint from other regions with "good" performance. As the search runs it will be directed towards where it has already seen good values, so the best region is unlikely to be visited.
Tip
The number of samples hyperopt will use for its first guess of the "good" region is n_initial_points * gamma (2 in the case above); the rest will be used for the "bad" region.
Next steps¶
- Use ado describe experiment vllm_performance_endpoint to see what other parameters can be explored
- Try varying burstiness or number_input_tokens, or adding them as dimensions of the entitySpace, to explore their impact on throughput (see the sketch after this list)
- Try varying the num_samples, gamma and n_initial_points parameters of hyperopt
- You can keep running the optimization on the same discoveryspace. The previous runs will not influence new runs, but their results will be reused, speeding up experimentation
- Measure the performance of vLLM deployment configurations
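As a sketch of adding an input-token dimension, you might extend the entitySpace from the discoveryspace YAML earlier. The parameter name number_input_tokens is taken from the bullet above and the values are illustrative; check ado describe experiment for the actual parameter name and allowed domain:
- identifier: number_input_tokens
  propertyDomain:
    values:
    - 128
    - 512
    - 1024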