Rate Limits

Note

Supported since version 1.2.10

A rate limit is the maximum number of API calls per second that a service instance accepts. In IBM Cloud services, the rate limit varies by instance.
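The following minimal client-side sketch makes the definition concrete. It enforces a per-second limit using a sliding one-second window. It is an illustration only and does not show how IBM Cloud or the ibm-watsonx-ai client enforces limits. The class name PerSecondLimiter and the injectable clock are hypothetical.

```python
import time
from collections import deque
from typing import Callable, Deque


class PerSecondLimiter:
    """Allow at most `limit` calls in any one-second window (hypothetical helper)."""

    def __init__(self, limit: int, clock: Callable[[], float] = time.monotonic) -> None:
        self.limit = limit
        self.clock = clock
        self.calls: Deque[float] = deque()  # timestamps of calls inside the window

    def try_acquire(self) -> bool:
        now = self.clock()
        # Forget calls that are at least one second old.
        while self.calls and now - self.calls[0] >= 1.0:
            self.calls.popleft()
        if len(self.calls) < self.limit:
            self.calls.append(now)
            return True
        return False  # over the limit; a server would answer 429 here


# Usage with a fake clock so the example runs instantly.
now = [0.0]
limiter = PerSecondLimiter(limit=2, clock=lambda: now[0])
print([limiter.try_acquire() for _ in range(3)])  # [True, True, False]
now[0] = 1.0  # one second later, the earlier calls leave the window
print(limiter.try_acquire())  # True
```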

Use Cases for Rate Limiting

The main purposes of rate limiting include:

Enforcing granular access control to resources.
Managing load to maintain a smooth and consistent user experience.
Protecting REST APIs from resource exhaustion (e.g., targeted DDoS attacks) and preventing general abuse.

Handling Rate Limits

When the limit is exceeded, the service returns an HTTP 429 Too Many Requests error response. The response headers also contain information about the limit.
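You may call the REST API directly instead of through the client. In that case, you can detect a 429 response as in this minimal sketch, which uses only the Python standard library (urllib). The function name get_with_429_check and the injectable opener are hypothetical. This section does not name the headers that describe the limit, so the sketch returns all headers, and the header in the usage example is made up.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict, Optional, Tuple


def get_with_429_check(
    url: str,
    opener: Callable = urllib.request.urlopen,
) -> Tuple[int, Optional[bytes], Dict[str, str]]:
    """Return (status, body, headers); a 429 is reported instead of raised."""
    try:
        with opener(url) as resp:
            return resp.status, resp.read(), dict(resp.headers)
    except urllib.error.HTTPError as err:
        if err.code == 429:
            # Too Many Requests: skip the body but keep the headers,
            # since they carry information about the limit.
            return err.code, None, dict(err.headers or {})
        raise  # other HTTP errors are not rate-limit related


def fake_open(url):
    # Simulates a rate-limited server; the header name is hypothetical.
    raise urllib.error.HTTPError(url, 429, "Too Many Requests", {"X-Example-Limit": "2"}, None)


print(get_with_429_check("https://example.invalid", opener=fake_open))
# (429, None, {'X-Example-Limit': '2'})
```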

Mitigation Strategies

The retry settings described below can be configured for the following classes: ModelInference and Embeddings.

The ibm-watsonx-ai client includes a built-in retry mechanism and traffic optimization. Together they reduce excessive requests and respect the rate limit of each instance. It is recommended to start with the default settings. However, instance usage can be hard to predict, so the retry mechanism provides three configurable options for handling unsuccessful requests:

max_retries – The maximum number of retry attempts made when a status code listed in retry_status_codes is received. Defaults to 10.
delay_time – The base delay, in seconds, for exponential backoff. The wait before a retry is min(delay_time * pow(2.0, attempt), MAX_RETRY_DELAY). Defaults to 0.5, and MAX_RETRY_DELAY is 8 seconds.
retry_status_codes – The list of HTTP status codes that trigger a retry. Defaults to [429, 503, 504, 520]. A 429 Too Many Requests response is handled differently: the request is retried as soon as the next slot becomes available. The other status codes follow the exponential backoff strategy.
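Plugging the defaults into the delay_time formula gives the backoff schedule below. This sketch assumes the attempt counter starts at 0 for the first retry, which this section does not state. backoff_delay is a hypothetical helper, not part of ibm-watsonx-ai.

```python
MAX_RETRY_DELAY = 8.0  # seconds, as stated above


def backoff_delay(attempt: int, delay_time: float = 0.5) -> float:
    """Wait before a retry, assuming `attempt` starts at 0 (hypothetical helper)."""
    return min(delay_time * pow(2.0, attempt), MAX_RETRY_DELAY)


print([backoff_delay(a) for a in range(6)])  # [0.5, 1.0, 2.0, 4.0, 8.0, 8.0]
print(sum(backoff_delay(a) for a in range(10)))  # 55.5
```

Under the same assumption, suppose all ten default retries used exponential backoff, meaning none of the responses was a 429. The waits would then add up to 55.5 seconds.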

Summary

The default values should handle requests without additional configuration while avoiding excessive requests. Depending on traffic patterns, however, rate limiting may still interfere with your workload. For these cases, the retry mechanism retries 429 responses as soon as a slot is free. It also lets you customize max_retries, delay_time, and retry_status_codes.
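The loop below combines the three options into a rough model of the retry logic. It is a simplified sketch, not the client's actual implementation. send stands in for one HTTP request that returns a status code. wait_for_slot is a hypothetical hook for the client's slot-based 429 handling, and its placeholder default simply sleeps one second. The attempt counter is assumed to start at 0.

```python
import time
from typing import Callable, Iterable

MAX_RETRY_DELAY = 8.0  # seconds, the cap stated above


def send_with_retries(
    send: Callable[[], int],
    max_retries: int = 10,
    delay_time: float = 0.5,
    retry_status_codes: Iterable[int] = (429, 503, 504, 520),
    wait_for_slot: Callable[[], None] = lambda: time.sleep(1.0),  # placeholder
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call `send` until it returns a non-retryable status or retries run out."""
    codes = set(retry_status_codes)
    status = send()  # the initial request is not counted as a retry
    for attempt in range(max_retries):
        if status not in codes:
            return status
        if status == 429:
            # Hypothetical hook: wait until the next request slot is free.
            wait_for_slot()
        else:
            # Exponential backoff, capped at MAX_RETRY_DELAY.
            sleep(min(delay_time * pow(2.0, attempt), MAX_RETRY_DELAY))
        status = send()
    return status  # last status seen, possibly still a retryable error


# Usage with canned responses and a recorded (not real) sleep.
replies = iter([503, 429, 503, 200])
delays = []
print(send_with_retries(lambda: next(replies), wait_for_slot=lambda: None, sleep=delays.append))  # 200
print(delays)  # [0.5, 2.0]
```

Injecting sleep and wait_for_slot lets you exercise the sketch without real delays.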