AI Inference Server

Solution report card

  • Runs on IBM i? No (runs on Linux adjacent to IBM i)
  • On-prem: Yes
  • IBM Cloud
  • AI capabilities: Generative AI, Agentic AI
  • Commercial support: Yes (Red Hat subscription)
  • Free to try?
  • Requirements: OpenShift or a standalone container runtime; ppc64le images for IBM Power

Red Hat AI Inference Server (formerly known as vLLM-based inference) is Red Hat’s enterprise-supported solution for serving large language models at production scale. It provides a high-performance, OpenAI-compatible REST API for LLM inference, backed by Red Hat’s support and lifecycle commitments.

Built on the open-source vLLM engine, Red Hat AI Inference Server is optimized for throughput and latency at scale — using techniques like continuous batching and PagedAttention to maximize GPU (or CPU/accelerator) utilization.

IBM i applications can call the inference server’s OpenAI-compatible API using standard HTTP, SQL HTTP functions (for example, SYSTOOLS.HTTPPOSTCLOB or the newer QSYS2.HTTP_POST), or the Db2 for i AI SDK. This means IBM i can consume LLM capabilities from a supported, on-premises inference server without sending data to the cloud.
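
As a concrete illustration, here is a minimal sketch of such a call in Python (which IBM i can run in PASE). The hostname rhais.example.com and the model name are assumptions to replace with your own; port 8000 and the /v1/chat/completions path are vLLM's defaults. The same JSON body could equally be posted from SQL with SYSTOOLS.HTTPPOSTCLOB.

    import requests

    # POST an OpenAI-style chat completion request to the inference
    # server. Hostname, port, and model name are placeholders.
    resp = requests.post(
        "http://rhais.example.com:8000/v1/chat/completions",
        json={
            "model": "granite-3.1-8b-instruct",
            "messages": [
                {"role": "user", "content": "Classify this order note: late delivery, customer upset."}
            ],
            "max_tokens": 200,
        },
        timeout=60,
    )
    resp.raise_for_status()

    # The response follows the standard OpenAI chat-completions schema.
    print(resp.json()["choices"][0]["message"]["content"])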

Key benefits for IBM i scenarios:

  • On-premises deployment — Keep sensitive business data within your network
  • OpenAI-compatible API — Drop-in compatible with tools and code written for OpenAI; no proprietary lock-in (see the sketch after this list)
  • IBM Power support — Can run on IBM Power (Linux) infrastructure adjacent to IBM i
  • Enterprise support — Red Hat support contract covers the inference server

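The "drop-in" point above can be made concrete: code written against the official OpenAI Python client only needs its base_url pointed at the on-prem server. A minimal sketch, again assuming the hypothetical host rhais.example.com, vLLM's default port 8000, and a model name you would substitute:

    from openai import OpenAI

    # Point the standard OpenAI client at the on-prem inference server.
    # vLLM accepts any placeholder API key unless started with --api-key.
    client = OpenAI(
        base_url="http://rhais.example.com:8000/v1",
        api_key="not-needed",
    )

    completion = client.chat.completions.create(
        model="granite-3.1-8b-instruct",  # assumed model name
        messages=[{"role": "user", "content": "Hello from IBM i"}],
    )
    print(completion.choices[0].message.content)
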
Red Hat AI Inference Server is delivered as a container image, deployable on OpenShift or standalone container runtimes. It can run on IBM Power (via ppc64le container images) on a Linux partition adjacent to IBM i.
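
Once the container is running, one quick smoke test is to list the models the server exposes through the OpenAI-compatible GET /v1/models endpoint; as before, the hostname and port are assumptions:

    import requests

    # List the models served by the inference server (GET /v1/models,
    # part of the OpenAI-compatible API).
    r = requests.get("http://rhais.example.com:8000/v1/models", timeout=10)
    r.raise_for_status()
    for model in r.json()["data"]:
        print(model["id"])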

See also: Red Hat OpenShift AI for the broader MLOps platform that includes model training, pipelines, and governance alongside inference.