Gateway Tuning Guide¶
This page collects practical levers for squeezing the most performance, reliability, and observability out of MCP Gateway, no matter where you run the container (Code Engine, Kubernetes, Docker Compose, Nomad, etc.).
TL;DR
- Tune the runtime environment via `.env` and configure mcpgateway to use PostgreSQL and Redis.
- Adjust Gunicorn workers & time-outs in `gunicorn.conf.py`.
- Right-size CPU/RAM for the container or spin up more instances (with shared Redis state), and adjust the database settings (e.g. connection limits).
- Benchmark with `hey` (or your favourite load generator) before & after. See also: performance testing guide.
1 - Environment variables (.env)¶
| Variable | Default | Why you might change it |
|---|---|---|
| `AUTH_REQUIRED` | `true` | Disable for internal/behind-VPN deployments to shave a few ms per request. |
| `JWT_SECRET_KEY` | random | Longer key ➜ slower HMAC verify; still negligible, so leave as is. |
| `CACHE_TYPE` | `database` | Switch to `redis` or `memory` if your workload is read-heavy and latency-sensitive. |
| `DATABASE_URL` | SQLite | Move to managed PostgreSQL + connection pooling for anything beyond dev tests. |
| `HOST`/`PORT` | `0.0.0.0:4444` | Expose a different port or bind only to `127.0.0.1` behind a reverse proxy. |
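For a small production deployment, a tuned `.env` might look like the sketch below; the hostnames, credentials, and secret value are placeholders, not project defaults:

```bash
# Illustrative .env values - replace the placeholders with your own endpoints
AUTH_REQUIRED=true                  # keep auth on unless fully isolated behind a VPN
JWT_SECRET_KEY=replace-with-a-long-random-string
CACHE_TYPE=redis                    # read-heavy, latency-sensitive workloads
DATABASE_URL=postgresql://mcpgateway:CHANGE_ME@postgres:5432/mcpgateway
HOST=0.0.0.0
PORT=4444
```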
Redis Connection Pool Tuning¶
When using `CACHE_TYPE=redis`, tune the connection pool for your workload:
| Variable | Default | Tuning Guidance |
|---|---|---|
| `REDIS_MAX_CONNECTIONS` | 50 | Pool size per worker. Formula: `(concurrent_requests / workers) × 1.5` (worked example below). |
| `REDIS_SOCKET_TIMEOUT` | 2.0 | Lower (1.0 s) for high concurrency; Redis ops typically complete in <100 ms. |
| `REDIS_SOCKET_CONNECT_TIMEOUT` | 2.0 | Keep low to fail fast on network issues. |
| `REDIS_HEALTH_CHECK_INTERVAL` | 30 | Lower (15 s) in production to detect stale connections faster. |
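As a worked example of the pool-size formula (the request and worker counts are assumptions, not measurements):

```bash
# ~500 peak concurrent requests served by 8 Gunicorn workers:
#   (500 / 8) × 1.5 ≈ 94  → round up to 100, matching the profile below
REDIS_MAX_CONNECTIONS=100
```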
High-concurrency production settings:
REDIS_MAX_CONNECTIONS=100
REDIS_SOCKET_TIMEOUT=1.0
REDIS_SOCKET_CONNECT_TIMEOUT=1.0
REDIS_HEALTH_CHECK_INTERVAL=15
Tip: Any change here requires rebuilding or restarting the container if you pass the file with `--env-file`.
2 - Gunicorn settings (gunicorn.conf.py)¶
| Knob | Purpose | Rule of thumb |
|---|---|---|
| `workers` | Parallel processes | 2-4 × vCPU for CPU-bound work; fewer if memory-bound. |
| `threads` | Per-process threads | Use only with the sync worker; keeps memory low for I/O workloads. |
| `timeout` | Kill stuck worker | Set ≥ end-to-end model latency, e.g. 600 s for LLM calls. |
| `preload_app` | Load app once | Saves RAM; safe for pure-Python apps. |
| `worker_class` | Async workers | `gevent` or `eventlet` for many concurrent requests / websockets. |
| `max_requests` (+ `_jitter`) | Self-healing | Recycle workers to mitigate memory leaks. |
Edit the file before building the image, then redeploy.
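For orientation, here is a hedged command-line sketch of how those knobs map onto a Gunicorn invocation; the application module path and the numbers are assumptions, and the gateway itself reads these values from `gunicorn.conf.py`:

```bash
# Illustrative only - the module path is an assumption; tune workers/timeout
# to your CPU count and end-to-end model latency.
gunicorn "mcpgateway.main:app" \
  --worker-class uvicorn.workers.UvicornWorker \
  --workers 8 \
  --timeout 600 \
  --max-requests 1000 --max-requests-jitter 100 \
  --preload
```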
2b - Uvicorn Performance Extras¶
MCP Gateway uses `uvicorn[standard]`, which includes high-performance components that are automatically detected and used:
| Package | Purpose | Platform | Improvement |
|---|---|---|---|
| `uvloop` | Fast event loop (libuv-based, Cython) | Linux, macOS | 20-40% lower latency |
| `httptools` | Fast HTTP parsing (C extension) | All platforms | 40-60% faster parsing |
| `websockets` | Optimized WebSocket handling | All platforms | Better WS performance |
| `watchfiles` | Fast file watching for `--reload` | All platforms | Faster dev cycle |
Automatic Detection¶
When Gunicorn spawns Uvicorn workers, these components are automatically detected:
# Verify extras are installed
pip list | grep -E "uvloop|httptools|websockets|watchfiles"
# Expected output (Linux/macOS):
# httptools 0.6.x
# uvloop 0.21.x
# websockets 15.x.x
# watchfiles 1.x.x
Platform Notes¶
- Linux/macOS: Full performance benefits (uvloop + httptools)
- Windows: httptools provides benefits; uvloop unavailable (graceful fallback to asyncio)
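A quick runtime check that the speedups import cleanly (on Windows, expect the `uvloop` import to fail and Uvicorn to fall back to asyncio):

```bash
python -c "import uvloop, httptools; print('uvloop + httptools available')"
```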
Performance Impact¶
Combined improvements from uvloop and httptools:
| Workload | Improvement |
|---|---|
| Simple JSON endpoints | 15-25% faster |
| High-concurrency requests | 20-30% higher throughput |
| WebSocket connections | Lower latency, better handling |
| Development (`--reload`) | Faster file change detection |
Note: These optimizations are transparent; no code or configuration changes are needed.
2c - Granian (Alternative HTTP Server)¶
MCP Gateway supports two HTTP servers:

- Gunicorn + Uvicorn (default) - battle-tested, mature, excellent stability
- Granian (alternative) - Rust-based, native HTTP/2, lower memory footprint
Usage¶
# Local development
make serve # Gunicorn + Uvicorn (default)
make serve-granian # Granian (alternative)
make serve-granian-http2 # Granian with HTTP/2 + TLS
# Container with Gunicorn (default)
make container-run
make container-run-gunicorn-ssl
# Container with Granian (alternative)
make container-run-granian
make container-run-granian-ssl
# Docker Compose (default uses Gunicorn)
docker compose up
Switching HTTP Servers¶
The `HTTP_SERVER` environment variable controls which server to use:
# Docker/Podman - use Gunicorn (default)
docker run mcpgateway/mcpgateway
# Docker/Podman - use Granian
docker run -e HTTP_SERVER=granian mcpgateway/mcpgateway
# Docker Compose - set in environment section
environment:
- HTTP_SERVER=gunicorn # default
# - HTTP_SERVER=granian # alternative
Configuration¶
| Variable | Default | Description |
|---|---|---|
| `GRANIAN_WORKERS` | auto (CPU cores, max 16) | Worker processes |
| `GRANIAN_RUNTIME_MODE` | auto (`mt` if >8 workers) | Runtime mode: `mt` (multi-threaded), `st` (single-threaded) |
| `GRANIAN_RUNTIME_THREADS` | 1 | Runtime threads per worker |
| `GRANIAN_BLOCKING_THREADS` | 1 | Blocking threads per worker |
| `GRANIAN_HTTP` | auto | HTTP version: `auto`, `1`, `2` |
| `GRANIAN_LOOP` | uvloop | Event loop: `uvloop`, `asyncio`, `rloop` |
| `GRANIAN_TASK_IMPL` | auto | Task implementation: `asyncio` (Python 3.12+), `rust` (older Pythons) |
| `GRANIAN_HTTP1_PIPELINE_FLUSH` | true | Aggregate HTTP/1 flushes for pipelined responses |
| `GRANIAN_HTTP1_BUFFER_SIZE` | 524288 | HTTP/1 buffer size (512 KB) |
| `GRANIAN_BACKLOG` | 2048 | Connection backlog for high concurrency |
| `GRANIAN_BACKPRESSURE` | 512 | Max concurrent requests per worker |
| `GRANIAN_RESPAWN_FAILED` | true | Auto-restart failed workers |
| `GRANIAN_DEV_MODE` | false | Enable hot reload |
| `DISABLE_ACCESS_LOG` | true | Disable access logging for performance |
Performance tuning profiles:
# High-throughput (fewer workers, more threads per worker)
GRANIAN_WORKERS=4 GRANIAN_RUNTIME_THREADS=4 make serve
# High-concurrency (more workers, max backpressure)
GRANIAN_WORKERS=16 GRANIAN_BACKPRESSURE=1024 GRANIAN_BACKLOG=4096 make serve
# Memory-constrained (fewer workers)
GRANIAN_WORKERS=2 make serve
# Force HTTP/1 only (avoids HTTP/2 overhead)
GRANIAN_HTTP=1 make serve
Notes:

- On Python 3.12+, the Rust task implementation is unavailable; asyncio is used automatically.
- uvloop provides the best performance on Linux/macOS.
- Increase `GRANIAN_BACKLOG` and `GRANIAN_BACKPRESSURE` for high-concurrency workloads.
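Translated into a container run, a high-concurrency Granian deployment might look like the sketch below; the port mapping and the numbers are illustrative, not recommendations:

```bash
docker run -p 4444:4444 \
  -e HTTP_SERVER=granian \
  -e GRANIAN_WORKERS=8 \
  -e GRANIAN_BACKPRESSURE=1024 \
  -e GRANIAN_BACKLOG=4096 \
  mcpgateway/mcpgateway
```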
When to Use Granian¶
| Use Granian when… | Use Gunicorn when… |
|---|---|
| You want native HTTP/2 | Maximum stability needed |
| Optimizing for memory | Familiar with Gunicorn |
| Simplest deployment | Need gevent/eventlet workers |
| Benchmarks show gains | Behind HTTP/2 proxy already |
Performance Comparison¶
| Metric | Gunicorn+Uvicorn | Granian |
|---|---|---|
| Simple JSON | Baseline | +20-50% (varies) |
| Memory/worker | ~80MB | ~40MB |
| HTTP/2 | Via proxy | Native |
Note: Always benchmark with your specific workload before switching servers.
Real-World Performance (Database-Bound Workload)¶
Under load testing with 2500 concurrent users against PostgreSQL:
| Metric | Gunicorn | Granian | Winner |
|---|---|---|---|
| Memory per replica | ~2.7 GiB | ~4.0 GiB | Gunicorn (32% less) |
| CPU per replica | ~740% | ~680% | Granian (8% less) |
| Throughput (RPS) | ~2000 | ~2000 | Tie (DB bottleneck) |
| Backpressure | ❌ None | ✅ Native | Granian |
| Overload behavior | Queues → OOM/timeout | 503 rejection | Granian |
Key Finding: When the database is the bottleneck, both servers achieve similar throughput. The main differences are:
- Memory: Gunicorn uses 32% less RAM (fork-based model with copy-on-write)
- CPU: Granian uses 8% less CPU (more efficient HTTP parsing in Rust)
- Stability: Granian handles overload gracefully (backpressure), Gunicorn queues indefinitely
Recommendation:
| Scenario | Choose |
|---|---|
| Memory-constrained | Gunicorn |
| Load spike protection | Granian |
| Bursty/unpredictable traffic | Granian |
| Stable traffic patterns | Either |
3 - Container resources¶
| vCPU × RAM | Good for | Notes |
|---|---|---|
| 0.5 × 1 GB | Smoke tests / CI | Smallest footprint; likely CPU-starved under load. |
| 1 × 4 GB | Typical dev / staging | Handles a few hundred RPS with the default 8 workers. |
| 2 × 8 GB | Small prod | Allows ~16-20 workers; good concurrency. |
| 4 × 16 GB+ | Heavy prod | Combine with async workers or autoscaling. |
Always test with your workload; JSON-RPC payload size and backend model latency change the equation.
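Under plain Docker/Podman, the "small prod" row above maps roughly onto explicit resource limits; a sketch (the container name is illustrative):

```bash
docker run --name mcpgateway \
  --cpus=2 --memory=8g \
  -p 4444:4444 \
  mcpgateway/mcpgateway
```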
To change your database connection settings, see the respective documentation for your selected database or managed service. For example, when using IBM Cloud Databases for PostgreSQL, you can raise the maximum number of connections.
4 - Performance testing¶
4.1 Tooling: hey¶
Install one of:
brew install hey # macOS
sudo apt install hey # Debian/Ubuntu
# or build from source
go install github.com/rakyll/hey@latest # $GOPATH/bin must be in PATH
4.2 Sample load-test script (tests/hey.sh)¶
#!/usr/bin/env bash
# Run 10 000 requests with 200 concurrent workers.
JWT="$(cat jwt.txt)" # <- place a valid token here
hey -n 10000 -c 200 \
-m POST \
-T application/json \
-H "Authorization: Bearer ${JWT}" \
-D tests/hey/payload.json \
http://localhost:4444/rpc
Payload (tests/hey/payload.json)
{
"jsonrpc": "2.0",
"id": 1,
"method": "convert_time",
"params": {
"source_timezone": "Europe/Berlin",
"target_timezone": "Europe/Dublin",
"time": "09:00"
}
}
4.3 Reading the output¶
hey prints latency distribution, requests/second, and error counts. Focus on:
- 99th-percentile latency - adjust `timeout` if it clips.
- Errors - 5xx under load often mean too few workers or DB connections.
- Throughput (RPS) - compare before/after tuning.
4.4 Common bottlenecks & fixes¶
| Symptom | Likely cause | Mitigation |
|---|---|---|
| High % of 5xx under load | Gunicorn workers exhausted | Increase workers, switch to async workers, raise CPU. |
| Latency > timeout | Long model call / external API | Increase timeout, add queueing, review upstream latency. |
| Memory OOM | Too many workers / large batch size | Lower workers, disable preload_app, add RAM. |
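When the first two symptoms appear, the server log usually confirms the cause; a sketch of watching for Gunicorn's stuck-worker kills (the container name is illustrative, the "WORKER TIMEOUT" message is Gunicorn's own):

```bash
docker logs -f mcpgateway 2>&1 | grep -i "worker timeout"
```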
5 - Logging & observability¶
- Set `loglevel = "debug"` in `gunicorn.conf.py` during tests; revert to `info` in prod.
- Forward `stdout`/`stderr` from the container to your platform's log stack (e.g. `kubectl logs`, `docker logs`); see the example after this list.
- Expose `/metrics` via a Prometheus exporter (planned) for request timing & queue depth; track enablement in the project roadmap.
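For example, tailing the gateway's output during a load test (the deployment/container names are illustrative):

```bash
kubectl logs -f deployment/mcpgateway     # Kubernetes
docker logs -f mcpgateway                 # Docker / Podman
```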
6 - Security tips while tuning¶
- Never commit real `JWT_SECRET_KEY`, DB passwords, or tokens; use `.env.example` as a template.
- Prefer platform secrets (K8s Secrets, Code Engine secrets) over baking creds into the image.
- If you enable `gevent`/`eventlet`, pin their versions and run `bandit` or `trivy` scans (example below).
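A minimal sketch of those scans (the source path and image tag are assumptions; point them at your checkout and registry):

```bash
bandit -r mcpgateway/                      # static analysis of the Python source
trivy image mcpgateway/mcpgateway:latest   # CVE scan of the built container image
```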