Gateway Tuning Guide¶
This page collects practical levers for squeezing the most performance, reliability, and observability out of ContextForge, no matter where you run the container (Code Engine, Kubernetes, Docker Compose, Nomad, etc.).
TL;DR
- Tune the runtime environment via `.env` and configure mcpgateway to use PostgreSQL and Redis.
- Adjust Gunicorn workers & time-outs in `gunicorn.conf.py`.
- Right-size CPU/RAM for the container, or spin up more instances (with shared Redis state) and adjust the database settings (e.g. connection limits).
- Benchmark with hey (or your favourite load-generator) before & after. See also: performance testing guide.
1 - Environment variables (.env)¶
| Variable | Default | Why you might change it |
|---|---|---|
AUTH_REQUIRED | true | Disable for internal/behind-VPN deployments to shave a few ms per request. |
JWT_SECRET_KEY | random | Longer key ➜ slower HMAC verify; still negligible-leave as is. |
CACHE_TYPE | database | Switch to redis or memory if your workload is read-heavy and latency-sensitive. |
DATABASE_URL | SQLite | Move to managed PostgreSQL + connection pooling for anything beyond dev tests. |
HOST/PORT | 0.0.0.0:4444 | Expose a different port or bind only to 127.0.0.1 behind a reverse-proxy. |
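A hedged starting point for a small production `.env` (hostnames and credentials below are illustrative placeholders, not shipped defaults):
AUTH_REQUIRED=true
CACHE_TYPE=redis
DATABASE_URL=postgresql+psycopg://postgres:password@pgbouncer:6432/mcp
HOST=0.0.0.0
PORT=4444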
Redis Connection Pool Tuning¶
When using CACHE_TYPE=redis, tune the connection pool for your workload:
| Variable | Default | Tuning Guidance |
|---|---|---|
REDIS_MAX_CONNECTIONS | 50 | Pool size per worker. Formula: (concurrent_requests / workers) × 1.5 |
REDIS_SOCKET_TIMEOUT | 2.0 | Lower (1.0s) for high-concurrency; Redis ops typically <100ms |
REDIS_SOCKET_CONNECT_TIMEOUT | 2.0 | Keep low to fail fast on network issues |
REDIS_HEALTH_CHECK_INTERVAL | 30 | Lower (15s) for production to detect stale connections faster |
High-concurrency production settings:
REDIS_MAX_CONNECTIONS=100
REDIS_SOCKET_TIMEOUT=1.0
REDIS_SOCKET_CONNECT_TIMEOUT=1.0
REDIS_HEALTH_CHECK_INTERVAL=15
Tip: Any change here requires rebuilding or restarting the container if you pass the file with `--env-file`.
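For example, with Docker or Podman (image name as used elsewhere in this guide):
docker run --env-file .env -p 4444:4444 mcpgateway/mcpgateway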
2 - Gunicorn settings (gunicorn.conf.py)¶
| Knob | Purpose | Rule of thumb |
|---|---|---|
workers | Parallel processes | 2-4 × vCPU for CPU-bound work; fewer if memory-bound. |
threads | Per-process threads | Use only with sync worker; keeps memory low for I/O workloads. |
timeout | Kill stuck worker | Set ≥ end-to-end model latency. E.g. 600 s for LLM calls. |
preload_app | Load app once | Saves RAM; safe for pure-Python apps. |
worker_class | Async workers | gevent or eventlet for many concurrent requests / websockets. |
`max_requests` (+ `max_requests_jitter`) | Self-healing | Recycle workers to mitigate memory leaks. |
Edit the file before building the image, then redeploy.
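A minimal sketch of those knobs in `gunicorn.conf.py`, assuming an 8-vCPU host with mostly I/O-bound traffic (values are illustrative, not project defaults):
workers = 16                 # 2 × vCPU; move toward 4 × vCPU if CPU-bound
timeout = 600                # >= end-to-end model latency for long LLM calls
preload_app = True           # load the app once to save RAM
max_requests = 1000          # recycle workers to mitigate memory leaks
max_requests_jitter = 100    # stagger recycling so workers don't restart together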
2b - Uvicorn Performance Extras¶
ContextForge uses uvicorn[standard] which includes high-performance components that are automatically detected and used:
| Package | Purpose | Platform | Improvement |
|---|---|---|---|
uvloop | Fast event loop (libuv-based, Cython) | Linux, macOS | 20-40% lower latency |
httptools | Fast HTTP parsing (C extension) | All platforms | 40-60% faster parsing |
websockets | Optimized WebSocket handling | All platforms | Better WS performance |
watchfiles | Fast file watching for --reload | All platforms | Faster dev cycle |
Automatic Detection¶
When Gunicorn spawns Uvicorn workers, these components are automatically detected:
# Verify extras are installed
pip list | grep -E "uvloop|httptools|websockets|watchfiles"
# Expected output (Linux/macOS):
# httptools 0.6.x
# uvloop 0.21.x
# websockets 15.x.x
# watchfiles 1.x.x
Platform Notes¶
- Linux/macOS: Full performance benefits (uvloop + httptools)
- Windows: httptools provides benefits; uvloop unavailable (graceful fallback to asyncio)
Performance Impact¶
Combined improvements from uvloop and httptools:
| Workload | Improvement |
|---|---|
| Simple JSON endpoints | 15-25% faster |
| High-concurrency requests | 20-30% higher throughput |
| WebSocket connections | Lower latency, better handling |
| Development `--reload` | Faster file change detection |
Note: These optimizations are transparent - no code or configuration changes needed.
2c - Granian (Alternative HTTP Server)¶
ContextForge supports two HTTP servers:
- Gunicorn + Uvicorn (default) - Battle-tested, mature, excellent stability
- Granian (alternative) - Rust-based, native HTTP/2, lower memory
Usage¶
# Local development
make serve # Gunicorn + Uvicorn (default)
make serve-granian # Granian (alternative)
make serve-granian-http2 # Granian with HTTP/2 + TLS
# Container with Gunicorn (default)
make container-run
make container-run-gunicorn-ssl
# Container with Granian (alternative)
make container-run-granian
make container-run-granian-ssl
# Docker Compose (default uses Gunicorn)
docker compose up
Switching HTTP Servers¶
The HTTP_SERVER environment variable controls which server to use:
# Docker/Podman - use Gunicorn (default)
docker run mcpgateway/mcpgateway
# Docker/Podman - use Granian
docker run -e HTTP_SERVER=granian mcpgateway/mcpgateway
# Docker Compose - set in environment section
environment:
- HTTP_SERVER=gunicorn # default
# - HTTP_SERVER=granian # alternative
Configuration¶
| Variable | Default | Description |
|---|---|---|
GRANIAN_WORKERS | auto (CPU cores, max 16) | Worker processes |
GRANIAN_RUNTIME_MODE | auto (mt if >8 workers) | Runtime mode: mt (multi-threaded), st (single-threaded) |
GRANIAN_RUNTIME_THREADS | 1 | Runtime threads per worker |
GRANIAN_BLOCKING_THREADS | 1 | Blocking threads per worker |
GRANIAN_HTTP | auto | HTTP version: auto, 1, 2 |
GRANIAN_LOOP | uvloop | Event loop: uvloop, asyncio, rloop |
GRANIAN_TASK_IMPL | auto | Task implementation: asyncio (Python 3.12+), rust (older) |
GRANIAN_HTTP1_PIPELINE_FLUSH | true | Aggregate HTTP/1 flushes for pipelined responses |
GRANIAN_HTTP1_BUFFER_SIZE | 524288 | HTTP/1 buffer size (512KB) |
GRANIAN_BACKLOG | 2048 | Connection backlog for high concurrency |
GRANIAN_BACKPRESSURE | 512 | Max concurrent requests per worker |
GRANIAN_RESPAWN_FAILED | true | Auto-restart failed workers |
GRANIAN_DEV_MODE | false | Enable hot reload |
DISABLE_ACCESS_LOG | true | Disable access logging for performance |
TEMPLATES_AUTO_RELOAD | false | Disable Jinja2 template auto-reload for production |
Performance tuning profiles:
# High-throughput (fewer workers, more threads per worker)
GRANIAN_WORKERS=4 GRANIAN_RUNTIME_THREADS=4 make serve
# High-concurrency (more workers, max backpressure)
GRANIAN_WORKERS=16 GRANIAN_BACKPRESSURE=1024 GRANIAN_BACKLOG=4096 make serve
# Memory-constrained (fewer workers)
GRANIAN_WORKERS=2 make serve
# Force HTTP/1 only (avoids HTTP/2 overhead)
GRANIAN_HTTP=1 make serve
Notes:
- On Python 3.12+, the Rust task implementation is unavailable; asyncio is used automatically.
- `uvloop` provides the best performance on Linux/macOS.
- Increase `GRANIAN_BACKLOG` and `GRANIAN_BACKPRESSURE` for high-concurrency workloads.
Backpressure for Overload Protection¶
Granian's native backpressure prevents unbounded request queuing during overload. When the server reaches capacity, excess requests receive immediate 503 responses instead of waiting in a queue (which can cause memory exhaustion or cascading timeouts).
How it works:
Incoming Request
│
▼
┌──────────────────────────────────┐
│ Granian Worker (1 of N) │
│ │
│ current_requests < BACKPRESSURE?│
│ │ │
│ ├── YES → Process request │
│ │ │
│ └── NO → Immediate 503 │
│ (no queuing) │
└──────────────────────────────────┘
Capacity calculation:
Total capacity = GRANIAN_WORKERS × GRANIAN_BACKPRESSURE
Example with recommended settings:
Workers: 16
Backpressure: 64
Total: 16 × 64 = 1024 concurrent requests
Requests 1-1024: Processed normally
Request 1025+: Immediate 503 Service Unavailable
Recommended production settings:
# docker-compose.yml or Kubernetes
environment:
- HTTP_SERVER=granian
- GRANIAN_WORKERS=16
- GRANIAN_BACKLOG=4096 # OS socket queue for pending connections
- GRANIAN_BACKPRESSURE=64 # Per-worker limit (16×64=1024 total)
Benefits over unbounded queuing:
| Behavior | Without Backpressure | With Backpressure |
|---|---|---|
| Under overload | Requests queue indefinitely | Excess rejected immediately |
| Memory usage | Grows unbounded → OOM | Stays bounded |
| Client experience | Long timeouts, then failure | Fast 503, can retry |
| Health checks | May timeout (queued) | Always respond quickly |
| Recovery | Slow (drain queue) | Instant (no queue) |
When to Use Granian¶
| Use Granian when… | Use Gunicorn when… |
|---|---|
| You want native HTTP/2 | Maximum stability needed |
| Optimizing for memory | Familiar with Gunicorn |
| Simplest deployment | Need gevent/eventlet workers |
| Benchmarks show gains | Behind HTTP/2 proxy already |
Performance Comparison¶
| Metric | Gunicorn+Uvicorn | Granian |
|---|---|---|
| Simple JSON | Baseline | +20-50% (varies) |
| Memory/worker | ~80MB | ~40MB |
| HTTP/2 | Via proxy | Native |
Note: Always benchmark with your specific workload before switching servers.
Real-World Performance (Database-Bound Workload)¶
Under load testing with 2500 concurrent users against PostgreSQL:
| Metric | Gunicorn | Granian | Winner |
|---|---|---|---|
| Memory per replica | ~2.7 GiB | ~4.0 GiB | Gunicorn (32% less) |
| CPU per replica | ~740% | ~680% | Granian (8% less) |
| Throughput (RPS) | ~2000 | ~2000 | Tie (DB bottleneck) |
| Backpressure | ❌ None | ✅ Native | Granian |
| Overload behavior | Queues → OOM/timeout | 503 rejection | Granian |
Key Finding: When the database is the bottleneck, both servers achieve similar throughput. The main differences are:
- Memory: Gunicorn uses 32% less RAM (fork-based model with copy-on-write)
- CPU: Granian uses 8% less CPU (more efficient HTTP parsing in Rust)
- Stability: Granian handles overload gracefully (backpressure), Gunicorn queues indefinitely
Recommendation:
| Scenario | Choose |
|---|---|
| Memory-constrained | Gunicorn |
| Load spike protection | Granian |
| Bursty/unpredictable traffic | Granian |
| Stable traffic patterns | Either |
3 - Container resources¶
| vCPU × RAM | Good for | Notes |
|---|---|---|
0.5 × 1 GB | Smoke tests / CI | Smallest footprint; likely CPU-starved under load. |
1 × 4 GB | Typical dev / staging | Handles a few hundred RPS with default 8 workers. |
2 × 8 GB | Small prod | Allows ~16-20 workers; good concurrency. |
4 × 16 GB+ | Heavy prod | Combine with async workers or autoscaling. |
Always test with your workload; JSON-RPC payload size and backend model latency change the equation.
To change your database connection settings, see the respective documentation for your selected database or managed service. For example, when using IBM Cloud Databases for PostgreSQL - you can raise the maximum number of connections.
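For instance, a hedged Docker invocation pinning the container to the 2 vCPU × 8 GB row above:
docker run --cpus=2 --memory=8g --env-file .env -p 4444:4444 mcpgateway/mcpgateway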
4 - Performance testing¶
4.1 Tooling: hey¶
Install one of:
brew install hey # macOS
sudo apt install hey # Debian/Ubuntu
# or build from source
go install github.com/rakyll/hey@latest # $GOPATH/bin must be in PATH
4.2 Sample load-test script (tests/hey.sh)¶
#!/usr/bin/env bash
# Run 10 000 requests with 200 concurrent workers.
JWT="$(cat jwt.txt)" # <- place a valid token here
hey -n 10000 -c 200 \
-m POST \
-T application/json \
-H "Authorization: Bearer ${JWT}" \
-D tests/hey/payload.json \
http://localhost:4444/rpc
Payload (tests/hey/payload.json)
{
"jsonrpc": "2.0",
"id": 1,
"method": "convert_time",
"params": {
"source_timezone": "Europe/Berlin",
"target_timezone": "Europe/Dublin",
"time": "09:00"
}
}
4.3 Reading the output¶
hey prints latency distribution, requests/second, and error counts. Focus on:
- 99th percentile latency - adjust `timeout` if it clips.
- Errors - 5xx under load often mean too few workers or DB connections.
- Throughput (RPS) - compare before/after tuning.
4.4 Common bottlenecks & fixes¶
| Symptom | Likely cause | Mitigation |
|---|---|---|
| High % of 5xx under load | Gunicorn workers exhausted | Increase workers, switch to async workers, raise CPU. |
| Latency > timeout | Long model call / external API | Increase timeout, add queueing, review upstream latency. |
| Memory OOM | Too many workers / large batch size | Lower workers, disable preload_app, add RAM. |
5 - Logging & observability¶
- Set `loglevel = "debug"` in `gunicorn.conf.py` during tests; revert to `info` in prod.
- Forward `stdout`/`stderr` from the container to your platform's log stack (e.g. `kubectl logs`, `docker logs`).
- Expose `/metrics` via a Prometheus exporter (planned) for request timing & queue depth; track enablement in the project roadmap.
6 - MCP Session Pool Tuning¶
The MCP session pool maintains persistent connections to upstream MCP servers, providing 10-20x latency improvement for repeated tool calls from the same user.
Disabled by Default
Session pooling is disabled by default for safety. Enable it explicitly after testing in your environment:
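# Opt in only after validating session reuse against your upstream MCP servers
MCP_SESSION_POOL_ENABLED=true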
When to Enable Pooling¶
| Enable pooling when… | Avoid or tighten isolation when… |
|---|---|
| MCP servers are stable and latency matters | MCP servers maintain per-session state |
| You can tolerate session reuse within user/tenant scope | You rely on request-scoped headers for security/tracing |
| High-throughput tool invocations | Long-running tools (>30s) need custom timeouts |
Configuration Variables¶
| Variable | Default | Description |
|---|---|---|
MCP_SESSION_POOL_ENABLED | false | Enable/disable session pooling |
MCP_SESSION_POOL_MAX_PER_KEY | 10 | Max sessions per (URL, identity, transport). Increase to 50-200 for high concurrency. |
MCP_SESSION_POOL_TTL | 300.0 | Session TTL before forced close (seconds) |
MCP_SESSION_POOL_TRANSPORT_TIMEOUT | 30.0 | Timeout for all HTTP operations (seconds) |
MCP_SESSION_POOL_HEALTH_CHECK_INTERVAL | 60.0 | Idle time before health check (seconds) |
MCP_SESSION_POOL_HEALTH_CHECK_METHODS | ping,skip | Ordered list of health check methods (ping, list_tools, list_prompts, list_resources, skip) |
MCP_SESSION_POOL_HEALTH_CHECK_TIMEOUT | 5.0 | Timeout per health check attempt (seconds) |
MCP_SESSION_POOL_ACQUIRE_TIMEOUT | 30.0 | Timeout waiting for session slot |
MCP_SESSION_POOL_CREATE_TIMEOUT | 30.0 | Timeout creating new session |
MCP_SESSION_POOL_IDLE_EVICTION | 600.0 | Evict idle pool keys after (seconds) |
MCP_SESSION_POOL_CIRCUIT_BREAKER_THRESHOLD | 5 | Consecutive failures before circuit opens |
MCP_SESSION_POOL_CIRCUIT_BREAKER_RESET | 60.0 | Circuit reset time (seconds) |
Recommended Production Settings¶
# Baseline settings for authenticated deployments (low-to-moderate traffic)
MCP_SESSION_POOL_ENABLED=true
MCP_SESSION_POOL_MAX_PER_KEY=10
MCP_SESSION_POOL_TTL=300
MCP_SESSION_POOL_TRANSPORT_TIMEOUT=30
# High-concurrency settings (1000+ concurrent users)
MCP_SESSION_POOL_ENABLED=true
MCP_SESSION_POOL_MAX_PER_KEY=200 # 50-200 for high concurrency
MCP_SESSION_POOL_TTL=300
MCP_SESSION_POOL_TRANSPORT_TIMEOUT=30
MCP_SESSION_POOL_ACQUIRE_TIMEOUT=60 # Longer timeout under load
# Ensure identity headers are present
ENABLE_HEADER_PASSTHROUGH=true
DEFAULT_PASSTHROUGH_HEADERS="Authorization,X-Tenant-Id,X-User-Id,X-API-Key"
Session Isolation¶
Sessions are isolated by a composite key: (URL, identity_hash, transport_type). Identity is derived from authentication headers (Authorization, X-Tenant-ID, X-User-ID, X-API-Key, Cookie).
Key security considerations:
- Anonymous Pooling: When no identity headers are present, identity collapses to `"anonymous"` and all such requests share sessions. This is safe only if upstream MCP servers are stateless.
- Shared Credentials: With OAuth Client Credentials or static API keys, all users share the same identity hash. Only safe if the upstream MCP server has no per-user state.
- Header Passthrough: If gateway auth is disabled (`AUTH_REQUIRED=false`), enable header passthrough to preserve user identity (see the example below):
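# Preserve caller identity for session pool keys when AUTH_REQUIRED=false
ENABLE_HEADER_PASSTHROUGH=true
DEFAULT_PASSTHROUGH_HEADERS="Authorization,X-Tenant-Id,X-User-Id,X-API-Key"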
Long-Running Tools¶
The transport timeout applies to all HTTP operations, not just connection establishment. For tools that take longer than 30 seconds:
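# Raise the transport timeout so tool calls longer than 30 s are not cut off (120 s is illustrative)
MCP_SESSION_POOL_TRANSPORT_TIMEOUT=120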
Health Check Timeout Trade-offs¶
Pool staleness checks use MCP_SESSION_POOL_TRANSPORT_TIMEOUT (default 30s) for session acquisition. When MCP_SESSION_POOL_EXPLICIT_HEALTH_RPC=true, the explicit RPC call uses HEALTH_CHECK_TIMEOUT (default 5s).
Behavior summary:
| Check Type | Timeout Used | Default |
|---|---|---|
| Pool staleness check (idle > interval) | MCP_SESSION_POOL_TRANSPORT_TIMEOUT | 30s |
| Explicit health RPC (when enabled) | HEALTH_CHECK_TIMEOUT | 5s |
| Session creation | MCP_SESSION_POOL_CREATE_TIMEOUT | 30s |
Trade-off: The 30s transport timeout allows long-running tools to complete but means unhealthy sessions may take longer to detect. If you need faster failure detection:
# Stricter health checks (5s timeout for explicit RPC)
MCP_SESSION_POOL_EXPLICIT_HEALTH_RPC=true
HEALTH_CHECK_TIMEOUT=5
# Or reduce transport timeout (affects all operations)
MCP_SESSION_POOL_TRANSPORT_TIMEOUT=10
Circuit Breaker Behavior¶
The circuit breaker is keyed by URL only (not per-identity). After MCP_SESSION_POOL_CIRCUIT_BREAKER_THRESHOLD consecutive session creation failures for a URL, the circuit opens and all requests fail fast for MCP_SESSION_POOL_CIRCUIT_BREAKER_RESET seconds.
Note: Only session creation failures (connection refused, SSL errors) trip the circuit. Tool call failures do not affect the circuit breaker.
Monitoring¶
Monitor pool performance via the metrics endpoint:
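# Assumes the gateway on localhost:4444 and a valid admin JWT; endpoint path as listed in the checklist below
curl -s -H "Authorization: Bearer ${JWT}" http://localhost:4444/admin/mcp-pool/metrics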
Response includes:
- `total_sessions_created` / `total_sessions_reused`: Pool hit ratio
- `pool_hits` / `pool_misses`: Cache effectiveness
- `active_sessions`: Current utilization
- `circuit_breaker_states`: Per-URL circuit status
Operational Checklist¶
Before enabling pooling in production:
- Confirm upstream MCP servers are stateless for any shared/anonymous access
- Verify identity headers are present and stable
- Validate tool call durations vs `MCP_SESSION_POOL_TRANSPORT_TIMEOUT`
- Ensure tracing headers are not relied upon in pooled sessions
After enabling pooling:
- Monitor pool metrics at `/admin/mcp-pool/metrics`
- Watch for increased tool timeouts or unexpected auth failures
- Verify correlation IDs in upstream logs (note: per-request headers are stripped from pooled sessions)
7 - Nginx Reverse Proxy Tuning¶
When deploying ContextForge behind nginx (as in the default docker-compose.yml), several optimizations can significantly improve performance under load.
Admin UI Caching¶
Admin pages use Jinja2 template rendering which is CPU-intensive under high concurrency. The default nginx configuration enables short-TTL caching with multi-tenant isolation:
| Setting | Value | Purpose |
|---|---|---|
proxy_cache_valid | 5s | Short TTL keeps data fresh while reducing backend load |
Cache-Control | private | Prevents CDNs/proxies from caching user-specific content |
| Cache key | Includes auth tokens | Per-user isolation prevents data leakage |
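A hedged sketch of what these settings look like in `infra/nginx/nginx.conf` (the cache zone name `admin_cache` is illustrative):
location /admin {
    proxy_cache admin_cache;                                   # short-TTL, per-user cache zone
    proxy_cache_valid 200 5s;                                  # 5s TTL keeps data fresh
    add_header Cache-Control "private" always;                 # stop CDNs/proxies caching user data
    add_header X-Cache-Status $upstream_cache_status always;   # surfaced for the verification step below
    # ... proxy_pass and the proxy_cache_key shown below
}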
Performance impact (4000 concurrent users):
| Metric | Without Caching | With Caching | Improvement |
|---|---|---|---|
/admin/ response time | 5414ms | 199ms | 96% |
| Throughput | ~2400 RPS | ~4000 RPS | 67% |
Multi-Tenant Cache Safety¶
The cache key includes all authentication credentials to ensure user isolation:
proxy_cache_key "$scheme$request_method$host$request_uri$is_args$args$http_authorization$cookie_jwt_token$cookie_access_token";
This ensures:
- Bearer token auth (`$http_authorization`): API clients get isolated caches
- Primary session cookie (`$cookie_jwt_token`): Browser users get isolated caches
- Alternative auth cookie (`$cookie_access_token`): Fallback auth method also isolated
Verifying Cache Behavior¶
Check the X-Cache-Status header to verify caching is working:
curl -I http://localhost:8080/admin/ -b "jwt_token=..." | grep X-Cache
# X-Cache-Status: MISS (first request)
# X-Cache-Status: HIT (subsequent requests within 5s)
# X-Cache-Status: STALE (background refresh in progress)
Disabling Admin Caching¶
If you need real-time admin data or have concerns about caching, modify infra/nginx/nginx.conf:
location /admin {
proxy_cache off;
add_header Cache-Control "no-cache, no-store, must-revalidate" always;
# ... rest of config
}
8 - High-Concurrency Production Tuning¶
This section covers comprehensive tuning for deployments handling 1000+ concurrent users. These settings have been tested under load with 6500 concurrent users.
8.1 Database Connection Pool (SQLAlchemy)¶
The gateway's internal connection pool manages connections between the application and PgBouncer (or PostgreSQL directly).
| Variable | Default | High-Concurrency | Description |
|---|---|---|---|
DB_POOL_CLASS | auto | queue | Pool implementation. Use queue with PgBouncer, null for safest option |
DB_POOL_PRE_PING | false | true | Validate connections before use (SELECT 1). Prevents stale connection errors |
DB_POOL_SIZE | 5 | 20 | Persistent connections per worker. Formula: (concurrent_users / workers) × 0.5 |
DB_MAX_OVERFLOW | 10 | 10 | Extra connections allowed during spikes |
DB_POOL_TIMEOUT | 30 | 60 | Seconds to wait for available connection before error |
DB_POOL_RECYCLE | 3600 | 60 | Recycle connections after N seconds. Must be less than PgBouncer CLIENT_IDLE_TIMEOUT |
Example high-concurrency configuration:
# With PgBouncer (recommended)
DB_POOL_CLASS=queue
DB_POOL_PRE_PING=true
DB_POOL_SIZE=20
DB_MAX_OVERFLOW=10
DB_POOL_TIMEOUT=60
DB_POOL_RECYCLE=60 # Half of PgBouncer CLIENT_IDLE_TIMEOUT (120s)
Common errors and solutions:
| Error | Cause | Solution |
|---|---|---|
QueuePool limit reached, connection timed out | Pool too small for load | Increase DB_POOL_SIZE (e.g., 5→20) |
idle transaction timeout | Transactions not committed | Ensure all endpoints call db.commit() |
connection reset by peer | PgBouncer recycled stale connection | Set DB_POOL_RECYCLE < CLIENT_IDLE_TIMEOUT |
8.2 PgBouncer Connection Pooler¶
PgBouncer multiplexes many application connections into fewer PostgreSQL connections, dramatically reducing database overhead.
Client-Side Settings (from gateway workers)¶
| Variable | Default | High-Concurrency | Description |
|---|---|---|---|
MAX_CLIENT_CONN | 1000 | 5000-15000 | Max connections from all gateway workers. Formula: replicas × workers × pool_size × 2 |
DEFAULT_POOL_SIZE | 20 | 600 | Shared connections to PostgreSQL per database |
MIN_POOL_SIZE | 0 | 100 | Pre-warmed connections for instant response |
RESERVE_POOL_SIZE | 0 | 150 | Emergency pool for burst traffic |
RESERVE_POOL_TIMEOUT | 5 | 2 | Seconds before tapping reserve pool |
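Worked example for MAX_CLIENT_CONN, assuming each gateway worker can open DB_POOL_SIZE + DB_MAX_OVERFLOW = 30 database connections:
5 replicas × 16 workers × 30 connections × 2 safety factor = 4800 → round up to 5000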
Server-Side Settings (to PostgreSQL)¶
| Variable | Default | High-Concurrency | Description |
|---|---|---|---|
MAX_DB_CONNECTIONS | 100 | 700 | Max connections to PostgreSQL. Must be < PostgreSQL max_connections |
MAX_USER_CONNECTIONS | 100 | 700 | Per-user limit, typically equals MAX_DB_CONNECTIONS |
SERVER_LIFETIME | 3600 | 1800-3600 | Recycle server connections after N seconds |
SERVER_IDLE_TIMEOUT | 600 | 600 | Close unused server connections after N seconds |
Timeout Settings¶
| Variable | Default | High-Concurrency | Description |
|---|---|---|---|
QUERY_WAIT_TIMEOUT | 120 | 60 | Max wait for available connection |
CLIENT_IDLE_TIMEOUT | 0 | 120-300 | Close idle client connections. Gateway DB_POOL_RECYCLE must be less than this |
SERVER_CONNECT_TIMEOUT | 15 | 5 | Timeout for new PostgreSQL connections |
IDLE_TRANSACTION_TIMEOUT | 0 | 60-300 | Kill transactions idle > N seconds. Critical for preventing connection starvation |
Transaction Reset Settings¶
| Variable | Default | High-Concurrency | Description |
|---|---|---|---|
SERVER_RESET_QUERY | DISCARD ALL | DISCARD ALL | Reset connection state when returned to pool |
SERVER_RESET_QUERY_ALWAYS | 0 | 1 | Always run reset query even after clean transactions |
POOL_MODE | session | transaction | Connection returned after each transaction (required for web apps) |
Example PgBouncer configuration (docker-compose.yml):
pgbouncer:
image: edoburu/pgbouncer:latest
environment:
- DATABASE_URL=postgres://postgres:password@postgres:5432/mcp
- POOL_MODE=transaction
# Client limits
- MAX_CLIENT_CONN=5000
- DEFAULT_POOL_SIZE=600
- MIN_POOL_SIZE=100
- RESERVE_POOL_SIZE=150
# Server limits
- MAX_DB_CONNECTIONS=700
- SERVER_LIFETIME=1800
- SERVER_IDLE_TIMEOUT=600
# Timeouts
- QUERY_WAIT_TIMEOUT=60
- CLIENT_IDLE_TIMEOUT=120
- IDLE_TRANSACTION_TIMEOUT=60
# Reset
- SERVER_RESET_QUERY=DISCARD ALL
- SERVER_RESET_QUERY_ALWAYS=1
ulimits:
nofile:
soft: 65536
hard: 65536
8.3 Container File Descriptor Limits (ulimits)¶
Each network connection requires a file descriptor. Containers default to 1024 soft limit, which is insufficient for high concurrency.
| Container | Recommended nofile | Rationale |
|---|---|---|
| PgBouncer | 65536 | MAX_CLIENT_CONN + MAX_DB_CONNECTIONS + overhead |
| PostgreSQL | 8192 | max_connections + internal FDs |
| Redis | 65536 | maxclients + overhead |
| Gateway | 65536 | HTTP connections + DB connections + MCP sessions |
| Nginx | 65535 | worker_connections × workers |
docker-compose.yml example:
services:
pgbouncer:
ulimits:
nofile:
soft: 65536
hard: 65536
postgres:
ulimits:
nofile:
soft: 8192
hard: 8192
redis:
ulimits:
nofile:
soft: 65536
hard: 65536
gateway:
ulimits:
nofile:
soft: 65535
hard: 65535
Verification:
# Check container limits
docker exec <container> cat /proc/1/limits | grep "open files"
# Count current open FDs
docker exec <container> ls /proc/1/fd | wc -l
Common error: `accept() failed: No file descriptors available` - increase `ulimits.nofile`.
8.4 Host System Tuning (sysctl)¶
The Docker host kernel settings affect all containers. These must be set on the host, not in containers.
| Setting | Default | High-Concurrency | Description |
|---|---|---|---|
net.core.somaxconn | 128 | 65535 | Max socket listen backlog |
net.core.netdev_max_backlog | 1000 | 65535 | Max packets queued before processing |
net.ipv4.tcp_max_syn_backlog | 128 | 65535 | Max SYN packets pending connection |
net.ipv4.tcp_fin_timeout | 60 | 15 | Faster TIME_WAIT cleanup |
net.ipv4.tcp_tw_reuse | 0 | 1 | Reuse TIME_WAIT sockets |
net.ipv4.ip_local_port_range | 32768 60999 | 1024 65535 | More ephemeral ports |
fs.file-max | varies | 2097152 | System-wide file descriptor limit |
Apply temporarily:
sudo sysctl -w \
net.core.somaxconn=65535 \
net.core.netdev_max_backlog=65535 \
net.ipv4.tcp_max_syn_backlog=65535 \
net.ipv4.tcp_fin_timeout=15 \
net.ipv4.tcp_tw_reuse=1 \
net.ipv4.ip_local_port_range="1024 65535"
Apply permanently (/etc/sysctl.d/99-mcp-loadtest.conf):
# High-concurrency TCP tuning for ContextForge load testing
net.core.somaxconn = 65535
net.core.netdev_max_backlog = 65535
net.ipv4.tcp_max_syn_backlog = 65535
net.ipv4.tcp_fin_timeout = 15
net.ipv4.tcp_tw_reuse = 1
net.ipv4.ip_local_port_range = 1024 65535
fs.file-max = 2097152
Then apply: sudo sysctl -p /etc/sysctl.d/99-mcp-loadtest.conf
8.5 HTTPX Client Pool (Outbound HTTP)¶
The gateway uses a shared HTTPX client pool for all outbound requests (federation, health checks, A2A, MCP tool calls).
| Variable | Default | High-Concurrency | Description |
|---|---|---|---|
HTTPX_MAX_CONNECTIONS | 100 | 200 | Total connections in pool |
HTTPX_MAX_KEEPALIVE_CONNECTIONS | 20 | 100 | Persistent keepalive connections |
HTTPX_KEEPALIVE_EXPIRY | 5.0 | 30.0 | Idle connection expiry (seconds) |
HTTPX_CONNECT_TIMEOUT | 5.0 | 5.0 | TCP connection timeout |
HTTPX_READ_TIMEOUT | 30.0 | 120.0 | Response read timeout (increase for slow tools) |
HTTPX_POOL_TIMEOUT | 5.0 | 10.0 | Wait for available connection |
Example:
HTTPX_MAX_CONNECTIONS=200
HTTPX_MAX_KEEPALIVE_CONNECTIONS=100
HTTPX_KEEPALIVE_EXPIRY=30.0
HTTPX_READ_TIMEOUT=120.0
HTTPX_POOL_TIMEOUT=10.0
8.6 Complete High-Concurrency Configuration¶
Here's a complete configuration for 3000-6500 concurrent users:
# docker-compose.yml gateway environment
environment:
# Database pool (via PgBouncer)
- DATABASE_URL=postgresql+psycopg://postgres:password@pgbouncer:6432/mcp
- DB_POOL_CLASS=queue
- DB_POOL_PRE_PING=true
- DB_POOL_SIZE=20
- DB_MAX_OVERFLOW=10
- DB_POOL_TIMEOUT=60
- DB_POOL_RECYCLE=60
# MCP Session Pool
- MCP_SESSION_POOL_ENABLED=true
- MCP_SESSION_POOL_MAX_PER_KEY=200
- MCP_SESSION_POOL_ACQUIRE_TIMEOUT=60
# HTTPX Client Pool
- HTTPX_MAX_CONNECTIONS=200
- HTTPX_MAX_KEEPALIVE_CONNECTIONS=100
- HTTPX_READ_TIMEOUT=120.0
# Redis
- REDIS_MAX_CONNECTIONS=150
# Performance
- LOG_LEVEL=ERROR
- DISABLE_ACCESS_LOG=true
- AUDIT_TRAIL_ENABLED=false
9 - MCP Streamable HTTP Transport Tuning¶
The MCP Streamable HTTP endpoint (/servers/{id}/mcp) has its own performance characteristics distinct from the REST API (/rpc). This section covers settings that specifically affect MCP protocol performance.
Critical Settings¶
These settings have the largest measured impact on MCP throughput:
| Setting | Recommended | Impact | Notes |
|---|---|---|---|
MCP_SESSION_POOL_ENABLED | true | ~10% RPS improvement | Reuses upstream MCP server connections |
JSON_RESPONSE_ENABLED | true | Required — disabling breaks JSON clients | Returns JSON instead of SSE framing |
USE_STATEFUL_SESSIONS | false | Required for multi-replica | Stateful sessions are per-process |
DB_POOL_CLASS | queue | Required — null causes ~55% RPS loss | QueuePool with PgBouncer |
AUTH_CACHE_ENABLED | true | Significant — reduces per-request DB queries | Caches user/team/role lookups |
CACHE_TYPE | redis | Significant — shared cache across workers | Enables cross-worker caching |
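Taken together, a hedged env baseline covering the critical settings above:
MCP_SESSION_POOL_ENABLED=true
JSON_RESPONSE_ENABLED=true
USE_STATEFUL_SESSIONS=false
DB_POOL_CLASS=queue
AUTH_CACHE_ENABLED=true
CACHE_TYPE=redis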
Settings with Negligible Impact¶
These settings were tested and found to have less than ~3% effect on MCP throughput. Tune them for correctness, not performance:
| Setting | Why Negligible |
|---|---|
VALIDATION_MIDDLEWARE_ENABLED | MCP requests bypass most validation |
DB_METRICS_RECORDING_ENABLED | Writes are buffered, minimal per-request overhead |
REGISTRY_CACHE_ENABLED | MCP handlers use their own DB queries, not the registry cache |
PERFORMANCE_TRACKING_ENABLED | Lightweight in-memory tracking |
TOKEN_USAGE_LOGGING_ENABLED | Rate-limited to one DB write per 5 minutes per token |
CORRELATION_ID_ENABLED | Adds/reads one header per request |
MCP SDK / FastMCP Tunables¶
When running MCP servers behind the gateway:
| Setting | Recommendation | Why |
|---|---|---|
MCP_SESSION_POOL_HEALTH_CHECK_METHODS | ["skip"] for max throughput | Skips health checks on pooled sessions |
MCP_SESSION_POOL_CLEANUP_TIMEOUT | 0.5 | Fast cleanup of stale sessions |
Upstream stateless_http | true for multi-replica servers | Avoids session affinity requirements |
Upstream json_response | true for unary tool calls | Removes SSE framing overhead |
terminate_on_close | false for pooled sessions | Prevents session teardown on reuse |
| Streamable HTTP over SSE | Prefer Streamable HTTP | Recommended production transport |
MCP Worker Tuning¶
The gateway worker count (GUNICORN_WORKERS) directly affects MCP throughput:
- Too few workers: CPU underutilized, low RPS
- Too many workers: Excessive context switching, DB connection pressure
- Rule of thumb: scale `GUNICORN_WORKERS` with the CPU cores available to the container; 24 workers on an 8-CPU container (3 × vCPU) has proven well-tuned for MCP workloads.
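For example, on an 8-CPU container (adjust after benchmarking your workload):
GUNICORN_WORKERS=24   # 3 × vCPU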
Auth Cache TTL Guidance¶
The auth cache prevents repeated database lookups for the same JWT token. Higher TTLs reduce DB pressure but delay permission changes:
| Setting | Max Allowed | Recommendation |
|---|---|---|
AUTH_CACHE_USER_TTL | 300s | 300s (max) for performance |
AUTH_CACHE_TEAM_TTL | 300s | 300s (max) for performance |
AUTH_CACHE_ROLE_TTL | 300s | 300s (max) for performance |
AUTH_CACHE_REVOCATION_TTL | 120s | 120s (max) — security-sensitive |
AUTH_CACHE_BATCH_QUERIES | — | true — batches three queries into one |
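A hedged example that maxes out the allowed TTLs from the table above:
AUTH_CACHE_ENABLED=true
AUTH_CACHE_BATCH_QUERIES=true
AUTH_CACHE_USER_TTL=300
AUTH_CACHE_TEAM_TTL=300
AUTH_CACHE_ROLE_TTL=300
AUTH_CACHE_REVOCATION_TTL=120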
10 - Disable unused features¶
ContextForge has many optional features that are enabled by default for completeness but consume CPU, memory, or database I/O even when not used. Disabling features you do not need is a free performance win that does not require additional resources.
High-impact features to disable¶
These features have measurable per-request or background overhead. Disable any you are not actively using:
| Feature | Setting | Default | Overhead When Enabled |
|---|---|---|---|
| Admin UI | MCPGATEWAY_UI_ENABLED | false | Template rendering CPU, admin middleware checks on every request |
| Admin API | MCPGATEWAY_ADMIN_API_ENABLED | false | Exposes additional endpoints, admin auth middleware |
| A2A protocol | MCPGATEWAY_A2A_ENABLED | true | A2A router registration, agent discovery, metrics tracking |
| A2A metrics | MCPGATEWAY_A2A_METRICS_ENABLED | true | DB writes per A2A invocation |
| LLM chat | LLMCHAT_ENABLED | true | Session management, Redis locks, chat routing middleware |
| Catalog | MCPGATEWAY_CATALOG_ENABLED | true | Catalog server sync, background health checks |
| Plugins | PLUGINS_ENABLED | false | Plugin discovery, hook dispatch on every request |
| DB metrics recording | DB_METRICS_RECORDING_ENABLED | true | One buffered DB write per tool/resource/prompt execution |
| Token usage logging | TOKEN_USAGE_LOGGING_ENABLED | true | DB write per unique token per 5-minute window |
| Structured logging (DB) | STRUCTURED_LOGGING_DATABASE_ENABLED | false | DB writes per log entry (high overhead) |
| Audit trail | AUDIT_TRAIL_ENABLED | false | DB write per mutating request (high overhead) |
| Security logging | SECURITY_LOGGING_ENABLED | false | DB writes for auth events |
| Observability | OBSERVABILITY_ENABLED | false | Span creation, trace storage, request instrumentation |
| Prometheus | ENABLE_METRICS | false | Per-request histogram updates, /metrics endpoint |
| Net connections count | MCPGATEWAY_PERFORMANCE_NET_CONNECTIONS_ENABLED | true | psutil.net_connections() call in performance stats |
Medium-impact features¶
These add some overhead but are useful in most deployments. Disable only if you are certain you do not need them:
| Feature | Setting | Default | When to Disable |
|---|---|---|---|
| Correlation ID | CORRELATION_ID_ENABLED | true | If your external proxy already handles trace IDs |
| Performance tracking | PERFORMANCE_TRACKING_ENABLED | true | If using external APM (Datadog, New Relic, etc.) |
| Metrics aggregation | METRICS_AGGREGATION_ENABLED | true | If using external metrics (Prometheus, Grafana) |
| Metrics rollup | METRICS_ROLLUP_ENABLED | true | If raw metrics are exported externally |
| Metrics cleanup | METRICS_CLEANUP_ENABLED | true | Only disable if you manage retention externally |
| Elicitation | MCPGATEWAY_ELICITATION_ENABLED | true | If no MCP clients use elicitation |
| Tool cancellation | MCPGATEWAY_TOOL_CANCELLATION_ENABLED | true | If clients do not cancel in-flight tool calls |
| SSE keepalive | SSE_KEEPALIVE_ENABLED | true | If not using SSE transport |
| Dynamic client registration | DCR_ENABLED | true | If not using OAuth DCR flow |
| OAuth discovery | OAUTH_DISCOVERY_ENABLED | true | If not using OAuth |
Low-impact features (safe to leave enabled)¶
These have negligible runtime cost and are generally worth keeping:
| Feature | Setting | Default | Notes |
|---|---|---|---|
| Security headers | SECURITY_HEADERS_ENABLED | true | Adds static headers, near-zero CPU |
| CORS | CORS_ENABLED | true | Standard cross-origin handling |
| SSRF protection | SSRF_PROTECTION_ENABLED | true | URL validation on registration only |
| Password policy | PASSWORD_POLICY_ENABLED | true | Checked only during password set/change |
| Well-known endpoints | WELL_KNOWN_ENABLED | true | Rarely hit, cached responses |
| Auth cache | AUTH_CACHE_ENABLED | true | Improves performance; do not disable |
| Registry cache | REGISTRY_CACHE_ENABLED | true | Improves performance; do not disable |
| Tool lookup cache | TOOL_LOOKUP_CACHE_ENABLED | true | Improves performance; do not disable |
Deployment profiles¶
MCP-only deployment (maximum MCP protocol throughput):
# Disable everything not needed for MCP tool serving
MCPGATEWAY_UI_ENABLED=false
MCPGATEWAY_ADMIN_API_ENABLED=false
MCPGATEWAY_A2A_ENABLED=false
MCPGATEWAY_CATALOG_ENABLED=false
LLMCHAT_ENABLED=false
PLUGINS_ENABLED=false
OBSERVABILITY_ENABLED=false
ENABLE_METRICS=false
AUDIT_TRAIL_ENABLED=false
SECURITY_LOGGING_ENABLED=false
STRUCTURED_LOGGING_DATABASE_ENABLED=false
CORRELATION_ID_ENABLED=false
DB_METRICS_RECORDING_ENABLED=false
MCPGATEWAY_PERFORMANCE_NET_CONNECTIONS_ENABLED=false
COMPRESSION_ENABLED=false # Let nginx handle compression
DISABLE_ACCESS_LOG=true
# Keep these enabled
AUTH_CACHE_ENABLED=true
REGISTRY_CACHE_ENABLED=true
MCP_SESSION_POOL_ENABLED=true
CACHE_TYPE=redis
Full-featured production (all features, tuned for performance):
# Features enabled
MCPGATEWAY_UI_ENABLED=true
MCPGATEWAY_ADMIN_API_ENABLED=true
MCPGATEWAY_A2A_ENABLED=true
PLUGINS_ENABLED=true
# Expensive features disabled unless needed
OBSERVABILITY_ENABLED=false # Enable only when tracing needed
ENABLE_METRICS=false # Enable with Prometheus
AUDIT_TRAIL_ENABLED=false # Enable for compliance
STRUCTURED_LOGGING_DATABASE_ENABLED=false
SECURITY_LOGGING_ENABLED=false
DB_METRICS_RECORDING_ENABLED=true # Keep for built-in analytics
# Caching maximized
AUTH_CACHE_ENABLED=true
AUTH_CACHE_BATCH_QUERIES=true
REGISTRY_CACHE_ENABLED=true
MCP_SESSION_POOL_ENABLED=true
CACHE_TYPE=redis
Development / debugging (full observability, relaxed performance):
# Everything on for debugging
MCPGATEWAY_UI_ENABLED=true
MCPGATEWAY_ADMIN_API_ENABLED=true
OBSERVABILITY_ENABLED=true
ENABLE_METRICS=true
DB_METRICS_RECORDING_ENABLED=true
DB_QUERY_LOG_ENABLED=true # N+1 detection
STRUCTURED_LOGGING_DATABASE_ENABLED=true
LOG_LEVEL=INFO
CORRELATION_ID_ENABLED=true
PERFORMANCE_TRACKING_ENABLED=true
11 - Security tips while tuning¶
- Never commit real `JWT_SECRET_KEY`, DB passwords, or tokens; use `.env.example` as a template.
- Prefer platform secrets (K8s Secrets, Code Engine secrets) over baking creds into the image.
- If you enable `gevent`/`eventlet`, pin their versions and run bandit or trivy scans.
See Also¶
- Performance Profiling Guide - py-spy, memray, PostgreSQL profiling, MCP bottleneck triage
- Database Performance Guide - N+1 detection, query logging, query counting
- Performance Architecture - MCP request path, caching layers, scaling capacity
- Scaling Guide - Production scaling configuration