Scaling MCP Gateway
Comprehensive guide to scaling MCP Gateway from development to production, covering vertical scaling, horizontal scaling, connection pooling, performance tuning, and Kubernetes deployment strategies.
Overview
MCP Gateway is designed to scale from single-container development environments to distributed multi-node production deployments. This guide covers:
- Vertical Scaling: Optimizing single-instance performance with Gunicorn workers
- Horizontal Scaling: Multi-container deployments with shared state
- Database Optimization: PostgreSQL connection pooling and settings
- Cache Architecture: Redis for distributed caching
- Performance Tuning: Configuration and benchmarking
- Kubernetes Deployment: HPA, resource limits, and best practices
Table of Contents
- Understanding the GIL and Worker Architecture
- Vertical Scaling with Gunicorn
- Future: Python 3.14 and PostgreSQL 18
- Horizontal Scaling with Kubernetes
- Database Connection Pooling
- Redis for Distributed Caching
- Performance Tuning
- Benchmarking and Load Testing
- Health Checks and Readiness
- Stateless Architecture and Long-Running Connections
- Kubernetes Production Deployment
- Monitoring and Observability
1. Understanding the GIL and Worker Architecture
The Python Global Interpreter Lock (GIL)
Python's Global Interpreter Lock (GIL) prevents multiple native threads from executing Python bytecode simultaneously. This means:
- Single worker = Single CPU core usage (even on multi-core systems)
- I/O-bound workloads (API calls, database queries) benefit from async/await
- CPU-bound workloads (JSON parsing, encryption) require multiple processes
Pydantic v2: Rust-Powered Performance
MCP Gateway leverages Pydantic v2.11+ for all request/response validation and schema definitions. Unlike pure Python libraries, Pydantic v2 includes a Rust-based core (pydantic-core) that significantly improves performance.
Performance benefits:
- 5-50x faster validation compared to Pydantic v1
- JSON parsing in Rust (bypasses the GIL for serialization/deserialization)
- Schema validation runs in compiled Rust code
- Reduced CPU overhead for request processing
Impact on scaling:
- 5,463 lines of Pydantic schemas (mcpgateway/schemas.py)
- Every API request is validated through Rust-optimized code
- Lower CPU usage per request = higher throughput per worker
- Rust components release the GIL during execution
This means that even within a single worker process, Pydantic's Rust core can run concurrently with Python code for validation-heavy workloads.
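For illustration, a minimal sketch of Pydantic v2 validating a JSON-RPC payload. ToolInvocation is a hypothetical model for this example, not one of the gateway's actual schemas:

# Hypothetical model for illustration - not the gateway's real schema
from pydantic import BaseModel

class ToolInvocation(BaseModel):
    jsonrpc: str
    id: int
    method: str
    params: dict = {}

# model_validate_json parses and validates inside pydantic-core (Rust),
# avoiding a separate json.loads() pass in Python
req = ToolInvocation.model_validate_json(
    '{"jsonrpc": "2.0", "id": 1, "method": "tools/list", "params": {}}'
)
print(req.method)  # tools/list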
MCP Gateway's Solution: Gunicorn with Multiple Workers
MCP Gateway uses Gunicorn with UvicornWorker to spawn multiple worker processes:
# gunicorn.config.py
workers = 8 # Multiple processes bypass the GIL
worker_class = "uvicorn.workers.UvicornWorker" # Async support
timeout = 600 # 10-minute timeout for long-running operations
preload_app = True # Load app once, then fork (memory efficient)
Key benefits:
- Each worker is a separate process with its own GIL
- 8 workers = ability to use 8 CPU cores
- UvicornWorker enables async I/O within each worker
- Preloading reduces memory footprint (shared code segments)
The trade-off is that you are running multiple Python interpreter instances, each of which consumes additional memory, and coordinating them requires shared state (e.g. Redis or a database).
2. Vertical Scaling with Gunicorn
Worker Count Calculation
Formula: workers = (2 × CPU_cores) + 1
Examples:
| CPU Cores | Recommended Workers | Use Case |
|---|---|---|
| 1 | 2-3 | Development/testing |
| 2 | 4-5 | Small production |
| 4 | 8-9 | Medium production |
| 8 | 16-17 | Large production |
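A quick sketch of the formula above in Python, the way a gunicorn.config.py might compute it; recommended_workers is an illustrative helper, not part of the gateway:

import os

def recommended_workers() -> int:
    """Apply the (2 × CPU_cores) + 1 heuristic."""
    cores = os.cpu_count() or 1   # cpu_count() can return None
    return (2 * cores) + 1

print(recommended_workers())  # e.g. 9 on a 4-core machine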
Configuration Methods
Environment Variables
# Automatic detection based on CPU cores
export GUNICORN_WORKERS=auto
# Manual override
export GUNICORN_WORKERS=16
export GUNICORN_TIMEOUT=600
export GUNICORN_MAX_REQUESTS=100000
export GUNICORN_MAX_REQUESTS_JITTER=100
export GUNICORN_PRELOAD_APP=true
Kubernetes ConfigMap
# charts/mcp-stack/values.yaml
mcpContextForge:
  config:
    GUNICORN_WORKERS: "16"               # Number of worker processes
    GUNICORN_TIMEOUT: "600"              # Worker timeout (seconds)
    GUNICORN_MAX_REQUESTS: "100000"      # Requests before worker restart
    GUNICORN_MAX_REQUESTS_JITTER: "100"  # Prevents thundering herd
    GUNICORN_PRELOAD_APP: "true"         # Memory optimization
Resource Allocation
CPU: Allocate 1 CPU core per 2 workers (allows for I/O wait)
Memory:
- Base: 256MB
- Per worker: 128-256MB (depending on workload)
- Formula: memory = 256 + (workers × 200) MB
Example for 16 workers:
- CPU: 8-10 cores (allows headroom)
- Memory: 3.5-4 GB (256 + 16×200 ≈ 3.5GB)
# Kubernetes resource limits
resources:
  limits:
    cpu: 10000m      # 10 cores
    memory: 4Gi
  requests:
    cpu: 8000m       # 8 cores
    memory: 3584Mi   # 3.5GB
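The memory formula as one illustrative calculation (values approximate; actual per-worker memory depends on workload):

workers = 16
base_mb = 256
per_worker_mb = 200
total_mb = base_mb + workers * per_worker_mb   # 3456 MB ≈ 3.4 GB
print(f"{total_mb} MB")  # request ~3.5 GB, limit 4 GB for headroom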
3. Future: Python 3.14 and PostgreSQL 18
Python 3.14 (Free-Threaded Mode)
Status: Beta (as of July 2025) - PEP 703
Python 3.14 introduces optional free-threading (GIL removal), a groundbreaking change that enables true parallel multi-threading:
# Enable free-threading mode
python3.14 -X gil=0 -m gunicorn ...
# Or use PYTHON_GIL environment variable
PYTHON_GIL=0 python3.14 -m gunicorn ...
Performance characteristics:
| Workload Type | Expected Impact |
|---|---|
| Single-threaded | 3-15% slower (overhead from thread-safety mechanisms) |
| Multi-threaded (I/O-bound) | Minimal impact (already benefits from async/await) |
| Multi-threaded (CPU-bound) | Near-linear scaling with CPU cores |
| Multi-process (current) | No change (already bypasses GIL) |
Benefits when available:
- True parallel threads: Multiple threads execute Python code simultaneously
- Lower memory overhead: Threads share memory (vs. separate processes)
- Faster inter-thread communication: Shared memory, no IPC overhead
- Better resource efficiency: One interpreter instance instead of multiple processes
Trade-offs:
- Single-threaded penalty: 3-15% slower due to fine-grained locking
- Library compatibility: Some C extensions need updates (most popular libraries are already compatible)
- Different scaling model: Move from workers=16 to workers=2 --threads=32
Migration strategy:
1. Now (Python 3.11-3.13): Continue using multi-process Gunicorn
2. Python 3.14 beta: Test in staging environments
3. Python 3.14 stable: Evaluate a hybrid approach
4. Post-migration: Thread-based scaling
Current recommendation:
- Production: Use Python 3.11-3.13 with multi-process Gunicorn (proven, stable)
- Testing: Experiment with Python 3.14 beta in non-production environments
- Monitoring: Watch for library compatibility announcements
Why MCP Gateway is well-positioned for free-threading:
MCP Gateway's architecture already benefits from components that will perform even better with Python 3.14:
- Pydantic v2 Rust core: Already bypasses GIL for validation - will work seamlessly with free-threading
- FastAPI/Uvicorn: Built for async I/O - natural fit for thread-based concurrency
- SQLAlchemy async: Database operations already non-blocking
- Stateless design: No shared mutable state between requests
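As a small, hedged sketch: CPython 3.13+ exposes sys._is_gil_enabled() for checking the interpreter mode at runtime; on older versions the GIL is always on, so we assume True:

import sys

def gil_enabled() -> bool:
    check = getattr(sys, "_is_gil_enabled", None)  # absent before 3.13
    return check() if check is not None else True

print("GIL enabled:", gil_enabled())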
Resources:
- Python 3.14 Free-Threading Guide
- PEP 703: Making the GIL Optional
- Python 3.14 Release Schedule
- Pydantic v2 Performance
PostgreSQL 18 (Async I/O)
Status: Development (expected 2025)
PostgreSQL 18 introduces native async I/O:
- Improved connection handling: Better async query performance
- Reduced latency: Non-blocking I/O operations
- Better scalability: Efficient connection multiplexing
Current recommendation: PostgreSQL 16+ (stable async support via asyncpg)
4. Horizontal Scaling with Kubernetes
Architecture Overview
┌─────────────────────────────────────────────────────────────┐
│                        Load Balancer                        │
│                    (Kubernetes Service)                     │
└────────────┬─────────────────────────────────┬──────────────┘
             │                                 │
    ┌────────▼──────────┐             ┌────────▼──────────┐
    │  Gateway Pod 1    │             │  Gateway Pod 2    │
    │  (8 workers)      │             │  (8 workers)      │
    └────────┬──────────┘             └────────┬──────────┘
             │                                 │
             └──────────────┬──────────────────┘
                            │
            ┌───────────────┴────────────────┐
            │                                │
     ┌──────▼──────┐                ┌────────▼───────┐
     │ PostgreSQL  │                │     Redis      │
     │  (shared)   │                │    (shared)    │
     └─────────────┘                └────────────────┘
Shared State Requirements
For multi-pod deployments:
- Shared PostgreSQL: All data (servers, tools, users, teams)
- Shared Redis: Distributed caching and session management
- Stateless pods: No local state, can be killed/restarted anytime
Kubernetes Deployment
Helm Chart Configuration
# charts/mcp-stack/values.yaml
mcpContextForge:
  replicaCount: 3                           # Start with 3 pods

  # Horizontal Pod Autoscaler
  hpa:
    enabled: true
    minReplicas: 3                          # Never scale below 3
    maxReplicas: 20                         # Scale up to 20 pods
    targetCPUUtilizationPercentage: 70      # Scale at 70% CPU
    targetMemoryUtilizationPercentage: 80   # Scale at 80% memory

  # Pod resources
  resources:
    limits:
      cpu: 2000m                            # 2 cores per pod
      memory: 4Gi
    requests:
      cpu: 1000m                            # 1 core per pod
      memory: 2Gi

  # Environment configuration
  config:
    GUNICORN_WORKERS: "8"                   # 8 workers per pod
    CACHE_TYPE: redis                       # Shared cache
    DB_POOL_SIZE: "50"                      # Per-pod pool size

# Shared PostgreSQL
postgres:
  enabled: true
  resources:
    limits:
      cpu: 4000m                            # 4 cores
      memory: 8Gi
    requests:
      cpu: 2000m
      memory: 4Gi
  # Important: Set max_connections
  # Formula: (num_pods × DB_POOL_SIZE × 1.2) + 20
  # Example: (20 pods × 50 pool × 1.2) + 20 = 1220
  config:
    max_connections: 1500                   # Adjust based on scale

# Shared Redis
redis:
  enabled: true
  resources:
    limits:
      cpu: 2000m
      memory: 4Gi
    requests:
      cpu: 1000m
      memory: 2Gi
Deploy with Helm
# Install/upgrade with custom values
helm upgrade --install mcp-stack ./charts/mcp-stack \
--namespace mcp-gateway \
--create-namespace \
--values production-values.yaml
# Verify HPA
kubectl get hpa -n mcp-gateway
Horizontal Scaling Calculation
Total capacity = pods × workers × requests_per_second
Example:
- 10 pods × 8 workers × 100 RPS = 8,000 RPS
Database connections needed:
- 10 pods × 50 pool size = 500 connections
- Add 20% overhead = 600 connections
- Set max_connections=1000 (buffer for maintenance)
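The same arithmetic as a runnable sketch (all numbers illustrative, not measured):

pods, workers, rps_per_worker = 10, 8, 100
total_rps = pods * workers * rps_per_worker   # 8,000 RPS
pool_size = 50
connections = pods * pool_size                # 500 connections
with_overhead = int(connections * 1.2)        # 600 with 20% overhead
print(total_rps, with_overhead)               # leave buffer: max_connections=1000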
5. Database Connection Pooling
Connection Pool Architecture
SQLAlchemy manages a connection pool per process:
Pod 1 (8 workers) → 8 connection pools → PostgreSQL
Pod 2 (8 workers) → 8 connection pools → PostgreSQL
Pod N (8 workers) → 8 connection pools → PostgreSQL
Pool Configuration
Environment Variables
# Connection pool settings
DB_POOL_SIZE=50 # Persistent connections per worker
DB_MAX_OVERFLOW=10 # Additional connections allowed
DB_POOL_TIMEOUT=60 # Wait time before timeout (seconds)
DB_POOL_RECYCLE=3600 # Recycle connections after 1 hour
DB_MAX_RETRIES=5 # Retry attempts on failure
DB_RETRY_INTERVAL_MS=2000 # Retry interval
Configuration in Code
# mcpgateway/config.py
@property
def database_settings(self) -> dict:
    return {
        "pool_size": self.db_pool_size,          # 50
        "max_overflow": self.db_max_overflow,    # 10
        "pool_timeout": self.db_pool_timeout,    # 60s
        "pool_recycle": self.db_pool_recycle,    # 3600s
    }
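A hedged sketch of how these settings typically map onto SQLAlchemy's create_engine; the DSN is a placeholder and the gateway's actual wiring may differ:

from sqlalchemy import create_engine

engine = create_engine(
    "postgresql+psycopg2://user:pass@postgres:5432/postgresdb",  # placeholder DSN
    pool_size=50,        # DB_POOL_SIZE
    max_overflow=10,     # DB_MAX_OVERFLOW
    pool_timeout=60,     # DB_POOL_TIMEOUT (seconds)
    pool_recycle=3600,   # DB_POOL_RECYCLE (seconds)
    pool_pre_ping=True,  # assumption: validate connections before use
)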
PostgreSQL Configuration
Calculate max_connections
# Formula
max_connections = (num_pods × num_workers × pool_size × 1.2) + buffer
# Example: 10 pods, 8 workers, 50 pool size
max_connections = (10 × 8 × 50 × 1.2) + 200 = 5000 connections
PostgreSQL Configuration File
# postgresql.conf
max_connections = 5000
shared_buffers = 16GB # 25% of RAM
effective_cache_size = 48GB # 75% of RAM
work_mem = 16MB # Per operation
maintenance_work_mem = 2GB
Managed Services
IBM Cloud Databases for PostgreSQL:
# Increase max_connections via CLI
ibmcloud cdb deployment-configuration postgres \
--configuration max_connections=5000
AWS RDS: adjust max_connections through a custom DB parameter group (the default is derived from instance memory).
Google Cloud SQL: set max_connections as a database flag on the instance.
Connection Pool Monitoring
# Health endpoint checks pool status
@app.get("/health")
async def healthcheck(db: Session = Depends(get_db)):
    try:
        db.execute(text("SELECT 1"))
        return {"status": "healthy"}
    except Exception as e:
        return {"status": "unhealthy", "error": str(e)}
# Check PostgreSQL connections
kubectl exec -it postgres-pod -- psql -U admin -d postgresdb \
-c "SELECT count(*) FROM pg_stat_activity;"
6. Redis for Distributed Caching
Architecture
Redis provides shared state across all Gateway pods:
- Session storage: User sessions (TTL: 3600s)
- Message cache: Ephemeral data (TTL: 600s)
- Federation cache: Gateway peer discovery
Configuration
Enable Redis Caching
# .env or Kubernetes ConfigMap
CACHE_TYPE=redis
REDIS_URL=redis://redis-service:6379/0
CACHE_PREFIX=mcpgw:
SESSION_TTL=3600
MESSAGE_TTL=600
REDIS_MAX_RETRIES=3
REDIS_RETRY_INTERVAL_MS=2000
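A minimal sketch of what these settings imply at the client level, using the redis-py client; key names are illustrative:

import redis

r = redis.Redis.from_url("redis://redis-service:6379/0")
r.setex("mcpgw:session:abc123", 3600, "session-payload")  # SESSION_TTL
r.setex("mcpgw:message:42", 600, "message-payload")       # MESSAGE_TTL
print(r.ttl("mcpgw:session:abc123"))                      # seconds remaining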
Kubernetes Deployment
# charts/mcp-stack/values.yaml
redis:
  enabled: true
  resources:
    limits:
      cpu: 2000m
      memory: 4Gi
    requests:
      cpu: 1000m
      memory: 2Gi
  # Enable persistence
  persistence:
    enabled: true
    size: 10Gi
Redis Sizing
Memory calculation:
- Sessions: concurrent_users × 50KB
- Messages: messages_per_minute × 100KB × (TTL/60)
Example:
- 10,000 users × 50KB = 500MB
- 1,000 msg/min × 100KB × 10 min = 1GB
- Total: 1.5GB + 50% overhead ≈ 2.5GB
High Availability
Redis Sentinel (3+ nodes): automatic failover for a primary/replica setup.
Redis Cluster (6+ nodes): sharding plus failover for larger datasets.
7. Performance Tuning
Application Architecture Performance
MCP Gateway's technology stack is optimized for high performance:
Rust-Powered Components:
- Pydantic v2 (5-50x faster validation via Rust core)
- Uvicorn (ASGI server with httptools, a fast C-based HTTP parser)
Async-First Design:
- FastAPI (async request handling)
- SQLAlchemy 2.0 (async database operations)
- asyncio event loop per worker
Performance characteristics:
- Request validation: < 1ms (Pydantic v2 Rust core)
- JSON serialization: 3-5x faster than pure Python
- Database queries: non-blocking async I/O
- Concurrent requests per worker: 1000+ (async event loop)
System-Level Optimization
Kernel Parameters
# /etc/sysctl.conf
net.core.somaxconn=4096
net.ipv4.tcp_max_syn_backlog=4096
net.ipv4.ip_local_port_range=1024 65535
net.ipv4.tcp_tw_reuse=1
fs.file-max=2097152
# Apply changes
sysctl -p
File Descriptors
Raise the open-file limit so each worker can hold thousands of concurrent sockets, e.g. ulimit -n 65536 (or LimitNOFILE= in a systemd unit).
Gunicorn Tuning
Optimal Settings
# gunicorn.config.py
import multiprocessing

workers = (multiprocessing.cpu_count() * 2) + 1
timeout = 600               # Long enough for LLM calls
max_requests = 100000       # Prevent memory leaks
max_requests_jitter = 100   # Randomize restart
preload_app = True          # Reduce memory
reuse_port = True           # Load balance across workers
Worker Class Selection
UvicornWorker (default - best for async): worker_class = "uvicorn.workers.UvicornWorker"
Gevent (alternative for I/O-heavy): worker_class = "gevent"
Application Tuning
# Resource limits
TOOL_TIMEOUT=60
TOOL_CONCURRENT_LIMIT=10
RESOURCE_CACHE_SIZE=1000
RESOURCE_CACHE_TTL=3600
# Retry configuration
RETRY_MAX_ATTEMPTS=3
RETRY_BASE_DELAY=1.0
RETRY_MAX_DELAY=60
# Health check intervals
HEALTH_CHECK_INTERVAL=60
HEALTH_CHECK_TIMEOUT=10
UNHEALTHY_THRESHOLD=3
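A hedged sketch of exponential-backoff retries matching the RETRY_* values above; call_with_retries is illustrative, not the gateway's actual helper:

import asyncio
import random

async def call_with_retries(fn, attempts=3, base=1.0, cap=60.0):
    """Retry an async callable with exponential backoff and jitter."""
    for attempt in range(attempts):
        try:
            return await fn()
        except Exception:
            if attempt == attempts - 1:
                raise                                     # out of attempts
            delay = min(cap, base * (2 ** attempt))
            await asyncio.sleep(delay + random.random())  # add jitter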
8. Benchmarking and Load Testing
Tools
hey - HTTP load generator
# Install
brew install hey # macOS
sudo apt install hey # Ubuntu
# Or from source
go install github.com/rakyll/hey@latest
k6 - Modern load testing
# Install
brew install k6           # macOS
docker pull grafana/k6    # or run via Docker
Baseline Test
Prepare Environment
# Get JWT token
export MCPGATEWAY_BEARER_TOKEN=$(python3 -m mcpgateway.utils.create_jwt_token \
--username admin@example.com --exp 0 --secret my-test-key)
# Create test payload
cat > payload.json <<EOF
{
"jsonrpc": "2.0",
"id": 1,
"method": "tools/list",
"params": {}
}
EOF
Run Load Test
#!/bin/bash
# test-load.sh
# Test parameters
REQUESTS=10000
CONCURRENCY=200
URL="http://localhost:4444/"
# Run test
hey -n $REQUESTS -c $CONCURRENCY \
-m POST \
-T application/json \
-H "Authorization: Bearer $MCPGATEWAY_BEARER_TOKEN" \
-D payload.json \
$URL
Interpret Results
Summary:
  Total:        5.2341 secs
  Slowest:      0.5234 secs
  Fastest:      0.0123 secs
  Average:      0.1045 secs
  Requests/sec: 1910.5623   ← Target metric

Status code distribution:
  [200] 10000 responses

Response time histogram:
  0.012 [1]    |
  0.050 [2341] |■■■■■■■■■■■
  0.100 [4523] |■■■■■■■■■■■■■■■■■■■■■■
  0.150 [2234] |■■■■■■■■■■■
  0.200 [901]  |■■■■
  0.250 [0]    |
Key metrics:
- Requests/sec: Throughput (target: >1000 RPS per pod)
- P99 latency: 99th percentile (target: <500ms)
- Error rate: 5xx responses (target: <0.1%)
Kubernetes Load Test
# Deploy test pod
kubectl run load-test --image=williamyeh/hey:latest \
--rm -it --restart=Never -- \
-n 100000 -c 500 \
-H "Authorization: Bearer $TOKEN" \
http://mcp-gateway-service/
Advanced: k6 Script
// load-test.k6.js
import http from 'k6/http';
import { check } from 'k6';
export let options = {
  stages: [
    { duration: '2m', target: 100 },  // Ramp up
    { duration: '5m', target: 100 },  // Sustained
    { duration: '2m', target: 500 },  // Spike
    { duration: '5m', target: 500 },  // High load
    { duration: '2m', target: 0 },    // Ramp down
  ],
  thresholds: {
    http_req_duration: ['p(99)<500'],  // 99% < 500ms
    http_req_failed: ['rate<0.01'],    // <1% errors
  },
};

export default function () {
  const payload = JSON.stringify({
    jsonrpc: '2.0',
    id: 1,
    method: 'tools/list',
    params: {},
  });
  const res = http.post('http://localhost:4444/', payload, {
    headers: {
      'Content-Type': 'application/json',
      'Authorization': `Bearer ${__ENV.TOKEN}`,
    },
  });
  check(res, {
    'status is 200': (r) => r.status === 200,
    'response time < 500ms': (r) => r.timings.duration < 500,
  });
}
9. Health Checks and Readiness
Health Check Endpoints
MCP Gateway provides two health endpoints:
Liveness Probe: /health
Purpose: Is the application alive?
@app.get("/health")
async def healthcheck(db: Session = Depends(get_db)):
"""Check database connectivity"""
try:
db.execute(text("SELECT 1"))
return {"status": "healthy"}
except Exception as e:
return {"status": "unhealthy", "error": str(e)}
Response: {"status": "healthy"}
Readiness Probe: /ready
Purpose: Is the application ready to receive traffic?
@app.get("/ready")
async def readiness_check(db: Session = Depends(get_db)):
"""Check if ready to serve traffic"""
try:
await asyncio.to_thread(db.execute, text("SELECT 1"))
return JSONResponse({"status": "ready"}, status_code=200)
except Exception as e:
return JSONResponse(
{"status": "not ready", "error": str(e)},
status_code=503
)
Kubernetes Probe Configuration
# charts/mcp-stack/templates/deployment-mcpgateway.yaml
containers:
  - name: mcp-context-forge
    # Startup probe (initial readiness)
    startupProbe:
      exec:
        command:
          - python3
          - /app/mcpgateway/utils/db_isready.py
          - --max-tries=1
          - --timeout=2
      initialDelaySeconds: 10
      periodSeconds: 5
      failureThreshold: 60    # 5 minutes max
    # Readiness probe (traffic routing)
    readinessProbe:
      httpGet:
        path: /ready
        port: 4444
      initialDelaySeconds: 15
      periodSeconds: 10
      timeoutSeconds: 2
      successThreshold: 1
      failureThreshold: 3
    # Liveness probe (restart if unhealthy)
    livenessProbe:
      httpGet:
        path: /health
        port: 4444
      initialDelaySeconds: 10
      periodSeconds: 15
      timeoutSeconds: 2
      successThreshold: 1
      failureThreshold: 3
Probe Tuning Guidelines
Startup Probe:
- Use for slow initialization (database migrations, model loading)
- failureThreshold × periodSeconds = max startup time
- Example: 60 × 5s = 5 minutes
Readiness Probe:
- Aggressive: remove the pod from the load balancer quickly
- failureThreshold = 3 (fail fast)
- periodSeconds = 10 (frequent checks)
Liveness Probe:
- Conservative: avoid unnecessary restarts
- failureThreshold = 5 (tolerate transient issues)
- periodSeconds = 15 (less frequent)
Monitoring Health
# Check pod health
kubectl get pods -n mcp-gateway
# Detailed status
kubectl describe pod <pod-name> -n mcp-gateway
# Check readiness
kubectl get pods -n mcp-gateway \
-o jsonpath='{.items[*].status.conditions[?(@.type=="Ready")].status}'
# Test health endpoint
kubectl exec -it <pod-name> -n mcp-gateway -- \
curl http://localhost:4444/health
# View probe failures
kubectl get events -n mcp-gateway \
--field-selector involvedObject.name=<pod-name>
10. Stateless Architecture and Long-Running Connections
Stateless Design Principles
MCP Gateway is designed to be stateless, enabling horizontal scaling:
- No local session storage: All sessions in Redis
- No in-memory caching (in production): Use Redis
- Database-backed state: All data in PostgreSQL
- Shared configuration: Environment variables via ConfigMap
Session Management
Stateful Sessions (Not Recommended for Scale)
Limitations:
- Sessions tied to specific pods
- Requires sticky sessions (session affinity)
- Doesn't scale horizontally
Stateless Sessions (Recommended)
Benefits:
- Any pod can handle any request
- True horizontal scaling
- Automatic failover
Long-Running Connections
MCP Gateway supports long-running connections for streaming:
Server-Sent Events (SSE)
# Endpoint: /servers/{id}/sse
@app.get("/servers/{server_id}/sse")
async def sse_endpoint(server_id: int):
    """Stream events to client"""
    # Connection can last minutes/hours
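A fuller, hedged sketch of what an SSE endpoint with keepalives can look like in FastAPI; the event source and keepalive cadence are illustrative, not the gateway's actual implementation:

import asyncio
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

@app.get("/servers/{server_id}/sse")
async def sse_endpoint(server_id: int):
    async def event_stream():
        while True:
            yield ": keepalive\n\n"   # SSE comment frame keeps proxies from timing out
            await asyncio.sleep(30)   # e.g. SSE_KEEPALIVE_INTERVAL=30
    return StreamingResponse(event_stream(), media_type="text/event-stream")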
WebSocket
# Endpoint: /servers/{id}/ws
@app.websocket("/servers/{server_id}/ws")
async def websocket_endpoint(server_id: int):
    """Bidirectional streaming"""
Load Balancer Configuration
Kubernetes Service (default):
# Distributes connections across pods
apiVersion: v1
kind: Service
metadata:
  name: mcp-gateway-service
spec:
  type: ClusterIP
  sessionAffinity: None   # No sticky sessions
  ports:
    - port: 80
      targetPort: 4444
NGINX Ingress (for WebSocket):
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "3600"
    nginx.ingress.kubernetes.io/websocket-services: "mcp-gateway-service"
spec:
  rules:
    - host: gateway.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: mcp-gateway-service
                port:
                  number: 80
Connection Lifecycle
Client → Load Balancer → Pod A (SSE stream)
                          ↓
                    (Pod A dies)
                          ↓
Client → Load Balancer → Pod B (reconnect)
Best practices:
1. Client implements reconnection logic
2. Server sets SSE_KEEPALIVE_INTERVAL=30 (keepalive events)
3. Load balancer timeout > keepalive interval
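A minimal client-side reconnect sketch using httpx; the URL and backoff policy are illustrative:

import time
import httpx

def consume_sse(url: str) -> None:
    while True:
        try:
            with httpx.stream("GET", url, timeout=None) as resp:
                for line in resp.iter_lines():
                    if line and not line.startswith(":"):  # skip keepalive frames
                        print(line)
        except httpx.HTTPError:
            time.sleep(2)  # brief backoff, then reconnect to whichever pod answers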
11. Kubernetes Production Deployment
Reference Architecture
# production-values.yaml
mcpContextForge:
  # --- Scaling ---
  replicaCount: 5
  hpa:
    enabled: true
    minReplicas: 5
    maxReplicas: 50
    targetCPUUtilizationPercentage: 70
    targetMemoryUtilizationPercentage: 80

  # --- Resources ---
  resources:
    limits:
      cpu: 4000m        # 4 cores per pod
      memory: 8Gi
    requests:
      cpu: 2000m        # 2 cores per pod
      memory: 4Gi

  # --- Configuration ---
  config:
    # Gunicorn
    GUNICORN_WORKERS: "16"
    GUNICORN_TIMEOUT: "600"
    GUNICORN_MAX_REQUESTS: "100000"
    GUNICORN_PRELOAD_APP: "true"
    # Database
    DB_POOL_SIZE: "50"
    DB_MAX_OVERFLOW: "10"
    DB_POOL_TIMEOUT: "60"
    DB_POOL_RECYCLE: "3600"
    # Cache
    CACHE_TYPE: redis
    CACHE_PREFIX: mcpgw:
    SESSION_TTL: "3600"
    MESSAGE_TTL: "600"
    # Performance
    TOOL_CONCURRENT_LIMIT: "20"
    RESOURCE_CACHE_SIZE: "2000"

  # --- Health Checks ---
  probes:
    startup:
      type: exec
      command: ["python3", "/app/mcpgateway/utils/db_isready.py"]
      periodSeconds: 5
      failureThreshold: 60
    readiness:
      type: http
      path: /ready
      port: 4444
      periodSeconds: 10
      failureThreshold: 3
    liveness:
      type: http
      path: /health
      port: 4444
      periodSeconds: 15
      failureThreshold: 5

# --- PostgreSQL ---
postgres:
  enabled: true
  resources:
    limits:
      cpu: 8000m        # 8 cores
      memory: 32Gi
    requests:
      cpu: 4000m
      memory: 16Gi
  persistence:
    enabled: true
    size: 100Gi
    storageClassName: fast-ssd
  # Connection limits
  # max_connections = (50 pods × 16 workers × 50 pool × 1.2) + 200
  config:
    max_connections: 50000
    shared_buffers: 8GB
    effective_cache_size: 24GB
    work_mem: 32MB

# --- Redis ---
redis:
  enabled: true
  resources:
    limits:
      cpu: 4000m
      memory: 16Gi
    requests:
      cpu: 2000m
      memory: 8Gi
  persistence:
    enabled: true
    size: 50Gi
Deployment Steps
# 1. Create namespace
kubectl create namespace mcp-gateway
# 2. Create secrets
kubectl create secret generic mcp-secrets \
-n mcp-gateway \
--from-literal=JWT_SECRET_KEY=$(openssl rand -hex 32) \
--from-literal=AUTH_ENCRYPTION_SECRET=$(openssl rand -hex 32) \
--from-literal=POSTGRES_PASSWORD=$(openssl rand -base64 32)
# 3. Install with Helm
helm upgrade --install mcp-stack ./charts/mcp-stack \
-n mcp-gateway \
-f production-values.yaml \
--wait \
--timeout 10m
# 4. Verify deployment
kubectl get pods -n mcp-gateway
kubectl get hpa -n mcp-gateway
kubectl get svc -n mcp-gateway
# 5. Run migration job
kubectl get jobs -n mcp-gateway
# 6. Test scaling
kubectl top pods -n mcp-gateway
Pod Disruption Budget
# pdb.yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: mcp-gateway-pdb
  namespace: mcp-gateway
spec:
  minAvailable: 3   # Keep 3 pods always running
  selector:
    matchLabels:
      app: mcp-gateway
Network Policies
# network-policy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: mcp-gateway-policy
  namespace: mcp-gateway
spec:
  podSelector:
    matchLabels:
      app: mcp-gateway
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: ingress-nginx
      ports:
        - protocol: TCP
          port: 4444
  egress:
    - to:
        - podSelector:
            matchLabels:
              app: postgres
      ports:
        - protocol: TCP
          port: 5432
    - to:
        - podSelector:
            matchLabels:
              app: redis
      ports:
        - protocol: TCP
          port: 6379
12. Monitoring and Observability
OpenTelemetry Integration
MCP Gateway includes built-in OpenTelemetry support:
# Enable observability
OTEL_ENABLE_OBSERVABILITY=true
OTEL_TRACES_EXPORTER=otlp
OTEL_EXPORTER_OTLP_ENDPOINT=http://collector:4317
OTEL_SERVICE_NAME=mcp-gateway
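A hedged sketch of the equivalent manual tracer setup with the OpenTelemetry Python SDK; the gateway normally configures this from the environment variables above:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://collector:4317"))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("mcp-gateway")
with tracer.start_as_current_span("example-span"):
    pass  # spans are batched and exported to the OTLP collector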
Prometheus Metrics
Deploy Prometheus stack:
# Add Prometheus Helm repo
helm repo add prometheus-community \
https://prometheus-community.github.io/helm-charts
# Install kube-prometheus-stack
helm install prometheus prometheus-community/kube-prometheus-stack \
-n monitoring \
--create-namespace
Key Metrics to Monitor
Application Metrics:
- Request rate: rate(http_requests_total[1m])
- Latency: histogram_quantile(0.99, http_request_duration_seconds)
- Error rate: rate(http_requests_total{status=~"5.."}[1m])
System Metrics:
- CPU usage: container_cpu_usage_seconds_total
- Memory usage: container_memory_working_set_bytes
- Network I/O: container_network_receive_bytes_total
Database Metrics:
- Connection pool utilization: db_pool_connections_active / db_pool_size
- Query latency: db_query_duration_seconds
- Deadlocks: pg_stat_database_deadlocks
HPA Metrics: verify scaling behaviour with kubectl get hpa -n mcp-gateway.
Grafana Dashboards
Import dashboards:
1. Kubernetes Cluster Monitoring (ID: 7249)
2. PostgreSQL (ID: 9628)
3. Redis (ID: 11835)
4. NGINX Ingress (ID: 9614)
Alerting Rules
# prometheus-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: mcp-gateway-alerts
  namespace: monitoring
spec:
  groups:
    - name: mcp-gateway
      interval: 30s
      rules:
        - alert: HighErrorRate
          expr: |
            rate(http_requests_total{status=~"5..", namespace="mcp-gateway"}[5m]) > 0.05
          for: 5m
          annotations:
            summary: "High error rate detected"
        - alert: HighLatency
          expr: |
            histogram_quantile(0.99,
              rate(http_request_duration_seconds_bucket[5m])) > 1
          for: 5m
          annotations:
            summary: "P99 latency exceeds 1s"
        - alert: DatabaseConnectionPoolExhausted
          expr: |
            db_pool_connections_active / db_pool_size > 0.9
          for: 2m
          annotations:
            summary: "Database connection pool >90% utilized"
Summary and Checklist
Performance Technology Stack
MCP Gateway is built on a high-performance foundation:
- ✅ Pydantic v2.11+ - Rust-powered validation (5-50x faster than v1)
- ✅ FastAPI - Modern async framework with OpenAPI support
- ✅ Uvicorn - ASGI server with fast HTTP parsing (httptools)
- ✅ SQLAlchemy 2.0 - Async database operations
- ✅ Python 3.11+ - Current stable with excellent performance
- 🔮 Python 3.14 - Future free-threading support (beta)
Scaling Checklist
- Vertical Scaling
    - Configure Gunicorn workers: (2 × CPU) + 1
    - Allocate CPU: 1 core per 2 workers
    - Allocate memory: 256MB + (workers × 200MB)
- Horizontal Scaling
    - Deploy to Kubernetes with HPA enabled
    - Set minReplicas ≥ 3 for high availability
    - Configure shared PostgreSQL and Redis
- Database Optimization
    - Calculate max_connections: (pods × workers × pool) × 1.2
    - Set DB_POOL_SIZE per worker (recommended: 50)
    - Configure DB_POOL_RECYCLE=3600 to prevent stale connections
- Caching
    - Enable Redis: CACHE_TYPE=redis
    - Set REDIS_URL to shared Redis instance
    - Configure TTLs: SESSION_TTL=3600, MESSAGE_TTL=600
- Performance
    - Tune Gunicorn: GUNICORN_PRELOAD_APP=true
    - Set timeouts: GUNICORN_TIMEOUT=600
    - Configure retries: RETRY_MAX_ATTEMPTS=3
- Health Checks
    - Configure /health liveness probe
    - Configure /ready readiness probe
    - Set appropriate thresholds and timeouts
- Monitoring
    - Enable OpenTelemetry: OTEL_ENABLE_OBSERVABILITY=true
    - Deploy Prometheus and Grafana
    - Configure alerts for errors, latency, and resources
- Load Testing
    - Benchmark with hey or k6
    - Target: >1000 RPS per pod, P99 <500ms
    - Test failover scenarios
Reference Documentation
- Gunicorn Configuration
- Kubernetes Deployment
- Helm Charts
- Performance Testing
- Observability
- Configuration Guide
- Database Tuning
Additional Resources
External Links
- Gunicorn Documentation
- Kubernetes HPA
- PostgreSQL Connection Pooling
- Redis Cluster
- OpenTelemetry Python
Community
Last updated: 2025-10-02