
Scaling MCP Gateway

Comprehensive guide to scaling MCP Gateway from development to production, covering vertical scaling, horizontal scaling, connection pooling, performance tuning, and Kubernetes deployment strategies.

Overview

MCP Gateway is designed to scale from single-container development environments to distributed multi-node production deployments. This guide covers:

  • Vertical Scaling: Optimizing single-instance performance with Gunicorn workers
  • Horizontal Scaling: Multi-container deployments with shared state
  • Database Optimization: PostgreSQL connection pooling and settings
  • Cache Architecture: Redis for distributed caching
  • Performance Tuning: Configuration and benchmarking
  • Kubernetes Deployment: HPA, resource limits, and best practices

Table of Contents

  1. Understanding the GIL and Worker Architecture
  2. Vertical Scaling with Gunicorn
  3. Future: Python 3.14 and PostgreSQL 18
  4. Horizontal Scaling with Kubernetes
  5. Database Connection Pooling
  6. Redis for Distributed Caching
  7. Performance Tuning
  8. Benchmarking and Load Testing
  9. Health Checks and Readiness
  10. Stateless Architecture and Long-Running Connections
  11. Kubernetes Production Deployment
  12. Monitoring and Observability

1. Understanding the GIL and Worker Architecture

The Python Global Interpreter Lock (GIL)

Python's Global Interpreter Lock (GIL) prevents multiple native threads from executing Python bytecode simultaneously. This means:

  • Single worker = Single CPU core usage (even on multi-core systems)
  • I/O-bound workloads (API calls, database queries) benefit from async/await
  • CPU-bound workloads (JSON parsing, encryption) require multiple processes
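A quick, self-contained illustration of the last point (this is a standalone demo, not gateway code): CPU-bound work barely speeds up across threads because of the GIL, but it does across processes.

# gil_demo.py - standalone illustration, not part of MCP Gateway
import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def cpu_bound(n: int = 2_000_000) -> int:
    """Pure-Python loop that holds the GIL while it runs."""
    total = 0
    for i in range(n):
        total += i * i
    return total

def timed(executor_cls) -> float:
    start = time.perf_counter()
    with executor_cls(max_workers=4) as pool:
        list(pool.map(cpu_bound, [2_000_000] * 4))
    return time.perf_counter() - start

if __name__ == "__main__":
    print(f"4 threads:   {timed(ThreadPoolExecutor):.2f}s")   # roughly serial under the GIL
    print(f"4 processes: {timed(ProcessPoolExecutor):.2f}s")  # spreads across CPU cores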

Pydantic v2: Rust-Powered Performance

MCP Gateway leverages Pydantic v2.11+ for all request/response validation and schema definitions. Unlike pure Python libraries, Pydantic v2 includes a Rust-based core (pydantic-core) that significantly improves performance:

Performance benefits:

  • 5-50x faster validation compared to Pydantic v1
  • JSON parsing in Rust (bypasses GIL for serialization/deserialization)
  • Schema validation runs in compiled Rust code
  • Reduced CPU overhead for request processing

Impact on scaling:

  • 5,463 lines of Pydantic schemas (mcpgateway/schemas.py)
  • Every API request validated through Rust-optimized code
  • Lower CPU usage per request = higher throughput per worker
  • Rust components release the GIL during execution

This means that even within a single worker process, Pydantic's Rust core can run concurrently with Python code for validation-heavy workloads.
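As a small illustration of this validation path (the model below is hypothetical, not the gateway's actual schema from mcpgateway/schemas.py):

from pydantic import BaseModel, Field

class JsonRpcRequest(BaseModel):
    """Hypothetical request model for illustration only."""
    jsonrpc: str = Field(default="2.0", pattern=r"^2\.0$")
    id: int | None = None
    method: str
    params: dict = Field(default_factory=dict)

# model_validate_json() parses and validates inside pydantic-core (Rust),
# avoiding a separate json.loads() pass in Python.
req = JsonRpcRequest.model_validate_json(
    '{"jsonrpc": "2.0", "id": 1, "method": "tools/list", "params": {}}'
)
print(req.method)  # tools/list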

MCP Gateway's Solution: Gunicorn with Multiple Workers

MCP Gateway uses Gunicorn with UvicornWorker to spawn multiple worker processes:

# gunicorn.config.py
workers = 8                    # Multiple processes bypass the GIL
worker_class = "uvicorn.workers.UvicornWorker"  # Async support
timeout = 600                  # 10-minute timeout for long-running operations
preload_app = True            # Load app once, then fork (memory efficient)

Key benefits:

  • Each worker is a separate process with its own GIL
  • 8 workers = ability to use 8 CPU cores
  • UvicornWorker enables async I/O within each worker
  • Preloading reduces memory footprint (shared code segments)

The trade-off is that you are running multiple Python interpreter instances, and each consumes additional memory.

Running multiple workers also requires shared state (e.g., Redis or a database) so that all processes see the same data.

2. Vertical Scaling with Gunicorn

Worker Count Calculation

Formula: workers = (2 × CPU_cores) + 1

Examples:

CPU Cores   Recommended Workers   Use Case
1           2-3                   Development/testing
2           4-5                   Small production
4           8-9                   Medium production
8           16-17                 Large production
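The formula can also be applied automatically in the Gunicorn config file; the GUNICORN_WORKERS=auto handling below is an illustrative sketch, not the shipped gunicorn.config.py.

# gunicorn.config.py (sketch)
import multiprocessing
import os

def _worker_count() -> int:
    configured = os.getenv("GUNICORN_WORKERS", "auto")
    if configured.isdigit():
        return int(configured)                      # explicit override
    return (2 * multiprocessing.cpu_count()) + 1    # (2 x cores) + 1 rule

workers = _worker_count()
worker_class = "uvicorn.workers.UvicornWorker"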

Configuration Methods

Environment Variables

# Automatic detection based on CPU cores
export GUNICORN_WORKERS=auto

# Manual override
export GUNICORN_WORKERS=16
export GUNICORN_TIMEOUT=600
export GUNICORN_MAX_REQUESTS=100000
export GUNICORN_MAX_REQUESTS_JITTER=100
export GUNICORN_PRELOAD_APP=true

Kubernetes ConfigMap

# charts/mcp-stack/values.yaml
mcpContextForge:
  config:
    GUNICORN_WORKERS: "16"               # Number of worker processes
    GUNICORN_TIMEOUT: "600"              # Worker timeout (seconds)
    GUNICORN_MAX_REQUESTS: "100000"      # Requests before worker restart
    GUNICORN_MAX_REQUESTS_JITTER: "100"  # Prevents thundering herd
    GUNICORN_PRELOAD_APP: "true"         # Memory optimization

Resource Allocation

CPU: Allocate 1 CPU core per 2 workers (allows for I/O wait)

Memory:

  • Base: 256MB
  • Per worker: 128-256MB (depending on workload)
  • Formula: memory = 256 + (workers × 200) MB

Example for 16 workers:

  • CPU: 8-10 cores (allows headroom)
  • Memory: 3.5-4 GB (256 + 16 × 200 = 3,456 MB ≈ 3.5 GB)

# Kubernetes resource limits
resources:
  limits:
    cpu: 10000m        # 10 cores
    memory: 4Gi
  requests:
    cpu: 8000m         # 8 cores
    memory: 3584Mi     # 3.5GB
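If you script your sizing, the same arithmetic as a small helper (this only encodes the rule-of-thumb formulas above; it is not a gateway utility):

def pod_sizing(workers: int) -> dict:
    """Apply the rule-of-thumb formulas above to one pod/instance."""
    return {
        "cpu_cores": workers / 2,             # 1 core per 2 workers
        "memory_mb": 256 + workers * 200,     # base + per-worker overhead
    }

print(pod_sizing(16))  # {'cpu_cores': 8.0, 'memory_mb': 3456}  -> roughly 3.5 GB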

3. Future: Python 3.14 and PostgreSQL 18

Python 3.14 (Free-Threaded Mode)

Status: Beta (as of July 2025) - PEP 703

Python 3.14 introduces optional free-threading (GIL removal), a groundbreaking change that enables true parallel multi-threading:

# Enable free-threading mode
python3.14 -X gil=0 -m gunicorn ...

# Or use PYTHON_GIL environment variable
PYTHON_GIL=0 python3.14 -m gunicorn ...

Performance characteristics:

Workload Type                Expected Impact
Single-threaded              3-15% slower (overhead from thread-safety mechanisms)
Multi-threaded (I/O-bound)   Minimal impact (already benefits from async/await)
Multi-threaded (CPU-bound)   Near-linear scaling with CPU cores
Multi-process (current)      No change (already bypasses GIL)

Benefits when available:

  • True parallel threads: Multiple threads execute Python code simultaneously
  • Lower memory overhead: Threads share memory (vs. separate processes)
  • Faster inter-thread communication: Shared memory, no IPC overhead
  • Better resource efficiency: One interpreter instance instead of multiple processes

Trade-offs:

  • Single-threaded penalty: 3-15% slower due to fine-grained locking
  • Library compatibility: Some C extensions need updates (most popular libraries already compatible)
  • Different scaling model: Move from workers=16 to workers=2 --threads=32

Migration strategy:

  1. Now (Python 3.11-3.13): Continue using multi-process Gunicorn

    workers = 16                    # Multiple processes
    worker_class = "uvicorn.workers.UvicornWorker"
    

  2. Python 3.14 beta: Test in staging environment

    # Build free-threaded Python
    ./configure --disable-gil
    make
    
    # Test with free-threading
    PYTHON_GIL=0 python3.14 -m pytest tests/
    

  3. Python 3.14 stable: Evaluate hybrid approach

    workers = 4                     # Fewer processes
    threads = 8                     # More threads per process
    worker_class = "uvicorn.workers.UvicornWorker"
    

  4. Post-migration: Thread-based scaling

    workers = 2                     # Minimal processes
    threads = 32                    # Scale with threads
    preload_app = True              # Single app load
    

Current recommendation:

  • Production: Use Python 3.11-3.13 with multi-process Gunicorn (proven, stable)
  • Testing: Experiment with Python 3.14 beta in non-production environments
  • Monitoring: Watch for library compatibility announcements

Why MCP Gateway is well-positioned for free-threading:

MCP Gateway's architecture already benefits from components that will perform even better with Python 3.14:

  1. Pydantic v2 Rust core: Already bypasses GIL for validation - will work seamlessly with free-threading
  2. FastAPI/Uvicorn: Built for async I/O - natural fit for thread-based concurrency
  3. SQLAlchemy async: Database operations already non-blocking
  4. Stateless design: No shared mutable state between requests

Resources:

  • Python 3.14 Free-Threading Guide
  • PEP 703: Making the GIL Optional
  • Python 3.14 Release Schedule
  • Pydantic v2 Performance

PostgreSQL 18 (Async I/O)

Status: Development (expected 2025)

PostgreSQL 18 introduces native async I/O:

  • Improved connection handling: Better async query performance
  • Reduced latency: Non-blocking I/O operations
  • Better scalability: Efficient connection multiplexing

Current recommendation: PostgreSQL 16+ (stable async support via asyncpg)

# Production-ready now
DATABASE_URL=postgresql+asyncpg://user:pass@postgres:5432/mcp

4. Horizontal Scaling with Kubernetes

Architecture Overview

┌──────────────────────────────────────────────┐
│                Load Balancer                 │
│             (Kubernetes Service)             │
└──────────┬────────────────────────┬──────────┘
           │                        │
 ┌─────────▼────────┐     ┌─────────▼────────┐
 │  Gateway Pod 1   │     │  Gateway Pod 2   │
 │  (8 workers)     │     │  (8 workers)     │
 └─────────┬────────┘     └─────────┬────────┘
           │                        │
           └────────────┬───────────┘
                        │
          ┌─────────────┴─────────────┐
          │                           │
    ┌─────▼──────┐             ┌──────▼───────┐
    │ PostgreSQL │             │     Redis    │
    │  (shared)  │             │   (shared)   │
    └────────────┘             └──────────────┘

Shared State Requirements

For multi-pod deployments:

  1. Shared PostgreSQL: All data (servers, tools, users, teams)
  2. Shared Redis: Distributed caching and session management
  3. Stateless pods: No local state, can be killed/restarted anytime

Kubernetes Deployment

Helm Chart Configuration

# charts/mcp-stack/values.yaml
mcpContextForge:
  replicaCount: 3                   # Start with 3 pods

  # Horizontal Pod Autoscaler
  hpa:
    enabled: true
    minReplicas: 3                  # Never scale below 3
    maxReplicas: 20                 # Scale up to 20 pods
    targetCPUUtilizationPercentage: 70    # Scale at 70% CPU
    targetMemoryUtilizationPercentage: 80 # Scale at 80% memory

  # Pod resources
  resources:
    limits:
      cpu: 2000m                    # 2 cores per pod
      memory: 4Gi
    requests:
      cpu: 1000m                    # 1 core per pod
      memory: 2Gi

  # Environment configuration
  config:
    GUNICORN_WORKERS: "8"           # 8 workers per pod
    CACHE_TYPE: redis               # Shared cache
    DB_POOL_SIZE: "50"              # Per-pod pool size

# Shared PostgreSQL
postgres:
  enabled: true
  resources:
    limits:
      cpu: 4000m                    # 4 cores
      memory: 8Gi
    requests:
      cpu: 2000m
      memory: 4Gi

  # Important: Set max_connections
  # Formula: (num_pods × DB_POOL_SIZE × 1.2) + 20
  # Example: (20 pods × 50 pool × 1.2) + 20 = 1220
  config:
    max_connections: 1500           # Adjust based on scale

# Shared Redis
redis:
  enabled: true
  resources:
    limits:
      cpu: 2000m
      memory: 4Gi
    requests:
      cpu: 1000m
      memory: 2Gi

Deploy with Helm

# Install/upgrade with custom values
helm upgrade --install mcp-stack ./charts/mcp-stack \
  --namespace mcp-gateway \
  --create-namespace \
  --values production-values.yaml

# Verify HPA
kubectl get hpa -n mcp-gateway

Horizontal Scaling Calculation

Total capacity = pods × workers × requests_per_second

Example: 10 pods × 8 workers × 100 RPS = 8,000 RPS

Database connections needed (each Gunicorn worker holds its own pool; see Section 5):

  • 10 pods × 8 workers × 50 pool size = 4,000 connections
  • Add 20% overhead = 4,800 connections
  • Set max_connections=5000 (buffer for maintenance)
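The same capacity math as a quick script (a sketch of the formulas above; the RPS a worker sustains is workload-dependent and should come from your own benchmarks):

def cluster_estimate(pods: int, workers_per_pod: int, rps_per_worker: int,
                     pool_size: int, overhead: float = 1.2) -> dict:
    """Rough throughput and connection-count estimates for a deployment."""
    return {
        "total_rps": pods * workers_per_pod * rps_per_worker,
        "db_connections": int(pods * workers_per_pod * pool_size * overhead),
    }

print(cluster_estimate(pods=10, workers_per_pod=8, rps_per_worker=100, pool_size=50))
# {'total_rps': 8000, 'db_connections': 4800}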


5. Database Connection Pooling

Connection Pool Architecture

SQLAlchemy manages a connection pool per process:

Pod 1 (8 workers) → 8 connection pools → PostgreSQL
Pod 2 (8 workers) → 8 connection pools → PostgreSQL
Pod N (8 workers) → 8 connection pools → PostgreSQL

Pool Configuration

Environment Variables

# Connection pool settings
DB_POOL_SIZE=50              # Persistent connections per worker
DB_MAX_OVERFLOW=10           # Additional connections allowed
DB_POOL_TIMEOUT=60           # Wait time before timeout (seconds)
DB_POOL_RECYCLE=3600         # Recycle connections after 1 hour
DB_MAX_RETRIES=5             # Retry attempts on failure
DB_RETRY_INTERVAL_MS=2000    # Retry interval

Configuration in Code

# mcpgateway/config.py
@property
def database_settings(self) -> dict:
    return {
        "pool_size": self.db_pool_size,          # 50
        "max_overflow": self.db_max_overflow,    # 10
        "pool_timeout": self.db_pool_timeout,    # 60s
        "pool_recycle": self.db_pool_recycle,    # 3600s
    }
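For reference, this is roughly how such settings reach SQLAlchemy's pool when the engine is created; a sketch using the public async API, not code copied from the gateway source:

from sqlalchemy.ext.asyncio import create_async_engine

engine = create_async_engine(
    "postgresql+asyncpg://user:pass@postgres:5432/mcp",
    pool_size=50,        # DB_POOL_SIZE
    max_overflow=10,     # DB_MAX_OVERFLOW
    pool_timeout=60,     # DB_POOL_TIMEOUT (seconds)
    pool_recycle=3600,   # DB_POOL_RECYCLE (seconds)
    pool_pre_ping=True,  # drop dead connections before handing them out
)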

PostgreSQL Configuration

Calculate max_connections

# Formula
max_connections = (num_pods × num_workers × pool_size × 1.2) + buffer

# Example: 10 pods, 8 workers, 50 pool size
max_connections = (10 × 8 × 50 × 1.2) + 200 = 5000 connections

PostgreSQL Configuration File

# postgresql.conf
max_connections = 5000
shared_buffers = 16GB              # 25% of RAM
effective_cache_size = 48GB        # 75% of RAM
work_mem = 16MB                    # Per operation
maintenance_work_mem = 2GB

Managed Services

IBM Cloud Databases for PostgreSQL:

# Increase max_connections via CLI
ibmcloud cdb deployment-configuration postgres \
  --configuration max_connections=5000

AWS RDS:

# Via parameter group
max_connections = {DBInstanceClassMemory/9531392}

Google Cloud SQL:

# Auto-scales based on instance size
# 4 vCPU = 400 connections
# 8 vCPU = 800 connections

Connection Pool Monitoring

# Health endpoint checks pool status
@app.get("/health")
async def healthcheck(db: Session = Depends(get_db)):
    try:
        db.execute(text("SELECT 1"))
        return {"status": "healthy"}
    except Exception as e:
        return {"status": "unhealthy", "error": str(e)}

# Check PostgreSQL connections
kubectl exec -it postgres-pod -- psql -U admin -d postgresdb \
  -c "SELECT count(*) FROM pg_stat_activity;"
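If you want pool utilization rather than just connectivity, SQLAlchemy's default QueuePool exposes live counters. The endpoint below is an illustrative sketch (the /metrics/db-pool route is assumed, not a built-in gateway API):

from fastapi import FastAPI
from sqlalchemy import create_engine

app = FastAPI()
engine = create_engine("postgresql://user:pass@postgres:5432/mcp", pool_size=50)

@app.get("/metrics/db-pool")
def db_pool_stats() -> dict:
    pool = engine.pool  # default QueuePool
    return {
        "size": pool.size(),               # configured pool_size
        "checked_out": pool.checkedout(),  # connections currently in use
        "checked_in": pool.checkedin(),    # idle connections in the pool
        "overflow": pool.overflow(),       # connections beyond pool_size
    }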

6. Redis for Distributed Caching

Architecture

Redis provides shared state across all Gateway pods:

  • Session storage: User sessions (TTL: 3600s)
  • Message cache: Ephemeral data (TTL: 600s)
  • Federation cache: Gateway peer discovery

Configuration

Enable Redis Caching

# .env or Kubernetes ConfigMap
CACHE_TYPE=redis
REDIS_URL=redis://redis-service:6379/0
CACHE_PREFIX=mcpgw:
SESSION_TTL=3600
MESSAGE_TTL=600
REDIS_MAX_RETRIES=3
REDIS_RETRY_INTERVAL_MS=2000
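A minimal sketch of how sessions can be written to Redis with these TTLs using redis-py; the key layout and helper names are illustrative, not the gateway's actual cache schema:

import json
import os

import redis

r = redis.Redis.from_url(os.getenv("REDIS_URL", "redis://redis-service:6379/0"))
PREFIX = os.getenv("CACHE_PREFIX", "mcpgw:")
SESSION_TTL = int(os.getenv("SESSION_TTL", "3600"))

def store_session(session_id: str, data: dict) -> None:
    # SETEX writes the value and its expiry atomically.
    r.setex(f"{PREFIX}session:{session_id}", SESSION_TTL, json.dumps(data))

def load_session(session_id: str) -> dict | None:
    raw = r.get(f"{PREFIX}session:{session_id}")
    return json.loads(raw) if raw else None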

Kubernetes Deployment

# charts/mcp-stack/values.yaml
redis:
  enabled: true

  resources:
    limits:
      cpu: 2000m
      memory: 4Gi
    requests:
      cpu: 1000m
      memory: 2Gi

  # Enable persistence
  persistence:
    enabled: true
    size: 10Gi

Redis Sizing

Memory calculation:

  • Sessions: concurrent_users × 50KB
  • Messages: messages_per_minute × 100KB × (TTL/60)

Example:

  • 10,000 users × 50KB = 500MB
  • 1,000 msg/min × 100KB × 10min = 1GB
  • Total: 1.5GB + 50% overhead ≈ 2.5GB

High Availability

Redis Sentinel (3+ nodes):

redis:
  sentinel:
    enabled: true
    quorum: 2

  replicas: 3  # 1 primary + 2 replicas

Redis Cluster (6+ nodes):

REDIS_URL=redis://redis-cluster:6379/0?cluster=true


7. Performance Tuning

Application Architecture Performance

MCP Gateway's technology stack is optimized for high performance:

Natively Compiled Components:

  • Pydantic v2 (5-50x faster validation via Rust core)
  • Uvicorn (ASGI server with C-accelerated HTTP parsing via httptools)

Async-First Design:

  • FastAPI (async request handling)
  • SQLAlchemy 2.0 (async database operations)
  • asyncio event loop per worker

Performance characteristics:

  • Request validation: < 1ms (Pydantic v2 Rust core)
  • JSON serialization: 3-5x faster than pure Python
  • Database queries: Non-blocking async I/O
  • Concurrent requests per worker: 1000+ (async event loop)

System-Level Optimization

Kernel Parameters

# /etc/sysctl.conf
net.core.somaxconn=4096
net.ipv4.tcp_max_syn_backlog=4096
net.ipv4.ip_local_port_range=1024 65535
net.ipv4.tcp_tw_reuse=1
fs.file-max=2097152

# Apply changes
sysctl -p

File Descriptors

# /etc/security/limits.conf
* soft nofile 1048576
* hard nofile 1048576

# Verify
ulimit -n

Gunicorn Tuning

Optimal Settings

# gunicorn.config.py
import multiprocessing

workers = multiprocessing.cpu_count() * 2 + 1
timeout = 600                    # Long enough for LLM calls
max_requests = 100000            # Prevent memory leaks
max_requests_jitter = 100        # Randomize restart
preload_app = True              # Reduce memory
reuse_port = True               # Load balance across workers

Worker Class Selection

UvicornWorker (default - best for async):

worker_class = "uvicorn.workers.UvicornWorker"

Gevent (alternative for I/O-heavy):

# pip install gunicorn[gevent]
worker_class = "gevent"
worker_connections = 1000

Application Tuning

# Resource limits
TOOL_TIMEOUT=60
TOOL_CONCURRENT_LIMIT=10
RESOURCE_CACHE_SIZE=1000
RESOURCE_CACHE_TTL=3600

# Retry configuration
RETRY_MAX_ATTEMPTS=3
RETRY_BASE_DELAY=1.0
RETRY_MAX_DELAY=60

# Health check intervals
HEALTH_CHECK_INTERVAL=60
HEALTH_CHECK_TIMEOUT=10
UNHEALTHY_THRESHOLD=3
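For context, this is the usual shape of retry handling that settings such as RETRY_MAX_ATTEMPTS, RETRY_BASE_DELAY, and RETRY_MAX_DELAY control: exponential backoff with jitter. A generic sketch, not the gateway's implementation:

import asyncio
import random

async def call_with_retry(func, *, max_attempts: int = 3,
                          base_delay: float = 1.0, max_delay: float = 60.0):
    """Retry an async callable with capped exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await func()
        except Exception:
            if attempt == max_attempts:
                raise
            delay = min(base_delay * 2 ** (attempt - 1), max_delay)
            await asyncio.sleep(delay + random.uniform(0, 0.1 * delay))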

8. Benchmarking and Load Testing

Tools

hey - HTTP load generator

# Install
brew install hey           # macOS
sudo apt install hey       # Ubuntu

# Or from source
go install github.com/rakyll/hey@latest

k6 - Modern load testing

brew install k6            # macOS

Baseline Test

Prepare Environment

# Get JWT token
export MCPGATEWAY_BEARER_TOKEN=$(python3 -m mcpgateway.utils.create_jwt_token \
  --username admin@example.com --exp 0 --secret my-test-key)

# Create test payload
cat > payload.json <<EOF
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/list",
  "params": {}
}
EOF

Run Load Test

#!/bin/bash
# test-load.sh

# Test parameters
REQUESTS=10000
CONCURRENCY=200
URL="http://localhost:4444/"

# Run test
hey -n $REQUESTS -c $CONCURRENCY \
    -m POST \
    -T application/json \
    -H "Authorization: Bearer $MCPGATEWAY_BEARER_TOKEN" \
    -D payload.json \
    $URL

Interpret Results

Summary:
  Total:        5.2341 secs
  Slowest:      0.5234 secs
  Fastest:      0.0123 secs
  Average:      0.1045 secs
  Requests/sec: 1910.5623      ← Target metric

Status code distribution:
  [200] 10000 responses

Response time histogram:
  0.012 [1]     |
  0.050 [2341]  |■■■■■■■■■■■
  0.100 [4523]  |■■■■■■■■■■■■■■■■■■■■■■
  0.150 [2234]  |■■■■■■■■■■■
  0.200 [901]   |■■■■
  0.250 [0]     |

Key metrics:

  • Requests/sec: Throughput (target: >1000 RPS per pod)
  • P99 latency: 99th percentile (target: <500ms)
  • Error rate: 5xx responses (target: <0.1%)
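If you collect raw per-request latencies with your own client, the target metrics can be computed with the standard library; a small sketch (the sample data here is synthetic):

import statistics

def summarize(latencies_s: list[float], wall_time_s: float) -> dict:
    """Throughput and latency percentiles from raw samples (in seconds)."""
    p99 = statistics.quantiles(latencies_s, n=100)[98]   # 99th percentile
    return {
        "requests_per_sec": len(latencies_s) / wall_time_s,
        "mean_ms": statistics.mean(latencies_s) * 1000,
        "p99_ms": p99 * 1000,
    }

samples = [0.05 + 0.0001 * i for i in range(1000)]       # synthetic latencies
print(summarize(samples, wall_time_s=60.0))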

Kubernetes Load Test

# Deploy test pod
kubectl run load-test --image=williamyeh/hey:latest \
  --rm -it --restart=Never -- \
  -n 100000 -c 500 \
  -H "Authorization: Bearer $TOKEN" \
  http://mcp-gateway-service/

Advanced: k6 Script

// load-test.k6.js
import http from 'k6/http';
import { check } from 'k6';

export let options = {
  stages: [
    { duration: '2m', target: 100 },   // Ramp up
    { duration: '5m', target: 100 },   // Sustained
    { duration: '2m', target: 500 },   // Spike
    { duration: '5m', target: 500 },   // High load
    { duration: '2m', target: 0 },     // Ramp down
  ],
  thresholds: {
    http_req_duration: ['p(99)<500'],  // 99% < 500ms
    http_req_failed: ['rate<0.01'],    // <1% errors
  },
};

export default function () {
  const payload = JSON.stringify({
    jsonrpc: '2.0',
    id: 1,
    method: 'tools/list',
    params: {},
  });

  const res = http.post('http://localhost:4444/', payload, {
    headers: {
      'Content-Type': 'application/json',
      'Authorization': `Bearer ${__ENV.TOKEN}`,
    },
  });

  check(res, {
    'status is 200': (r) => r.status === 200,
    'response time < 500ms': (r) => r.timings.duration < 500,
  });
}

# Run k6 test
TOKEN=$MCPGATEWAY_BEARER_TOKEN k6 run load-test.k6.js

9. Health Checks and Readiness

Health Check Endpoints

MCP Gateway provides two health endpoints:

Liveness Probe: /health

Purpose: Is the application alive?

@app.get("/health")
async def healthcheck(db: Session = Depends(get_db)):
    """Check database connectivity"""
    try:
        db.execute(text("SELECT 1"))
        return {"status": "healthy"}
    except Exception as e:
        return {"status": "unhealthy", "error": str(e)}

Response:

{
  "status": "healthy"
}

Readiness Probe: /ready

Purpose: Is the application ready to receive traffic?

@app.get("/ready")
async def readiness_check(db: Session = Depends(get_db)):
    """Check if ready to serve traffic"""
    try:
        await asyncio.to_thread(db.execute, text("SELECT 1"))
        return JSONResponse({"status": "ready"}, status_code=200)
    except Exception as e:
        return JSONResponse(
            {"status": "not ready", "error": str(e)},
            status_code=503
        )

Kubernetes Probe Configuration

# charts/mcp-stack/templates/deployment-mcpgateway.yaml
containers:
  - name: mcp-context-forge

    # Startup probe (initial readiness)
    startupProbe:
      exec:
        command:
          - python3
          - /app/mcpgateway/utils/db_isready.py
          - --max-tries=1
          - --timeout=2
      initialDelaySeconds: 10
      periodSeconds: 5
      failureThreshold: 60        # 5 minutes max

    # Readiness probe (traffic routing)
    readinessProbe:
      httpGet:
        path: /ready
        port: 4444
      initialDelaySeconds: 15
      periodSeconds: 10
      timeoutSeconds: 2
      successThreshold: 1
      failureThreshold: 3

    # Liveness probe (restart if unhealthy)
    livenessProbe:
      httpGet:
        path: /health
        port: 4444
      initialDelaySeconds: 10
      periodSeconds: 15
      timeoutSeconds: 2
      successThreshold: 1
      failureThreshold: 3

Probe Tuning Guidelines

Startup Probe:

  • Use for slow initialization (database migrations, model loading)
  • failureThreshold × periodSeconds = max startup time
  • Example: 60 × 5s = 5 minutes

Readiness Probe:

  • Aggressive: Remove pod from load balancer quickly
  • failureThreshold = 3 (fail fast)
  • periodSeconds = 10 (frequent checks)

Liveness Probe:

  • Conservative: Avoid unnecessary restarts
  • failureThreshold = 5 (tolerate transient issues)
  • periodSeconds = 15 (less frequent)

Monitoring Health

# Check pod health
kubectl get pods -n mcp-gateway

# Detailed status
kubectl describe pod <pod-name> -n mcp-gateway

# Check readiness
kubectl get pods -n mcp-gateway \
  -o jsonpath='{.items[*].status.conditions[?(@.type=="Ready")].status}'

# Test health endpoint
kubectl exec -it <pod-name> -n mcp-gateway -- \
  curl http://localhost:4444/health

# View probe failures
kubectl get events -n mcp-gateway \
  --field-selector involvedObject.name=<pod-name>

10. Stateless Architecture and Long-Running Connections

Stateless Design Principles

MCP Gateway is designed to be stateless, enabling horizontal scaling:

  1. No local session storage: All sessions in Redis
  2. No in-memory caching (in production): Use Redis
  3. Database-backed state: All data in PostgreSQL
  4. Shared configuration: Environment variables via ConfigMap

Session Management

USE_STATEFUL_SESSIONS=true  # Event store in database

Limitations:

  • Sessions tied to specific pods
  • Requires sticky sessions (session affinity)
  • Doesn't scale horizontally

USE_STATEFUL_SESSIONS=false
JSON_RESPONSE_ENABLED=true
CACHE_TYPE=redis

Benefits:

  • Any pod can handle any request
  • True horizontal scaling
  • Automatic failover

Long-Running Connections

MCP Gateway supports long-running connections for streaming:

Server-Sent Events (SSE)

# Endpoint: /servers/{id}/sse
@app.get("/servers/{server_id}/sse")
async def sse_endpoint(server_id: int):
    """Stream events to client"""
    # Connection can last minutes/hours

WebSocket

# Endpoint: /servers/{id}/ws
@app.websocket("/servers/{server_id}/ws")
async def websocket_endpoint(server_id: int):
    """Bidirectional streaming"""

Load Balancer Configuration

Kubernetes Service (default):

# Distributes connections across pods
apiVersion: v1
kind: Service
metadata:
  name: mcp-gateway-service
spec:
  type: ClusterIP
  sessionAffinity: None        # No sticky sessions
  ports:
    - port: 80
      targetPort: 4444

NGINX Ingress (for WebSocket):

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "3600"
    nginx.ingress.kubernetes.io/websocket-services: "mcp-gateway-service"
spec:
  rules:
    - host: gateway.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: mcp-gateway-service
                port:
                  number: 80

Connection Lifecycle

Client → Load Balancer → Pod A (SSE stream)
                ↓
            (Pod A dies)
                ↓
Client ← Load Balancer → Pod B (reconnect)

Best practices:

  1. Client implements reconnection logic
  2. Server sets SSE_KEEPALIVE_INTERVAL=30 (keepalive events)
  3. Load balancer timeout > keepalive interval
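A sketch of the server-side keepalive pattern from point 2 above (the /demo/sse endpoint is illustrative, not the gateway's /servers/{id}/sse implementation): a comment line is emitted every SSE_KEEPALIVE_INTERVAL seconds so idle connections are not dropped by intermediaries.

import asyncio
import os

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()
KEEPALIVE = int(os.getenv("SSE_KEEPALIVE_INTERVAL", "30"))

@app.get("/demo/sse")
async def demo_sse():
    async def event_stream():
        while True:
            # SSE comment lines (leading ':') are ignored by clients but keep
            # the connection, and any proxies in between, from timing out.
            yield ": keepalive\n\n"
            await asyncio.sleep(KEEPALIVE)
    return StreamingResponse(event_stream(), media_type="text/event-stream")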


11. Kubernetes Production Deployment

Reference Architecture

# production-values.yaml
mcpContextForge:
  # --- Scaling ---
  replicaCount: 5

  hpa:
    enabled: true
    minReplicas: 5
    maxReplicas: 50
    targetCPUUtilizationPercentage: 70
    targetMemoryUtilizationPercentage: 80

  # --- Resources ---
  resources:
    limits:
      cpu: 4000m          # 4 cores per pod
      memory: 8Gi
    requests:
      cpu: 2000m          # 2 cores per pod
      memory: 4Gi

  # --- Configuration ---
  config:
    # Gunicorn
    GUNICORN_WORKERS: "16"
    GUNICORN_TIMEOUT: "600"
    GUNICORN_MAX_REQUESTS: "100000"
    GUNICORN_PRELOAD_APP: "true"

    # Database
    DB_POOL_SIZE: "50"
    DB_MAX_OVERFLOW: "10"
    DB_POOL_TIMEOUT: "60"
    DB_POOL_RECYCLE: "3600"

    # Cache
    CACHE_TYPE: redis
    CACHE_PREFIX: mcpgw:
    SESSION_TTL: "3600"
    MESSAGE_TTL: "600"

    # Performance
    TOOL_CONCURRENT_LIMIT: "20"
    RESOURCE_CACHE_SIZE: "2000"

  # --- Health Checks ---
  probes:
    startup:
      type: exec
      command: ["python3", "/app/mcpgateway/utils/db_isready.py"]
      periodSeconds: 5
      failureThreshold: 60

    readiness:
      type: http
      path: /ready
      port: 4444
      periodSeconds: 10
      failureThreshold: 3

    liveness:
      type: http
      path: /health
      port: 4444
      periodSeconds: 15
      failureThreshold: 5

# --- PostgreSQL ---
postgres:
  enabled: true

  resources:
    limits:
      cpu: 8000m          # 8 cores
      memory: 32Gi
    requests:
      cpu: 4000m
      memory: 16Gi

  persistence:
    enabled: true
    size: 100Gi
    storageClassName: fast-ssd

  # Connection limits
  # max_connections = (50 pods × 16 workers × 50 pool × 1.2) + 200
  config:
    max_connections: 50000
    shared_buffers: 8GB
    effective_cache_size: 24GB
    work_mem: 32MB

# --- Redis ---
redis:
  enabled: true

  resources:
    limits:
      cpu: 4000m
      memory: 16Gi
    requests:
      cpu: 2000m
      memory: 8Gi

  persistence:
    enabled: true
    size: 50Gi

Deployment Steps

# 1. Create namespace
kubectl create namespace mcp-gateway

# 2. Create secrets
kubectl create secret generic mcp-secrets \
  -n mcp-gateway \
  --from-literal=JWT_SECRET_KEY=$(openssl rand -hex 32) \
  --from-literal=AUTH_ENCRYPTION_SECRET=$(openssl rand -hex 32) \
  --from-literal=POSTGRES_PASSWORD=$(openssl rand -base64 32)

# 3. Install with Helm
helm upgrade --install mcp-stack ./charts/mcp-stack \
  -n mcp-gateway \
  -f production-values.yaml \
  --wait \
  --timeout 10m

# 4. Verify deployment
kubectl get pods -n mcp-gateway
kubectl get hpa -n mcp-gateway
kubectl get svc -n mcp-gateway

# 5. Verify the migration job completed
kubectl get jobs -n mcp-gateway

# 6. Test scaling
kubectl top pods -n mcp-gateway

Pod Disruption Budget

# pdb.yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: mcp-gateway-pdb
  namespace: mcp-gateway
spec:
  minAvailable: 3         # Keep 3 pods always running
  selector:
    matchLabels:
      app: mcp-gateway

Network Policies

# network-policy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: mcp-gateway-policy
  namespace: mcp-gateway
spec:
  podSelector:
    matchLabels:
      app: mcp-gateway
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: ingress-nginx
      ports:
        - protocol: TCP
          port: 4444
  egress:
    - to:
        - podSelector:
            matchLabels:
              app: postgres
      ports:
        - protocol: TCP
          port: 5432
    - to:
        - podSelector:
            matchLabels:
              app: redis
      ports:
        - protocol: TCP
          port: 6379

12. Monitoring and Observability

OpenTelemetry Integration

MCP Gateway includes built-in OpenTelemetry support:

# Enable observability
OTEL_ENABLE_OBSERVABILITY=true
OTEL_TRACES_EXPORTER=otlp
OTEL_EXPORTER_OTLP_ENDPOINT=http://collector:4317
OTEL_SERVICE_NAME=mcp-gateway

Prometheus Metrics

Deploy Prometheus stack:

# Add Prometheus Helm repo
helm repo add prometheus-community \
  https://prometheus-community.github.io/helm-charts

# Install kube-prometheus-stack
helm install prometheus prometheus-community/kube-prometheus-stack \
  -n monitoring \
  --create-namespace

Key Metrics to Monitor

Application Metrics:

  • Request rate: rate(http_requests_total[1m])
  • Latency: histogram_quantile(0.99, http_request_duration_seconds)
  • Error rate: rate(http_requests_total{status=~"5.."}[1m])

System Metrics:

  • CPU usage: container_cpu_usage_seconds_total
  • Memory usage: container_memory_working_set_bytes
  • Network I/O: container_network_receive_bytes_total

Database Metrics:

  • Connection pool usage: db_pool_connections_active / db_pool_size
  • Query latency: db_query_duration_seconds
  • Deadlocks: pg_stat_database_deadlocks

HPA Metrics:

kubectl get hpa -n mcp-gateway -w

Grafana Dashboards

Import dashboards:

  1. Kubernetes Cluster Monitoring (ID: 7249)
  2. PostgreSQL (ID: 9628)
  3. Redis (ID: 11835)
  4. NGINX Ingress (ID: 9614)

Alerting Rules

# prometheus-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: mcp-gateway-alerts
  namespace: monitoring
spec:
  groups:
    - name: mcp-gateway
      interval: 30s
      rules:
        - alert: HighErrorRate
          expr: |
            rate(http_requests_total{status=~"5..", namespace="mcp-gateway"}[5m]) > 0.05
          for: 5m
          annotations:
            summary: "High error rate detected"

        - alert: HighLatency
          expr: |
            histogram_quantile(0.99,
              rate(http_request_duration_seconds_bucket[5m])) > 1
          for: 5m
          annotations:
            summary: "P99 latency exceeds 1s"

        - alert: DatabaseConnectionPoolExhausted
          expr: |
            db_pool_connections_active / db_pool_size > 0.9
          for: 2m
          annotations:
            summary: "Database connection pool >90% utilized"

Summary and Checklist

Performance Technology Stack

MCP Gateway is built on a high-performance foundation:

✅ Pydantic v2.11+ - Rust-powered validation (5-50x faster than v1)
✅ FastAPI - Modern async framework with OpenAPI support
✅ Uvicorn - ASGI server with C-accelerated HTTP parsing (httptools)
✅ SQLAlchemy 2.0 - Async database operations
✅ Python 3.11+ - Current stable with excellent performance
🔮 Python 3.14 - Future free-threading support (beta)

Scaling Checklist

  • Vertical Scaling
  • Configure Gunicorn workers: (2 × CPU) + 1
  • Allocate CPU: 1 core per 2 workers
  • Allocate memory: 256MB + (workers × 200MB)

  • Horizontal Scaling

  • Deploy to Kubernetes with HPA enabled
  • Set minReplicas ≥ 3 for high availability
  • Configure shared PostgreSQL and Redis

  • Database Optimization

  • Calculate max_connections: (pods × workers × pool) × 1.2
  • Set DB_POOL_SIZE per worker (recommended: 50)
  • Configure DB_POOL_RECYCLE=3600 to prevent stale connections

  • Caching

  • Enable Redis: CACHE_TYPE=redis
  • Set REDIS_URL to shared Redis instance
  • Configure TTLs: SESSION_TTL=3600, MESSAGE_TTL=600

  • Performance

  • Tune Gunicorn: GUNICORN_PRELOAD_APP=true
  • Set timeouts: GUNICORN_TIMEOUT=600
  • Configure retries: RETRY_MAX_ATTEMPTS=3

  • Health Checks

  • Configure /health liveness probe
  • Configure /ready readiness probe
  • Set appropriate thresholds and timeouts

  • Monitoring

  • Enable OpenTelemetry: OTEL_ENABLE_OBSERVABILITY=true
  • Deploy Prometheus and Grafana
  • Configure alerts for errors, latency, and resources

  • Load Testing

  • Benchmark with hey or k6
  • Target: >1000 RPS per pod, P99 <500ms
  • Test failover scenarios

Reference Documentation


Additional Resources

Community


Last updated: 2025-10-02