Performance Profiling Guide¶
This guide covers tools and techniques for profiling ContextForge performance under load. Use these methods to identify bottlenecks, optimize queries, and diagnose production issues.
Quick Reference¶
| Tool | Purpose | When to Use |
|---|---|---|
| Locust | Load testing | Simulate concurrent users |
| PostgreSQL EXPLAIN | Query analysis | Find slow/inefficient queries |
| pg_stat_activity | Connection monitoring | Debug idle transactions |
| pg_stat_user_tables | Table scan stats | Find full table scans |
| py-spy | Python CPU profiling | Find CPU hotspots |
| memray | Python memory profiling | Find memory leaks and allocation hotspots |
| docker stats | Resource monitoring | Track CPU/memory usage |
| Redis CLI | Cache analysis | Check hit rates |
| perf / cargo flamegraph | Rust CPU profiling | Inspect Rust MCP runtime hotspots |
Load Testing with Locust¶
Starting a Load Test¶
# Start Locust web UI
make load-test-ui
# Open browser to http://localhost:8089
# Configure users (e.g., 3000) and spawn rate (e.g., 100/s)
Monitoring Locust Stats via API¶
# Get current stats as JSON
curl -s http://localhost:8089/stats/requests | python3 -c "
import sys, json
data = json.load(sys.stdin)
print('=== TOP SLOWEST ENDPOINTS ===')
stats = sorted(data.get('stats', []), key=lambda x: x.get('avg_response_time', 0), reverse=True)[:10]
print(f\"{'Endpoint':<45} {'Reqs':>8} {'Avg':>8} {'P95':>8} {'P99':>8}\")
print('-' * 85)
for s in stats:
    name = s.get('name', '')[:43]
    p95 = s.get('response_time_percentile_0.95', 0)
    p99 = s.get('response_time_percentile_0.99', 0)
    print(f\"{name:<45} {s.get('num_requests', 0):>8} {s.get('avg_response_time', 0):>8.0f} {p95:>8.0f} {p99:>8.0f}\")
print()
print(f\"RPS: {data.get('total_rps', 0):.1f}, Users: {data.get('user_count', 0)}, Failures: {data.get('total_fail_count', 0)}\")
"
Checking for Errors¶
curl -s http://localhost:8089/stats/requests | python3 -c "
import sys, json
data = json.load(sys.stdin)
print('=== ERRORS ===')
for e in data.get('errors', []):
    print(f\" {e.get('name')}: {e.get('occurrences')} - {e.get('error')[:80]}\")
"
PostgreSQL Profiling¶
EXPLAIN ANALYZE¶
Use EXPLAIN ANALYZE to understand query execution plans and find slow queries:
docker exec mcp-context-forge-postgres-1 psql -U postgres -d mcp -c "
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT COUNT(*), AVG(response_time)
FROM tool_metrics
WHERE timestamp >= NOW() - INTERVAL '7 days';
"
Key metrics to watch:
| Metric | Good | Bad |
|---|---|---|
| Seq Scan | On small tables (<1000 rows) | On large tables |
| Index Scan | On filtered queries | Missing when expected |
| Rows Removed by Filter | High (filter matches few rows) | 0 (filter matches all rows; see example below) |
| Shared Buffers Hit | High ratio | Low ratio (disk I/O) |
Example: Detecting Non-Selective Filters
Parallel Seq Scan on tool_metrics
Filter: (timestamp >= (now() - '7 days'::interval))
Rows Removed by Filter: 0 <-- ALL rows match = index not useful
This indicates the filter matches 100% of rows, so PostgreSQL chooses a sequential scan over an index scan.
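To confirm an index can serve selective ranges, compare a narrower interval. This sketch assumes an index on timestamp exists; if the planner still chooses a seq scan for a small window, the index is missing:
docker exec mcp-context-forge-postgres-1 psql -U postgres -d mcp -c "
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT COUNT(*), AVG(response_time)
FROM tool_metrics
WHERE timestamp >= NOW() - INTERVAL '1 hour';
"
If only a small fraction of rows fall in the last hour, the plan should switch to an Index Scan.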
Table Scan Statistics¶
Monitor which tables are being scanned excessively:
docker exec mcp-context-forge-postgres-1 psql -U postgres -d mcp -c "
SELECT
relname as table_name,
pg_size_pretty(pg_total_relation_size(relid)) as total_size,
n_live_tup as live_rows,
seq_scan,
seq_tup_read,
idx_scan,
CASE WHEN seq_scan > 0 THEN seq_tup_read / seq_scan ELSE 0 END as avg_rows_per_seq_scan
FROM pg_stat_user_tables
ORDER BY seq_tup_read DESC
LIMIT 15;
"
Warning signs:
- seq_tup_read in billions = excessive full table scans
- avg_rows_per_seq_scan equals live_rows = scanning the entire table each time
- High seq_scan count with large tables = missing index or non-selective filter
Connection State Analysis¶
Check for idle-in-transaction connections (a sign of long-running requests or connection leaks):
docker exec mcp-context-forge-postgres-1 psql -U postgres -d mcp -c "
SELECT
state,
COUNT(*) as count,
MAX(EXTRACT(EPOCH FROM (NOW() - state_change)))::int as max_age_seconds
FROM pg_stat_activity
WHERE datname = 'mcp'
GROUP BY state
ORDER BY count DESC;
"
Healthy state:
state | count | max_age_seconds
--------------------+-------+-----------------
idle | 70 | 200
active | 5 | 0
idle in transaction | 3 | 1
Unhealthy state (connection exhaustion risk):
state | count | max_age_seconds
--------------------+-------+-----------------
idle in transaction | 60 | 120 <-- Problem!
idle | 38 | 500
active | 2 | 0
Finding Stuck Queries¶
docker exec mcp-context-forge-postgres-1 psql -U postgres -d mcp -c "
SELECT
pid,
state,
EXTRACT(EPOCH FROM (NOW() - state_change))::numeric(8,2) as idle_seconds,
LEFT(query, 100) as query_snippet
FROM pg_stat_activity
WHERE datname = 'mcp' AND state = 'idle in transaction'
ORDER BY state_change
LIMIT 15;
"
Reset Statistics¶
To get fresh statistics for a specific test:
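docker exec mcp-context-forge-postgres-1 psql -U postgres -d mcp -c "SELECT pg_stat_reset();"
This clears the cumulative counters (seq_scan, seq_tup_read, etc.) for the current database, so the next pg_stat_user_tables query reflects only the test window.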
Python Profiling with py-spy¶
py-spy is a sampling profiler for Python that can attach to running processes without code changes.
Installing py-spy¶
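# py-spy ships as a prebuilt wheel on PyPI
pip install py-spy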
Profiling a Running Container¶
# Find the Python process ID
docker exec mcp-context-forge-gateway-1 ps aux | grep python
# Run py-spy from host (requires root)
sudo py-spy top --pid $(docker inspect --format '{{.State.Pid}}' mcp-context-forge-gateway-1)
# Generate a flamegraph
sudo py-spy record -o profile.svg --pid $(docker inspect --format '{{.State.Pid}}' mcp-context-forge-gateway-1) --duration 30
Profiling Locally¶
# Profile the development server
py-spy top -- python -m mcpgateway
# Generate flamegraph
py-spy record -o flamegraph.svg -- python -m mcpgateway
Interpreting Flamegraphs¶
- Wide bars = functions consuming the most CPU time
- Deep stacks = many nested function calls
- Look for: Template rendering, JSON serialization, database queries
Rust MCP Runtime Profiling¶
For Rust-local profiling of the MCP runtime crate:
make -C tools_rust/mcp_runtime setup-profiling
make -C tools_rust/mcp_runtime flamegraph-test
make -C tools_rust/mcp_runtime flamegraph-test-rmcp
These targets generate flamegraph SVGs for the crate; use them to inspect Rust-internal startup and hot-path behavior in the runtime crate itself.
For live profiling of the compose-backed Rust runtime under load:
ps -eo pid,cmd | grep contextforge-mcp-runtime
sudo perf record -F 99 -g -p <pid> -- sleep 20
sudo perf report --stdio
Use live perf during a real benchmark when you want steady-state behavior. Use the crate-local flamegraph targets when you want in-process Rust visibility without the rest of the stack.
Memory Profiling with memray¶
memray is a memory profiler for Python that tracks allocations in Python code, native extension modules, and the Python interpreter itself. It's ideal for finding memory leaks, high-water marks, and allocation hotspots.
Installing memray¶
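# memray is available from PyPI (Linux and macOS only)
pip install memray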
Profiling Locally¶
# Run your application with memray tracking
memray run -o output.bin -m mcpgateway
# Or run a specific script
memray run -o output.bin script.py
Attaching to a Running Process¶
memray can attach to an already-running Python process to capture memory allocations:
# Find the Python process ID inside the container
docker exec mcp-context-forge-gateway-1 ps aux | grep python
# Attach memray to a running process (requires ptrace permissions)
# Option 1: Run memray inside the container
docker exec -it mcp-context-forge-gateway-1 memray attach <PID> -o /tmp/profile.bin
# Option 2: If using privileged container or with SYS_PTRACE capability
docker exec mcp-context-forge-gateway-1 memray attach --aggregate <PID> -o /tmp/profile.bin
# After capturing, copy the profile out
docker cp mcp-context-forge-gateway-1:/tmp/profile.bin ./profile.bin
Note: memray attach requires ptrace permissions. You may need to run the container with --cap-add=SYS_PTRACE or in privileged mode for profiling.
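For example, a one-off debug container started with ptrace enabled (a sketch; substitute your image and usual run options):
docker run --cap-add=SYS_PTRACE <image>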
Generating Reports¶
memray provides multiple output formats:
# Interactive flamegraph (opens in browser)
memray flamegraph output.bin -o flamegraph.html
# Table view (terminal-friendly)
memray table output.bin
# Tree view (call hierarchy)
memray tree output.bin
# Summary statistics
memray stats output.bin
# Top allocators at the heap high-water mark
memray summary output.bin
Live Mode (Real-time Monitoring)¶
For development, use live mode to see allocations in real-time:
# Run with live TUI
memray run --live -m mcpgateway
# Attach to running process with live mode
memray attach --live <PID>
Container Profiling Workflow¶
Complete workflow for profiling a gateway container:
# 1. Install memray in the container (if not already installed)
docker exec mcp-context-forge-gateway-1 pip install memray
# 2. Find worker PIDs
docker exec mcp-context-forge-gateway-1 ps aux | grep "mcpgateway work" | head -5
# 3. Attach to one worker (e.g., PID 123) for 60 seconds
docker exec mcp-context-forge-gateway-1 timeout 60 memray attach 123 -o /tmp/worker_profile.bin || true
# 4. Copy profile to host
docker cp mcp-context-forge-gateway-1:/tmp/worker_profile.bin ./worker_profile.bin
# 5. Generate reports
memray flamegraph worker_profile.bin -o memory_flamegraph.html
memray stats worker_profile.bin
memray table worker_profile.bin | head -50
Container Profiling Limitations:
- memray attach requires gdb or lldb, which may not be available in minimal containers
- Python version must match between memray and the target process (e.g., memray compiled for Python 3.13 won't work with Python 3.12 containers)
- Requires ptrace permissions (--cap-add=SYS_PTRACE or privileged mode)

For production containers without pip, consider:
- Building a debug image with memray pre-installed
- Using memray run locally to reproduce the issue
- Using py-spy for CPU profiling (works cross-version and is more portable)
Interpreting memray Output¶
Flamegraph:
- Width = amount of memory allocated by that call stack
- Color: Red = Python code, Green = C extensions, Blue = Python internals
- Click on frames to zoom in
Table view columns:
- Total memory = all memory allocated by this function and its callees
- Own memory = memory allocated directly by this function
- Allocations = number of allocation calls
Common patterns to look for:
- Large allocations in template rendering (Jinja2)
- JSON serialization of large datasets
- ORM model instantiation (SQLAlchemy)
- Response buffering in ASGI middleware
- Caches growing unbounded
Example high-memory patterns:
# Pattern: Large list comprehensions in API responses
mcpgateway/main.py:handle_rpc Total: 500MB Own: 450MB Allocations: 10000
# Pattern: Template rendering accumulating data
jinja2/environment.py:render Total: 200MB Own: 50MB Allocations: 5000
py-spy vs memray¶
| Aspect | py-spy | memray |
|---|---|---|
| Focus | CPU time | Memory allocation |
| Overhead | Very low (~1%) | Medium (10-30%) |
| Attach support | Yes | Yes |
| Native code | No | Yes |
| Use when | High CPU usage | OOM errors, memory leaks |
Use py-spy when CPU is the bottleneck. Use memray when memory usage is high or you're seeing OOM kills.
Container Resource Monitoring¶
Real-time Stats¶
# Watch all containers
docker stats
# Filter to specific containers
docker stats --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}" \
mcp-context-forge-gateway-1 \
mcp-context-forge-postgres-1 \
mcp-context-forge-redis-1
Snapshot Stats¶
docker stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}" \
| grep -E "gateway|postgres|redis|nginx"
Healthy resource usage:
| Container | CPU | Memory |
|---|---|---|
| gateway (each) | <400% | <4GB |
| postgres | <150% | <1GB |
| redis | <20% | <100MB |
Redis Cache Analysis¶
Check Hit Rate¶
docker exec mcp-context-forge-redis-1 redis-cli info stats | grep -E "keyspace|ops_per_sec|hits|misses"
Calculate hit rate:
docker exec mcp-context-forge-redis-1 redis-cli info stats | python3 -c "
import sys
stats = {}
for line in sys.stdin:
    if ':' in line:
        k, v = line.strip().split(':', 1)
        stats[k] = int(v) if v.isdigit() else v
hits = stats.get('keyspace_hits', 0)
misses = stats.get('keyspace_misses', 0)
total = hits + misses
hit_rate = (hits / total * 100) if total > 0 else 0
print(f'Hits: {hits}, Misses: {misses}, Hit Rate: {hit_rate:.1f}%')
"
Good hit rate: >90% for cached data
Check Key Counts¶
docker exec mcp-context-forge-redis-1 redis-cli dbsize
# List keys by pattern
docker exec mcp-context-forge-redis-1 redis-cli keys "mcpgw:*" | head -20
Tool lookup cache keys (invoke hot path):
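The exact key layout is deployment-specific; assuming tool entries share the mcpgw: prefix shown above, a non-blocking scan can surface them:
# Hypothetical pattern - confirm the real key layout in your deployment
docker exec mcp-context-forge-redis-1 redis-cli --scan --pattern "mcpgw:*tool*" | head -20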
Gateway Log Analysis¶
Check for Errors¶
docker logs mcp-context-forge-gateway-1 2>&1 | grep -iE "error|exception|timeout|warning" | tail -30
Count Error Types¶
docker logs mcp-context-forge-gateway-1 2>&1 | grep -i "error" | \
sed 's/.*\(Error[^:]*\).*/\1/' | sort | uniq -c | sort -rn | head -10
Check for Idle Transaction Timeouts¶
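If idle_in_transaction_session_timeout is set, PostgreSQL terminates offending sessions and the gateway then logs dropped-connection errors. A sketch (exact message text varies by driver version):
# PostgreSQL side: sessions killed by the timeout
docker logs mcp-context-forge-postgres-1 2>&1 | grep -i "idle-in-transaction timeout" | tail -10
# Gateway side: the resulting disconnect errors
docker logs mcp-context-forge-gateway-1 2>&1 | grep -iE "server closed the connection|OperationalError" | tail -10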
Complete Profiling Session Example¶
Here's a workflow for diagnosing performance issues under load:
# 1. Reset PostgreSQL statistics
docker exec mcp-context-forge-postgres-1 psql -U postgres -d mcp -c "SELECT pg_stat_reset();"
# 2. Start load test
make load-test-ui
# Configure 3000 users in browser, start test
# 3. Take samples every 30 seconds
for i in {1..5}; do
echo "=== SAMPLE $i ==="
# Locust stats
curl -s http://localhost:8089/stats/requests | python3 -c "
import sys, json
d = json.load(sys.stdin)
admin = next((s for s in d.get('stats', []) if s.get('name') == '/admin/'), {})
print(f\"RPS: {d.get('total_rps', 0):.0f}, /admin/ avg: {admin.get('avg_response_time', 0):.0f}ms\")
"
# Connection states
docker exec mcp-context-forge-postgres-1 psql -U postgres -d mcp -c "
SELECT state, COUNT(*) FROM pg_stat_activity WHERE datname='mcp' GROUP BY state;
"
# Container CPU
docker stats --no-stream --format "{{.Name}}: {{.CPUPerc}}" | grep gateway
sleep 30
done
# 4. Final analysis
docker exec mcp-context-forge-postgres-1 psql -U postgres -d mcp -c "
SELECT relname, seq_scan, seq_tup_read, idx_scan
FROM pg_stat_user_tables
ORDER BY seq_tup_read DESC LIMIT 10;
"
MCP Protocol Profiling¶
The MCP Streamable HTTP transport (/servers/{id}/mcp) has a different performance profile from the REST API (/rpc). Use the dedicated MCP load test to isolate protocol overhead.
MCP vs REST: Quick Comparison¶
# Run MCP-only load test
make load-test-mcp-protocol
# Compare with general load test (includes REST + admin)
make load-test-cli
The MCP path processes requests through the MCP SDK session manager, which adds JSON-RPC parsing, context variable management, and per-request auth/RBAC database queries. The /rpc endpoint uses Redis-backed caching for tool lookups and auth, which the MCP transport path does not fully leverage. Under load, this manifests as significantly higher PgBouncer and PostgreSQL CPU on MCP workloads vs REST workloads for the same RPS.
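A quick way to confirm this is to watch Redis throughput during each test; near-zero ops during the MCP run means the cache layer is being bypassed:
# Sample the Redis command rate while a load test is running
docker exec mcp-context-forge-redis-1 redis-cli info stats | grep instantaneous_ops_per_sec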
Bottleneck Triage Table¶
Use this table to identify which layer is the bottleneck:
| Symptom | Bottleneck Layer | Investigation |
|---|---|---|
| Gateway CPU >300% per replica | Gateway compute (middleware, MCP SDK) | py-spy flamegraph on a gateway worker |
| PgBouncer CPU >80% | Database connection pressure | Check pg_stat_activity, reduce DB queries per request |
| PostgreSQL CPU >100% | Query overhead (seq scans, RBAC lookups) | pg_stat_user_tables for seq scan counts |
| Redis CPU <1% during MCP load | MCP path not using Redis cache | Compare with /rpc which does use Redis |
| Upstream MCP server CPU high | Tool execution overhead | Profile the upstream MCP server separately |
| nginx CPU >30% | Proxy overhead (unlikely) | Check keepalive, connection reuse |
| High p99 but low p50 | Tail latency from GC or lock contention | py-spy dump to check for blocked threads |
MCP Profiling Session¶
# 1. Reset DB stats
docker exec mcp-context-forge-postgres-1 psql -U postgres -d mcp -c "SELECT pg_stat_reset();"
# 2. Start MCP-specific load test in background
make load-test-mcp-protocol &
# 3. Capture container stats during load
docker stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}" \
| grep -E "gateway|postgres|redis|pgbouncer|nginx"
# 4. py-spy flamegraph on a gateway worker
WORKER_PID=$(docker top mcp-context-forge-gateway-1 | grep worker | head -1 | awk '{print $2}')
sudo py-spy record -o mcp_flamegraph.svg --pid $WORKER_PID -d 15 --subprocesses
# 5. Check which DB tables are hammered
docker exec mcp-context-forge-postgres-1 psql -U postgres -d mcp -c "
SELECT relname, seq_scan, seq_tup_read, idx_scan
FROM pg_stat_user_tables WHERE seq_tup_read > 0
ORDER BY seq_tup_read DESC LIMIT 10;"
# 6. Check Redis cache utilization
docker exec mcp-context-forge-redis-1 redis-cli info stats | grep -E "hits|misses"
What to look for:
- If PgBouncer/PostgreSQL CPU is high but Redis is idle, the MCP path is bypassing the cache layer.
- If gateway CPU is the constraint, look at middleware overhead (auth, RBAC, validation) in the flamegraph.
- If upstream MCP server CPU is the constraint, the bottleneck is in tool execution, not the gateway.
Common Performance Issues¶
Issue: High Sequential Scan Count¶
Symptom: seq_tup_read in billions
Causes:
- Missing index
- Non-selective filter (e.g., 7-day filter matches all recent data)
- Short cache TTL causing repeated queries
Solutions:
- Add a covering index (see the sketch after this list)
- Increase cache TTL
- Add materialized view for aggregations
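For the tool_metrics example above, a covering index sketch (the index name is illustrative; verify column names against your schema, and note that INCLUDE requires PostgreSQL 11+):
docker exec mcp-context-forge-postgres-1 psql -U postgres -d mcp -c "
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_tool_metrics_timestamp
ON tool_metrics (timestamp) INCLUDE (response_time);
"
CONCURRENTLY avoids blocking writes while the index builds.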
Issue: Many Idle-in-Transaction Connections¶
Symptom: 50+ connections in idle in transaction state
Causes:
- N+1 query patterns
- Long-running requests holding transactions
- Missing connection pool limits
Solutions:
- Use batch queries instead of loops
- Set idle_in_transaction_session_timeout (see the example after this list)
- Optimize slow queries
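A sketch of the timeout setting (the value is illustrative; it applies to sessions opened after the change):
docker exec mcp-context-forge-postgres-1 psql -U postgres -d mcp -c "
ALTER DATABASE mcp SET idle_in_transaction_session_timeout = '60s';
"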
Issue: Health Check Endpoints Holding PgBouncer Connections¶
Symptom: SELECT 1 queries stuck in idle in transaction state for minutes
SELECT left(query, 50), count(*), avg(EXTRACT(EPOCH FROM (NOW() - state_change)))::int as avg_age
FROM pg_stat_activity
WHERE state = 'idle in transaction' AND datname = 'mcp'
GROUP BY left(query, 50);
query | count | avg_age
----------------------+-------+---------
SELECT 1 | 45 | 139
Causes:
- PgBouncer in transaction mode holds backend connections until COMMIT/ROLLBACK
- Health endpoints using Depends(get_db) rely on dependency cleanup, which may not execute on timeout/cancellation
- async def endpoints calling blocking SQLAlchemy code on the event loop thread
- Cross-thread session usage when mixing asyncio.to_thread with Depends(get_db)
Solutions:
- Use dedicated sessions instead of Depends(get_db). Health endpoints should create and manage their own sessions to avoid double-commit and cross-thread issues:
@app.get("/health")
def healthcheck(): # Sync function - FastAPI runs in threadpool
"""Health check with dedicated session."""
db = SessionLocal()
try:
db.execute(text("SELECT 1"))
db.commit() # Explicitly release PgBouncer connection
return {"status": "healthy"}
except Exception as e:
try:
db.rollback()
except Exception:
try:
db.invalidate() # Remove broken connection from pool
except Exception:
pass
return {"status": "unhealthy", "error": str(e)}
finally:
db.close()
- Use sync functions for simple blocking operations. FastAPI automatically runs def (sync) route handlers in a threadpool:
# BAD: async def with blocking calls stalls the event loop
@app.get("/health")
async def healthcheck():
    db.execute(text("SELECT 1"))  # Blocks event loop!

# GOOD: sync def runs in a threadpool automatically
@app.get("/health")
def healthcheck():
    db.execute(text("SELECT 1"))  # Runs in threadpool
- For async endpoints, create sessions inside asyncio.to_thread. All DB operations must happen in the same thread:
@app.get("/ready")
async def readiness_check():
def _check_db() -> str | None:
# Session created IN the worker thread
db = SessionLocal()
try:
db.execute(text("SELECT 1"))
db.commit()
return None
except Exception as e:
try:
db.rollback()
except Exception:
try:
db.invalidate()
except Exception:
pass
return str(e)
finally:
db.close()
error = await asyncio.to_thread(_check_db)
if error:
return {"status": "not ready", "error": error}
return {"status": "ready"}
- Mirror the get_db cleanup pattern (rollback → invalidate → close):
except Exception as e:
    try:
        db.rollback()
    except Exception:
        try:
            db.invalidate()  # Remove broken connection from pool
        except Exception:
            pass  # nosec B110 - Best effort cleanup
Why not use Depends(get_db)?
- get_db commits after yield, causing a double commit if the endpoint also commits
- With asyncio.to_thread, the session is created in one thread but used in another
- Health endpoints should test actual DB connectivity, not be mockable via dependency_overrides
Issue: High Gateway CPU¶
Symptom: Gateway at 600%+ CPU
Causes:
- Template rendering overhead (admin UI)
- JSON serialization of large responses
- Pydantic validation overhead
- Middleware overhead from enabled-but-unused features
- Too many gunicorn workers causing context switching
Solutions:
- Enable response caching
- Paginate large result sets
- Use orjson for serialization (enabled by default)
- Disable unused features (A2A, catalog, LLM chat, admin UI) - see "disable unused features" in the tuning guide
- Tune GUNICORN_WORKERS to match CPU cores, not exceed them (see the sketch below)
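For example (a sketch; the right count depends on workload and whether worker threads are also configured):
# Match worker count to the cores available to the container
GUNICORN_WORKERS=$(nproc)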
See Also¶
- Gateway Tuning Guide - Environment variables, session pool, connection pool tuning
- Database Performance Guide - N+1 detection and query logging
- Performance Architecture - MCP request path, caching layers, scaling capacity
- Performance Testing - Load testing with hey
- Scaling Guide - Production scaling configuration