CPU Spin Loop Mitigation GuideΒΆ
This guide documents the CPU spin loop issue affecting MCP Gateway and the multi-layered mitigation strategy implemented to address it.
OverviewΒΆ
Issue: #2360 - Gunicorn workers consume 100% CPU after load tests Root Cause: anyio#695 - _deliver_cancellation infinite loop Affected Versions: All versions using anyio with MCP SDK Status: Mitigated via configuration; awaiting upstream fix
Problem DescriptionΒΆ
Under certain conditions (typically after high-load benchmarks or sustained traffic), Gunicorn workers can enter a state where they consume 90-100% CPU while appearing idle. This happens when:
- An SSE/MCP connection is cancelled (client disconnect, timeout, etc.)
- Internal tasks spawned by the MCP SDK don't respond to
CancelledError - anyio's
_deliver_cancellationmethod enters an infinite loop trying to cancel these tasks - The loop calls
call_soon()repeatedly, scheduling callbacks that never complete
SymptomsΒΆ
- Workers at 90-100% CPU with no active requests
py-spyshows spinning inanyio/_backends/_asyncio.py:569-580straceshows rapidepoll_waitwith 0ms timeout- Issue persists until worker is recycled or restarted
Root Cause AnalysisΒΆ
The _deliver_cancellation method in anyio's CancelScope has no iteration limit:
# anyio/_backends/_asyncio.py (simplified)
def _deliver_cancellation(self, origin: CancelScope) -> bool:
# ... check if task should be cancelled ...
if should_retry:
self._host_task._task_state.cancel_scope._cancel_handle = (
get_running_loop().call_soon(
self._deliver_cancellation, origin # Recursive scheduling
)
)
return False
When tasks don't properly handle cancellation (e.g., MCP SDK's post_writer waiting on MemoryObjectSendStream), this creates an infinite loop.
Mitigation StrategyΒΆ
We implement a defense-in-depth approach with three layers:
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Layer 1: Prevention β
β Detect and close dead connections early β
β SSE_SEND_TIMEOUT, SSE_RAPID_YIELD_WINDOW_MS/MAX β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Layer 2: Containment β
β Limit cleanup wait time for stuck tasks β
β MCP_SESSION_POOL_CLEANUP_TIMEOUT, SSE_TASK_GROUP_... β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Layer 3: Last Resort (Experimental) β
β Force-terminate infinite cancellation loops β
β ANYIO_CANCEL_DELIVERY_PATCH_ENABLED/MAX_ITERATIONS β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Layer 4: Recovery β
β Worker recycling cleans up orphaned tasks β
β GUNICORN_MAX_REQUESTS, GUNICORN_MAX_REQUESTS_JITTER β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Layer 1: SSE Connection ProtectionΒΆ
Detect and close dead SSE connections before they can trigger spin loops.
Configuration VariablesΒΆ
| Variable | Default | Description |
|---|---|---|
SSE_SEND_TIMEOUT | 30.0 | Timeout in seconds for ASGI send() calls. Protects against connections that accept data but never complete. Set to 0 to disable. |
SSE_RAPID_YIELD_WINDOW_MS | 1000 | Time window (milliseconds) for detecting rapid yield patterns that indicate a dead client. |
SSE_RAPID_YIELD_MAX | 50 | Maximum number of yields allowed within the detection window. If exceeded, connection is closed. Set to 0 to disable. |
How It WorksΒΆ
When an SSE client disconnects without properly closing the connection, the server may continue trying to send data. The rapid yield detection monitors how often the event loop yields during SSE operations:
# Simplified detection logic
if yields_in_window > SSE_RAPID_YIELD_MAX:
logger.warning("Client appears disconnected, closing SSE connection")
raise ClientDisconnected()
Recommended SettingsΒΆ
# Default (balanced)
SSE_SEND_TIMEOUT=30.0
SSE_RAPID_YIELD_WINDOW_MS=1000
SSE_RAPID_YIELD_MAX=50
# Aggressive (faster detection, may have false positives on slow networks)
SSE_SEND_TIMEOUT=10.0
SSE_RAPID_YIELD_WINDOW_MS=500
SSE_RAPID_YIELD_MAX=25
Layer 2: Cleanup TimeoutsΒΆ
Limit how long the gateway waits for stuck tasks during connection cleanup.
Configuration VariablesΒΆ
| Variable | Default | Description |
|---|---|---|
MCP_SESSION_POOL_CLEANUP_TIMEOUT | 5.0 | Timeout in seconds for session.__aexit__() when closing pooled MCP sessions. |
SSE_TASK_GROUP_CLEANUP_TIMEOUT | 5.0 | Timeout in seconds for SSE task group cleanup when connections are cancelled. |
How It WorksΒΆ
Instead of waiting indefinitely for tasks to respond to cancellation:
# Before (could spin forever)
await pooled.session.__aexit__(None, None, None)
# After (bounded wait)
with anyio.move_on_after(cleanup_timeout) as scope:
await pooled.session.__aexit__(None, None, None)
if scope.cancelled_caught:
logger.warning("Session cleanup timed out after %s seconds", cleanup_timeout)
Recommended SettingsΒΆ
# Default (reliable cleanup)
MCP_SESSION_POOL_CLEANUP_TIMEOUT=5.0
SSE_TASK_GROUP_CLEANUP_TIMEOUT=5.0
# Aggressive (faster recovery from spin loops)
MCP_SESSION_POOL_CLEANUP_TIMEOUT=0.5
SSE_TASK_GROUP_CLEANUP_TIMEOUT=0.5
# Conservative (for slow MCP servers)
MCP_SESSION_POOL_CLEANUP_TIMEOUT=10.0
SSE_TASK_GROUP_CLEANUP_TIMEOUT=10.0
Trade-offsΒΆ
| Setting | Pros | Cons |
|---|---|---|
| Low (0.5-2s) | Fast spin loop recovery | May interrupt legitimate cleanup |
| Default (5s) | Balanced | Spin loops persist for 5s before timeout |
| High (10s+) | Reliable cleanup | Longer CPU waste during spin loops |
Layer 3: anyio Monkey-Patch (Experimental)ΒΆ
Last resort: directly patch anyio to limit _deliver_cancellation iterations.
Configuration VariablesΒΆ
| Variable | Default | Description |
|---|---|---|
ANYIO_CANCEL_DELIVERY_PATCH_ENABLED | false | Enable the monkey-patch. Only enable if Layers 1-2 don't resolve the issue. |
ANYIO_CANCEL_DELIVERY_MAX_ITERATIONS | 100 | Maximum iterations before forcing termination of the cancellation loop. |
How It WorksΒΆ
The patch wraps anyio's _deliver_cancellation method to add an iteration counter:
def _patched_deliver_cancellation(self, origin):
if not hasattr(origin, "_delivery_iterations"):
origin._delivery_iterations = 0
origin._delivery_iterations += 1
if origin._delivery_iterations > max_iterations:
logger.warning("anyio cancel delivery exceeded %d iterations, forcing termination", max_iterations)
# Clear the cancel handle to break the loop
if hasattr(self, "_cancel_handle") and self._cancel_handle is not None:
self._cancel_handle = None
return False
return original_deliver_cancellation(self, origin)
WarningsΒΆ
Experimental Feature
This patch modifies internal anyio behavior and may:
- Leave some tasks uncancelled (usually harmless, cleaned up by worker recycling)
- Be removed when anyio or MCP SDK fix the upstream issue
- Have unknown interactions with future anyio versions
Recommended SettingsΒΆ
# Disabled by default (use Layers 1-2 first)
ANYIO_CANCEL_DELIVERY_PATCH_ENABLED=false
# If needed, start conservative
ANYIO_CANCEL_DELIVERY_PATCH_ENABLED=true
ANYIO_CANCEL_DELIVERY_MAX_ITERATIONS=100
# More aggressive (faster termination, more orphaned tasks)
ANYIO_CANCEL_DELIVERY_PATCH_ENABLED=true
ANYIO_CANCEL_DELIVERY_MAX_ITERATIONS=50
Layer 4: Worker RecyclingΒΆ
Gunicorn worker recycling provides a safety net for any orphaned tasks.
Configuration VariablesΒΆ
| Variable | Default | Description |
|---|---|---|
GUNICORN_MAX_REQUESTS | 5000 | Recycle workers after this many requests. |
GUNICORN_MAX_REQUESTS_JITTER | 500 | Random jitter to prevent thundering herd. |
Recommended SettingsΒΆ
# Production (regular recycling)
GUNICORN_MAX_REQUESTS=5000
GUNICORN_MAX_REQUESTS_JITTER=500
# High-traffic (more frequent recycling)
GUNICORN_MAX_REQUESTS=2000
GUNICORN_MAX_REQUESTS_JITTER=200
Complete Configuration ExamplesΒΆ
Balanced (Recommended for Production)ΒΆ
# Layer 1: SSE Protection
SSE_SEND_TIMEOUT=30.0
SSE_RAPID_YIELD_WINDOW_MS=1000
SSE_RAPID_YIELD_MAX=50
# Layer 2: Cleanup Timeouts
MCP_SESSION_POOL_CLEANUP_TIMEOUT=5.0
SSE_TASK_GROUP_CLEANUP_TIMEOUT=5.0
# Layer 3: Disabled (use only if needed)
ANYIO_CANCEL_DELIVERY_PATCH_ENABLED=false
# Layer 4: Worker Recycling
GUNICORN_MAX_REQUESTS=5000
GUNICORN_MAX_REQUESTS_JITTER=500
Aggressive (For Known Spin Loop Issues)ΒΆ
# Layer 1: Faster detection
SSE_SEND_TIMEOUT=10.0
SSE_RAPID_YIELD_WINDOW_MS=500
SSE_RAPID_YIELD_MAX=25
# Layer 2: Short timeouts
MCP_SESSION_POOL_CLEANUP_TIMEOUT=0.5
SSE_TASK_GROUP_CLEANUP_TIMEOUT=0.5
# Layer 3: Enabled
ANYIO_CANCEL_DELIVERY_PATCH_ENABLED=true
ANYIO_CANCEL_DELIVERY_MAX_ITERATIONS=50
# Layer 4: Frequent recycling
GUNICORN_MAX_REQUESTS=2000
GUNICORN_MAX_REQUESTS_JITTER=200
Conservative (For Slow Networks/Servers)ΒΆ
# Layer 1: Generous timeouts
SSE_SEND_TIMEOUT=60.0
SSE_RAPID_YIELD_WINDOW_MS=2000
SSE_RAPID_YIELD_MAX=100
# Layer 2: Allow time for cleanup
MCP_SESSION_POOL_CLEANUP_TIMEOUT=10.0
SSE_TASK_GROUP_CLEANUP_TIMEOUT=10.0
# Layer 3: Disabled
ANYIO_CANCEL_DELIVERY_PATCH_ENABLED=false
# Layer 4: Standard recycling
GUNICORN_MAX_REQUESTS=5000
GUNICORN_MAX_REQUESTS_JITTER=500
Diagnosing Spin LoopsΒΆ
Using py-spyΒΆ
# Attach to spinning worker
py-spy top --pid <worker_pid>
# Look for high CPU in:
# - anyio/_backends/_asyncio.py:569-580 (_deliver_cancellation)
# - anyio/_backends/_asyncio.py:CancelScope._deliver_cancellation
Using straceΒΆ
# Attach to spinning worker
strace -p <worker_pid> -e epoll_wait
# Spin loop signature: rapid epoll_wait with 0ms timeout
# epoll_wait(5, [], 1024, 0) = 0
# epoll_wait(5, [], 1024, 0) = 0
# epoll_wait(5, [], 1024, 0) = 0
Using top/htopΒΆ
# Look for workers at 90-100% CPU with no active requests
top -p $(pgrep -d, -f "gunicorn.*mcpgateway")
Files ChangedΒΆ
The following files were modified to implement this mitigation:
Core ImplementationΒΆ
| File | Changes |
|---|---|
mcpgateway/config.py | Added configuration variables for all layers |
mcpgateway/transports/sse_transport.py | Added anyio monkey-patch (Layer 3) |
mcpgateway/services/mcp_session_pool.py | Added cleanup timeout (Layer 2) |
mcpgateway/translate.py | Added cleanup timeout for streamable HTTP |
Configuration FilesΒΆ
| File | Changes |
|---|---|
.env.example | Documented all mitigation variables |
docker-compose.yml | Added mitigation section with all variables |
charts/mcp-stack/values.yaml | Added mitigation section with all variables |
README.md | Added "CPU Spin Loop Mitigation" section |
Related Pull RequestsΒΆ
- PR #XXXX: Initial cleanup timeout implementation
- PR #XXXX: Added SSE rapid yield detection
- PR #XXXX: Added optional anyio monkey-patch
- PR #XXXX: Consolidated documentation
Future WorkΒΆ
- Upstream Fix: Monitor anyio#695 for resolution
- MCP SDK: Consider contributing fix to MCP SDK for proper task cancellation handling
- Remove Workarounds: Once upstream is fixed, remove Layer 3 monkey-patch
ReferencesΒΆ
- GitHub Issue #2360 - Original bug report
- anyio Issue #695 - Upstream issue
- anyio CancelScope source - Where the spin occurs