ADR-033: Tool Lookup Cache for invoke_tool
- Status: Accepted
- Date: 2025-01-20
- Deciders: Platform Team
Context
Load testing exposed a hot-path bottleneck in ToolService.invoke_tool: every tool invocation performs a DB lookup for the tool (and its gateway), even when the same tool is invoked repeatedly. This created:
- High database QPS proportional to tool invocations
- Connection pool saturation during slow upstream calls
- Elevated p95/p99 latency on high-concurrency tests
Existing registry/admin caches do not cover single-tool lookups, and invoke_tool cannot reuse registry list caches.
Related issues: #1940 (Tool lookup caching)
Decision
Introduce a two-tier tool lookup cache keyed by tool name, with:
- L1 in-memory LRU + TTL per worker
- Optional Redis L2 for multi-worker deployments
- Negative caching for missing/inactive/offline tools
- Explicit invalidation on tool and gateway mutations
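To make the L1 tier concrete, here is a minimal sketch of an in-memory LRU + TTL cache that also holds negative entries. It is an illustration under stated assumptions, not the code in mcpgateway/cache/tool_lookup_cache.py: the class name `LruTtlCache`, its methods, and the `status` field in the payload are all hypothetical.

```python
import time
from collections import OrderedDict
from typing import Any, Callable, Optional, Tuple


class LruTtlCache:
    """Hypothetical per-worker L1 cache: LRU eviction plus per-entry TTL."""

    def __init__(self, maxsize: int = 10000,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._maxsize = maxsize
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: "OrderedDict[str, Tuple[float, Any]]" = OrderedDict()

    def get(self, key: str) -> Optional[Any]:
        item = self._entries.get(key)
        if item is None:
            return None
        expires_at, value = item
        if self._clock() >= expires_at:
            # Expired: drop the entry and report a miss.
            del self._entries[key]
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return value

    def set(self, key: str, value: Any, ttl: float) -> None:
        self._entries[key] = (self._clock() + ttl, value)
        self._entries.move_to_end(key)
        while len(self._entries) > self._maxsize:
            self._entries.popitem(last=False)  # evict least recently used

    def invalidate(self, key: str) -> None:
        self._entries.pop(key, None)


cache = LruTtlCache(maxsize=10000)
# Positive entry with the regular TTL (assumed payload shape).
cache.set("my_tool", {"status": "active", "gateway_id": "gw1"}, ttl=60)
# Negative entry with the shorter TTL, so a false negative cannot live long.
cache.set("ghost_tool", {"status": "missing"}, ttl=10)
```

Storing negative results as ordinary entries with a shorter TTL keeps the lookup path uniform: the caller inspects the cached status instead of consulting a second structure.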
Changes Made
- New module: mcpgateway/cache/tool_lookup_cache.py
    - L1 LRU + TTL cache with size limit
    - Redis L2 cache with shared keyspace when CACHE_TYPE=redis
    - Negative cache entries (missing, inactive, offline)
    - Gateway-scoped invalidation using a Redis set of tool names
- Invoke path integration
    - ToolService.invoke_tool() now checks the cache before querying the DB
    - Cache payload includes the tool + gateway fields needed for invocation
    - Negative cache entries short-circuit missing/inactive/offline tool calls
- Cache invalidation
    - Tool create/update/delete/toggle invalidates the tool lookup cache
    - Gateway update/toggle/delete invalidates all tools for that gateway
- Configuration (Tool Lookup Cache)
    - TOOL_LOOKUP_CACHE_ENABLED (default: true)
    - TOOL_LOOKUP_CACHE_TTL_SECONDS (default: 60)
    - TOOL_LOOKUP_CACHE_NEGATIVE_TTL_SECONDS (default: 10)
    - TOOL_LOOKUP_CACHE_L1_MAXSIZE (default: 10000)
    - TOOL_LOOKUP_CACHE_L2_ENABLED (default: true, only when CACHE_TYPE=redis)
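As an illustration of how these settings combine, the sketch below reads the five variables with the defaults listed above. The `ToolLookupCacheSettings` dataclass and `load_settings` helper are hypothetical names, not the project's actual settings loader. Treating an unset CACHE_TYPE as "not Redis" is likewise an assumption made for the example.

```python
import os
from dataclasses import dataclass
from typing import Mapping


def _as_bool(raw: str) -> bool:
    """Parse common truthy spellings of an environment variable."""
    return raw.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class ToolLookupCacheSettings:
    enabled: bool
    ttl_seconds: int
    negative_ttl_seconds: int
    l1_maxsize: int
    l2_enabled: bool


def load_settings(env: Mapping[str, str] = os.environ) -> ToolLookupCacheSettings:
    """Build settings from an environment-like mapping (hypothetical helper)."""
    cache_type = env.get("CACHE_TYPE", "")  # assumption: unset means not Redis
    l2_flag = _as_bool(env.get("TOOL_LOOKUP_CACHE_L2_ENABLED", "true"))
    return ToolLookupCacheSettings(
        enabled=_as_bool(env.get("TOOL_LOOKUP_CACHE_ENABLED", "true")),
        ttl_seconds=int(env.get("TOOL_LOOKUP_CACHE_TTL_SECONDS", "60")),
        negative_ttl_seconds=int(
            env.get("TOOL_LOOKUP_CACHE_NEGATIVE_TTL_SECONDS", "10")
        ),
        l1_maxsize=int(env.get("TOOL_LOOKUP_CACHE_L1_MAXSIZE", "10000")),
        # L2 only takes effect when the cache backend is Redis.
        l2_enabled=l2_flag and cache_type == "redis",
    )


settings = load_settings({"CACHE_TYPE": "redis"})
```

Passing the environment as a mapping keeps the helper easy to exercise without mutating the real process environment.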
Cache Key Scheme
{prefix}tool_lookup:{tool_name} → tool + gateway payload
{prefix}tool_lookup:gateway:{gateway_id} → set of tool names (for invalidation)
Default prefix: mcpgw: → mcpgw:tool_lookup:my_tool
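The key scheme and the gateway-scoped invalidation it enables can be sketched as follows. To stay runnable without a Redis server, the example uses a small in-memory stand-in for Redis string and set keys. `InMemoryKeyspace`, `tool_key`, and `gateway_key` are hypothetical names chosen for illustration, not the module's actual API.

```python
from typing import Dict, Optional, Set

PREFIX = "mcpgw:"  # default prefix stated in this ADR


def tool_key(tool_name: str, prefix: str = PREFIX) -> str:
    return f"{prefix}tool_lookup:{tool_name}"


def gateway_key(gateway_id: str, prefix: str = PREFIX) -> str:
    return f"{prefix}tool_lookup:gateway:{gateway_id}"


class InMemoryKeyspace:
    """Stand-in for Redis string and set keys; not a real Redis client."""

    def __init__(self) -> None:
        self.values: Dict[str, dict] = {}
        self.sets: Dict[str, Set[str]] = {}

    def store_tool(self, tool_name: str, gateway_id: Optional[str],
                   payload: dict) -> None:
        self.values[tool_key(tool_name)] = payload
        if gateway_id is not None:
            # Track which tool names belong to the gateway for later invalidation.
            self.sets.setdefault(gateway_key(gateway_id), set()).add(tool_name)

    def invalidate_gateway(self, gateway_id: str) -> int:
        """Delete every cached tool for one gateway; return how many were tracked."""
        names = self.sets.pop(gateway_key(gateway_id), set())
        for name in names:
            self.values.pop(tool_key(name), None)
        return len(names)


keyspace = InMemoryKeyspace()
keyspace.store_tool("my_tool", "gw1", {"status": "active"})
keyspace.store_tool("other_tool", "gw2", {"status": "active"})
removed = keyspace.invalidate_gateway("gw1")  # drops only gw1's tools
```

Keeping a per-gateway set of tool names means a gateway mutation can remove exactly the affected entries without scanning the whole keyspace.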
Performance Optimizations
Before (Baseline)
| Metric | Value |
|---|---|
| DB lookups per tool invocation | 1 |
| Invoke latency | DB-bound |
After (With Caching)
| Metric | Cache Hit | Cache Miss |
|---|---|---|
| DB lookups per tool invocation | 0 | 1 |
| Invoke latency | ~1-3ms | DB-bound |
Expected improvement: 80-95% reduction in DB traffic for repeated tool invocations.
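The relationship between hit rate and DB traffic can be checked with a small cache-aside sketch that counts DB lookups. If a tool costs one miss followed by N-1 hits within a TTL window, the reduction in DB lookups is 1 - 1/N, so 80% corresponds to N = 5 and 95% to N = 20. The `CountingLookup` class and `db_reduction` helper are hypothetical. The cache here has no TTL or eviction, to keep the arithmetic visible.

```python
from typing import Callable, Dict, Optional


class CountingLookup:
    """Hypothetical cache-aside wrapper that counts DB lookups."""

    def __init__(self, db_fetch: Callable[[str], Optional[dict]]) -> None:
        self._db_fetch = db_fetch
        self._cache: Dict[str, dict] = {}  # no TTL or eviction, for brevity
        self.db_lookups = 0

    def lookup(self, tool_name: str) -> Optional[dict]:
        entry = self._cache.get(tool_name)
        if entry is None:
            # Cache miss: exactly one DB lookup, then remember the result.
            self.db_lookups += 1
            row = self._db_fetch(tool_name)
            # Missing tools are cached too (negative entry), so repeat
            # calls for an unknown name also skip the DB.
            if row is not None:
                entry = {"status": "active", "tool": row}
            else:
                entry = {"status": "missing"}
            self._cache[tool_name] = entry
        return entry.get("tool")


def db_reduction(invocations: int, db_lookups: int) -> float:
    """Fraction of DB lookups avoided relative to one lookup per invocation."""
    return 1.0 - db_lookups / invocations


# A dict stands in for the database table of tools.
tools = {"my_tool": {"name": "my_tool"}}
service = CountingLookup(tools.get)
for _ in range(10):
    service.lookup("my_tool")
reduction = db_reduction(10, service.db_lookups)
```

With ten invocations of the same tool, only the first reaches the stand-in database, which matches the 0-versus-1 lookup split in the table above.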
Consequences
Positive
- Removes hot-path DB lookups for repeat tool invocations
- Reduces connection pool pressure under high concurrency
- Redis L2 provides cross-worker cache reuse
Negative
- Cache staleness window (up to TTL) after tool/gateway updates
- Additional memory use in each worker for L1 cache
- Requires careful invalidation on tool/gateway mutations
Neutral
- L2 is optional and only enabled when CACHE_TYPE=redis
- Negative cache TTL is short to avoid long-lived false negatives
- No schema changes required
Alternatives Considered
| Option | Why Not |
|---|---|
| Rely on registry list cache | Not usable for single tool lookups |
| Cache at router layer only | Still requires DB lookup per invocation |
| Longer TTLs | Too stale for active tool updates |
| Materialized view | Overkill and DB-specific |
Compatibility Notes
- Enabled by default; can be disabled by setting TOOL_LOOKUP_CACHE_ENABLED=false
- Works with existing cache backend configuration
- No API or schema changes required
References
- GitHub Issue #1940: Tool lookup caching
- ADR-007: Pluggable cache backend
- ADR-029: Registry and Admin Stats Caching
- mcpgateway/cache/tool_lookup_cache.py - Implementation
- mcpgateway/services/tool_service.py - Integration
Status
Implemented and enabled by default. Monitor Redis keyspace for tool_lookup keys.