
ADR-029: Registry and Admin Stats Caching

  • Status: Accepted
  • Date: 2025-01-15
  • Deciders: Platform Team

Context

Under high-concurrency load testing, two additional performance bottlenecks were identified beyond authentication (addressed in ADR-028):

  1. Registry List Endpoints: Tools, prompts, resources, agents, servers, and gateways list endpoints each query the database on every request. With pagination, filtering, and team-based access control, these queries became expensive under load.

  2. Admin Dashboard Stats: The admin dashboard aggregates statistics from multiple tables (tools, prompts, resources, servers, users, teams) with expensive COUNT queries executed on every page load.

At 1000+ concurrent users:

  • Registry list endpoints: ~50-100ms per request due to complex JOIN queries
  • Admin stats: ~200-500ms per request aggregating across tables
  • N+1 query patterns in team name resolution

Related issues: #1680 (Distributed Registry & Admin Stats Caching)

Decision

Implement distributed caching for registry list endpoints and admin dashboard statistics, following the same hybrid Redis + in-memory pattern established in ADR-028.
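The hybrid pattern can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual RegistryCache or AdminStatsCache code: the class name `HybridCache`, its methods, and the injectable `clock` are hypothetical. The Redis client is assumed to be any object exposing `get(key)` and `setex(key, ttl, value)`, as redis-py clients do.

```python
import json
import time
from typing import Any, Callable, Optional


class HybridCache:
    """Hypothetical Redis-primary cache with a per-process in-memory fallback."""

    def __init__(self, redis_client: Optional[Any] = None,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._redis = redis_client  # assumed to offer get() and setex()
        self._clock = clock
        self._local: dict[str, tuple[float, str]] = {}  # key -> (expires_at, payload)

    def set(self, key: str, value: Any, ttl: int) -> None:
        payload = json.dumps(value)
        if self._redis is not None:
            try:
                self._redis.setex(key, ttl, payload)
                return
            except Exception:
                pass  # Redis unreachable: keep a local copy instead
        self._local[key] = (self._clock() + ttl, payload)

    def get(self, key: str) -> Optional[Any]:
        if self._redis is not None:
            try:
                raw = self._redis.get(key)
                # A healthy Redis is authoritative, even when it reports a miss.
                return None if raw is None else json.loads(raw)
            except Exception:
                pass  # Redis unreachable: fall back to the in-memory copy
        entry = self._local.get(key)
        if entry is None:
            return None
        expires_at, payload = entry
        if self._clock() >= expires_at:
            del self._local[key]  # expired entry
            return None
        return json.loads(payload)
```

In this sketch the in-memory store is only consulted when Redis is absent or raises, so a worker never serves a local entry that a healthy Redis has already expired or dropped.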

Changes Made

  1. New module: mcpgateway/cache/registry_cache.py
     • RegistryCache class with Redis primary + in-memory fallback
     • Per-entity-type TTL configuration (tools, prompts, resources, agents, servers, gateways)
     • Filter-aware cache keys (tags, include_inactive, pagination cursor)
     • Automatic invalidation on CRUD operations

  2. New module: mcpgateway/cache/admin_stats_cache.py
     • AdminStatsCache class with Redis primary + in-memory fallback
     • Separate TTLs for system stats, observability, users, and teams
     • Cached versions of expensive aggregate queries

  3. N+1 Query Fixes
     • prompt_service.py:list_prompts() - Batch team name fetching
     • resource_service.py:list_resources() - Batch team name fetching
     • Single query fetches all team names for a page of results

  4. Cache Integration in Services
     • tool_service.py:list_tools() - Cache first page results
     • prompt_service.py:list_prompts() - Cache first page results
     • resource_service.py:list_resources() - Cache first page results
     • a2a_service.py:list_agents() - Cache agent listings
     • server_service.py:list_servers() - Cache server listings
     • gateway_service.py:list_gateway_peers() - Cache gateway listings
     • system_stats_service.py:get_comprehensive_stats_cached() - Cached stats

  5. Cache Invalidation Hooks
     • Tool create/update/delete triggers cache.invalidate_tools()
     • Prompt create/update/delete triggers cache.invalidate_prompts()
     • Resource create/update/delete triggers cache.invalidate_resources()
     • Similar patterns for agents, servers, gateways

  6. Configuration Settings

     Registry Cache:
     • REGISTRY_CACHE_ENABLED (default: true)
     • REGISTRY_CACHE_TOOLS_TTL (default: 20s)
     • REGISTRY_CACHE_PROMPTS_TTL (default: 15s)
     • REGISTRY_CACHE_RESOURCES_TTL (default: 15s)
     • REGISTRY_CACHE_AGENTS_TTL (default: 20s)
     • REGISTRY_CACHE_SERVERS_TTL (default: 20s)
     • REGISTRY_CACHE_GATEWAYS_TTL (default: 20s)

     Admin Stats Cache:
     • ADMIN_STATS_CACHE_ENABLED (default: true)
     • ADMIN_STATS_CACHE_SYSTEM_TTL (default: 60s)
     • ADMIN_STATS_CACHE_OBSERVABILITY_TTL (default: 30s)
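As an illustration of how the settings listed above could be resolved from the environment, here is a small sketch. It assumes TTL values are supplied as plain integer seconds; the `CacheSettings` dataclass and `load_cache_settings` helper are hypothetical and do not describe the project's actual settings machinery.

```python
import os
from dataclasses import dataclass
from typing import Mapping

# Defaults as listed in this ADR (TTL values in seconds).
_DEFAULT_TTLS = {
    "REGISTRY_CACHE_TOOLS_TTL": 20,
    "REGISTRY_CACHE_PROMPTS_TTL": 15,
    "REGISTRY_CACHE_RESOURCES_TTL": 15,
    "REGISTRY_CACHE_AGENTS_TTL": 20,
    "REGISTRY_CACHE_SERVERS_TTL": 20,
    "REGISTRY_CACHE_GATEWAYS_TTL": 20,
    "ADMIN_STATS_CACHE_SYSTEM_TTL": 60,
    "ADMIN_STATS_CACHE_OBSERVABILITY_TTL": 30,
}


@dataclass(frozen=True)
class CacheSettings:
    """Hypothetical container for the resolved cache configuration."""
    registry_enabled: bool
    admin_stats_enabled: bool
    ttls: Mapping[str, int]


def _as_bool(raw: str) -> bool:
    # Assumption: common truthy spellings are accepted.
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def load_cache_settings(env: Mapping[str, str] = os.environ) -> CacheSettings:
    """Read each setting from `env`, falling back to the documented default."""
    ttls = {name: int(env.get(name, default)) for name, default in _DEFAULT_TTLS.items()}
    return CacheSettings(
        registry_enabled=_as_bool(env.get("REGISTRY_CACHE_ENABLED", "true")),
        admin_stats_enabled=_as_bool(env.get("ADMIN_STATS_CACHE_ENABLED", "true")),
        ttls=ttls,
    )
```

Passing a plain dictionary instead of `os.environ` makes the helper easy to exercise in tests, for example `load_cache_settings({"REGISTRY_CACHE_ENABLED": "false"})`.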

Cache Key Scheme

Registry cache keys include filter hashes for cache differentiation:

{prefix}registry:tools:{filters_hash}        → Serialized tools list + cursor
{prefix}registry:prompts:{filters_hash}      → Serialized prompts list + cursor
{prefix}registry:resources:{filters_hash}    → Serialized resources list + cursor
{prefix}registry:agents:{filters_hash}       → Serialized agents list + cursor
{prefix}registry:servers:{filters_hash}      → Serialized servers list + cursor
{prefix}registry:gateways:{filters_hash}     → Serialized gateways list + cursor

Admin stats cache keys:

{prefix}stats:system         → System stats JSON
{prefix}stats:observability  → Observability metrics JSON
{prefix}stats:users          → Users list JSON
{prefix}stats:teams          → Teams list JSON

Default prefix: mcpgw: → mcpgw:registry:tools:abc123
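One way a filter-aware key of this shape could be derived is sketched below. The ADR does not specify the hash algorithm or digest length, so the use of SHA-256, the 12-character truncation, and the helper name `registry_cache_key` are all assumptions made for illustration.

```python
import hashlib
import json
from typing import Any, Mapping


def registry_cache_key(entity: str, filters: Mapping[str, Any], prefix: str = "mcpgw:") -> str:
    """Build '{prefix}registry:{entity}:{filters_hash}' from a filter mapping."""
    # Sort list-like values (e.g. tags) so equivalent filters normalize identically.
    normalized = {
        name: sorted(value) if isinstance(value, (list, set, tuple)) else value
        for name, value in filters.items()
    }
    # sort_keys makes the serialization independent of dict insertion order.
    canonical = json.dumps(normalized, sort_keys=True, separators=(",", ":"))
    filters_hash = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]
    return f"{prefix}registry:{entity}:{filters_hash}"
```

Normalizing before hashing means that `{"tags": ["b", "a"]}` and `{"tags": ["a", "b"]}` share one cache entry, while any genuinely different filter combination gets its own key.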

Caching Strategy

  • First page only: Only the first page (cursor=None) of results is cached to maximize hit rate
  • Filter-aware: Different filter combinations get different cache entries
  • Automatic invalidation: CRUD operations invalidate the entire cache for the affected entity type
  • Graceful fallback: Database queries still work if the cache is unavailable
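The first-page-only and invalidate-everything rules above can be sketched as follows. `FirstPageCache`, `get_page`, and the `query_db` callback are hypothetical names; TTL expiry and the Redis fallback are omitted to keep the example short.

```python
from typing import Callable, Optional


class FirstPageCache:
    """Hypothetical first-page cache for a single entity type."""

    def __init__(self) -> None:
        self._entries: dict[str, list] = {}  # filters_hash -> cached first page

    def get_page(self, filters_hash: str, cursor: Optional[str],
                 query_db: Callable[[Optional[str]], list]) -> list:
        if cursor is not None:
            # Pages beyond the first are never cached and always hit the database.
            return query_db(cursor)
        if filters_hash in self._entries:
            return self._entries[filters_hash]  # cache hit
        page = query_db(None)  # cache miss: query, then store
        self._entries[filters_hash] = page
        return page

    def invalidate(self) -> None:
        """Called after any create/update/delete: drop every filter variant."""
        self._entries.clear()
```

Clearing every entry for the entity type on a write is coarser than targeted invalidation, but it avoids having to work out which filter combinations a changed record would have appeared in.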

Performance Optimizations

Before (Baseline)

| Metric | Value |
|---|---|
| List tools latency P50 | ~50ms |
| Admin dashboard load | ~300ms |
| N+1 queries per list | 10-50 (one per result) |

After (With Caching)

| Metric | Cache Hit | Cache Miss |
|---|---|---|
| List tools latency P50 | ~2-5ms | ~50ms |
| Admin dashboard load | ~5-10ms | ~300ms |
| N+1 queries per list | 0 | 1 (batch fetch) |
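The single batch fetch in the last row replaces one team lookup per result. A minimal sketch of the idea, using the standard-library `sqlite3` module so it runs on its own; the `teams` table, its `id` and `name` columns, and the `fetch_team_names` helper are assumptions for illustration and not the services' actual query code.

```python
import sqlite3
from typing import Iterable, Optional


def fetch_team_names(conn: sqlite3.Connection,
                     team_ids: Iterable[Optional[str]]) -> dict[str, str]:
    """Resolve all team names for one page of results with a single query."""
    # De-duplicate and skip results that have no team.
    unique_ids = sorted({tid for tid in team_ids if tid is not None})
    if not unique_ids:
        return {}
    placeholders = ", ".join("?" for _ in unique_ids)
    rows = conn.execute(
        f"SELECT id, name FROM teams WHERE id IN ({placeholders})",
        unique_ids,
    ).fetchall()
    return dict(rows)  # id -> name


# Usage: look up names once per page, then annotate each result from the mapping.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE teams (id TEXT PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO teams VALUES (?, ?)",
                 [("t1", "Platform"), ("t2", "Data"), ("t3", "Ops")])
page_team_ids = ["t1", "t2", "t1", None]
team_names = fetch_team_names(conn, page_team_ids)
```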

Expected improvement: 80-95% latency reduction for requests served from cache
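How much of this improvement shows up in practice depends on the cache hit rate. The arithmetic can be sketched as below; the helper names are hypothetical, and any hit rate plugged in is an illustrative assumption, since this ADR does not report a measured value.

```python
def latency_reduction(miss_ms: float, hit_ms: float) -> float:
    """Fractional latency reduction for a request served from cache."""
    return 1.0 - hit_ms / miss_ms


def expected_latency(hit_rate: float, hit_ms: float, miss_ms: float) -> float:
    """Mean latency in ms for a given cache hit rate (0.0 to 1.0)."""
    return hit_rate * hit_ms + (1.0 - hit_rate) * miss_ms
```

For example, with the list-tools figures from the tables, `latency_reduction(50, 5)` is a 90% reduction on a hit, and at an assumed 80% hit rate `expected_latency(0.8, 5, 50)` gives a mean of about 14 ms.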

Consequences

Positive

  • Significant reduction in database load for read-heavy workloads
  • Faster admin dashboard rendering
  • Eliminated N+1 query patterns
  • Better user experience for browsing registry
  • Redis cache enables shared state across workers

Negative

  • Cache staleness window (up to TTL) after modifications:
      • Registry changes: up to 15-20s
      • Stats changes: up to 30-60s
  • Additional memory usage for in-memory fallback cache
  • Complexity in cache key management for filtered queries

Neutral

  • Only first page cached (pagination beyond first page always hits DB)
  • TTL values are configurable if the defaults don't fit the use case
  • Feature is backward-compatible (defaults to enabled)

Alternatives Considered

| Option | Why Not |
|---|---|
| Cache all pages | Low hit rate, high memory usage |
| Longer TTLs | Too stale for active registries |
| Query result caching at DB level | Less control, doesn't help N+1 |
| Materialized views | PostgreSQL-specific, complex maintenance |

Compatibility Notes

  • Features are enabled by default
  • Can be disabled without code changes via environment variables
  • No database schema changes required
  • Works with existing pagination patterns

References

  • GitHub Issue #1680: Distributed Registry & Admin Stats Caching
  • ADR-028: Authentication Data Caching (establishes pattern)
  • mcpgateway/cache/registry_cache.py - Registry cache implementation
  • mcpgateway/cache/admin_stats_cache.py - Admin stats cache implementation
  • mcpgateway/services/*_service.py - Integration points

Status

Implemented and enabled by default. Monitor cache hit rates via the stats endpoint.