
ADR-0010: Observability via Prometheus, Structured Logs, and Metrics

  • Status: Accepted
  • Date: 2025-02-21
  • Deciders: Core Engineering Team

Context

The MCP Gateway is a long-running service that executes tools, processes requests, and federates with remote peers. Operators and developers must be able to observe:

  • Overall system health
  • Request throughput and latency
  • Tool and resource usage
  • Error rates and failure patterns
  • Federation behavior and peer availability

The gateway needs to surface these signals natively, without requiring external instrumentation or agents.

Decision

We will implement native observability features using:

  1. Structured JSON logs with optional plaintext fallback:
       • Controlled by LOG_FORMAT=json|text and LOG_LEVEL
       • Includes fields: timestamp, level, logger name, request ID, route, auth user, latency

  2. Prometheus-compatible /metrics endpoint:
       • Exposes key counters and histograms: tool invocations, failures, resource loads, peer syncs, etc.
       • Uses the plain-text exposition format (text/plain; version=0.0.4)

  3. Latency decorators and in-code timing for critical paths:
       • Completion requests
       • Resource resolution
       • Federation sync/health probes

  4. Per-request IDs and correlation:
       • Middleware reuses an incoming X-Request-ID if present, or generates a new one
       • Request ID propagates through logs and errors
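
The logging and request-ID decisions above can be sketched with the Python standard library alone. This is a minimal illustration, not the gateway's actual LoggingService: the names JsonFormatter, configure_logging, resolve_request_id, and request_id_var are hypothetical, as are the "gateway" logger name, the JSON key names, and the assumption that LOG_FORMAT defaults to json.

```python
import contextvars
import json
import logging
import os
import sys
import uuid

# Holds the current request's ID so any log call can pick it up (hypothetical name).
request_id_var = contextvars.ContextVar("request_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S%z"),
            "level": record.levelname,
            "logger": record.name,
            "request_id": request_id_var.get(),
            "message": record.getMessage(),
        }
        # Optional per-request fields supplied through `extra=` (assumed key names).
        for key in ("route", "auth_user", "latency_ms"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)


def configure_logging(stream=sys.stdout):
    """Build a logger from LOG_FORMAT (json|text) and LOG_LEVEL."""
    handler = logging.StreamHandler(stream)
    if os.getenv("LOG_FORMAT", "json").lower() == "text":
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s")
        )
    else:
        handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("gateway")
    logger.handlers = [handler]
    logger.setLevel(os.getenv("LOG_LEVEL", "INFO").upper())
    logger.propagate = False
    return logger


def resolve_request_id(headers):
    """Reuse an incoming X-Request-ID header or generate a new ID."""
    lowered = {k.lower(): v for k, v in headers.items()}
    request_id = lowered.get("x-request-id") or uuid.uuid4().hex
    request_id_var.set(request_id)
    return request_id
```

In a real service, a middleware would call something like resolve_request_id once per request and echo the value back in the response headers. Storing the ID in a context variable means handlers do not need to pass it explicitly to every log call.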

Consequences

  • 📊 Metrics can be scraped by Prometheus and visualized in Grafana
  • 🔍 Developers can trace logs by request or user
  • 🛠️ No external sidecars required for basic visibility
  • 📦 Docker image exposes /metrics by default and logs JSON to stdout

Alternatives Considered

Option                            Why Not
--------------------------------  ---------------------------------------------------------------
No structured logging             Difficult to parse or filter logs; weak correlation per request
Third-party APM (e.g., Datadog)   Adds vendor lock-in, overhead, and cost
Syslog or Fluentd only            Requires extra deployment layers; still needs JSON emitters
StatsD / Telegraf metrics         Less common today than Prometheus; harder to self-host

Status

Implemented in LoggingService and metrics_router. Observability is active by default for all transports and routes.