Skip to content

ADR-0009: Built-in Health Checks & Self-MonitoringΒΆ

  • Status: Accepted
  • Date: 2025-02-21
  • Deciders: Core Engineering Team

ContextΒΆ

MCP Gateways must participate in mesh/federated deployments. Faulty nodes must be detected and removed automatically. Additionally, cloud-native infrastructure (like Kubernetes, Docker Swarm, or systemd watchdogs) needs a way to check local health.

The gateway config supports health-related settings:

  • HEALTH_CHECK_INTERVAL: frequency of peer checks
  • HEALTH_CHECK_TIMEOUT: request timeout per probe
  • UNHEALTHY_THRESHOLD: number of failures before a peer is marked unhealthy

The README and architecture describe /health and /metrics endpoints as built-in features

DecisionΒΆ

Implement two health-check levels:

  1. Local health endpoint at /health:
  2. Verifies database connectivity and response time
  3. Optionally checks cache (e.g. Redis ping or in-memory status)

  4. Federated peer liveness:

  5. Every HEALTH_CHECK_INTERVAL, we ping all registered peers via HTTP
  6. If a peer fails UNHEALTHY_THRESHOLD times consecutively, it's tagged as 'Offline' i.e. The gateway is unreachable. Once its back online, it's automatically tagged as 'Active'
  7. A separate background task handles this (see GatewayService)

Health info is also published to /metrics in Prometheus format.

ConsequencesΒΆ

  • βœ… Federated topologies can eject bad nodes quickly and re-accept them later
  • βœ… Local health can be used by Kubernetes probes, HAProxy, etc.
  • πŸ”„ Gateways that go offline briefly won't be removed immediately (tunable)
  • πŸ” Metrics include last check time, RTT, and result status

Alternatives ConsideredΒΆ

Option Why Not
No health checks Delayed or no reaction to failures; requires manual debugging
Rely on Kubernetes probes Only detects local process health, not remote peers
External APM agent (Datadog) Complex setup, costly for small/self-hosted use cases
Central heartbeat server Single point of failure, requires extra infra

StatusΒΆ

This is implemented as part of the GatewayService and exposed via /health and /metrics endpoints.