ADR-020: Health Checks and Liveness/Readiness Probes

Date: 2025-12-12

Status: Accepted

Implemented: 2026-01-09

Context

When the application is containerized (ADR-014), the container orchestrator (e.g., Kubernetes, Docker Swarm) needs a way to determine if the application is running correctly. Without this, it cannot manage application lifecycle events like restarts or rolling updates effectively.

Decision

We will implement dedicated health check endpoints in the Express application.

  • A Liveness Probe (/api/health/live) will return a 200 OK to indicate the server is running. If it fails, the orchestrator should restart the container.

  • A Readiness Probe (/api/health/ready) will return a 200 OK only if the application is ready to accept traffic (e.g., database connection is established). If it fails, the orchestrator will temporarily remove the container from the load balancer.

Consequences

  • Positive: Enables robust, automated application lifecycle management in a containerized environment. Prevents traffic from being sent to unhealthy or uninitialized application instances.
  • Negative: Adds a small amount of code for the health check endpoints. Requires configuration in the container orchestration layer.

Implementation Status

What's Implemented

  • Liveness Probe (/api/health/live) - Simple process health check
  • Readiness Probe (/api/health/ready) - Comprehensive dependency health check
  • Startup Probe (/api/health/startup) - Initial startup verification
  • Individual Service Checks - Database, Redis, Storage endpoints
  • Detailed Health Response - Service latency, status, and details

Implementation Details

Probe Endpoints

| Endpoint | Purpose | Checks | HTTP Status |
| --- | --- | --- | --- |
| /api/health/live | Liveness probe | Process running | 200 = alive |
| /api/health/ready | Readiness probe | DB, Redis, Storage | 200 = ready, 503 = not ready |
| /api/health/startup | Startup probe | Database only | 200 = started, 503 = starting |

Liveness Probe

The liveness probe is intentionally simple with no external dependencies:

// GET /api/health/live
{
  "status": "ok",
  "timestamp": "2026-01-09T12:00:00.000Z"
}

Usage: If this endpoint fails to respond, the container should be restarted.
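A minimal sketch of how such a dependency-free liveness body could be built. The helper name `livenessBody` and the Express wiring in the trailing comment are illustrative assumptions, not the contents of src/routes/health.routes.ts.

```typescript
// Shape of the liveness payload shown above.
interface LivenessBody {
  status: "ok";
  timestamp: string; // ISO 8601, UTC
}

// Hypothetical helper: builds the body without touching any external dependency,
// so the probe only fails when the process itself cannot respond.
function livenessBody(now: Date = new Date()): LivenessBody {
  return { status: "ok", timestamp: now.toISOString() };
}

// Possible Express wiring (assumes a `router` created with express.Router()):
//   router.get("/live", (_req, res) => res.status(200).json(livenessBody()));
```

Accepting the clock as a parameter keeps the helper deterministic in unit tests.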

Readiness Probe

The readiness probe checks all critical dependencies:

// GET /api/health/ready
{
  "status": "healthy",  // healthy | degraded | unhealthy
  "timestamp": "2026-01-09T12:00:00.000Z",
  "uptime": 3600.5,
  "services": {
    "database": {
      "status": "healthy",
      "latency": 5,
      "details": {
        "totalConnections": 10,
        "idleConnections": 8,
        "waitingConnections": 0
      }
    },
    "redis": {
      "status": "healthy",
      "latency": 2
    },
    "storage": {
      "status": "healthy",
      "latency": 1,
      "details": {
        "path": "/var/www/.../flyer-images"
      }
    }
  }
}

Status Logic:

  • healthy - All critical services (database, Redis) are healthy
  • degraded - Critical services are up, but a non-critical problem exists (more than 3 waiting database connections, storage issues)
  • unhealthy - Critical service unavailable (returns 503)
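The status logic above can be sketched as a small aggregation function. This is an illustration under stated assumptions: the names `aggregateStatus` and `httpStatusFor` are hypothetical, a missing report for a critical service is treated as unhealthy, and a degraded result is assumed to still return 200, since the source only states that unhealthy returns 503.

```typescript
type ServiceStatus = "healthy" | "degraded" | "unhealthy";

interface ServiceHealth {
  status: ServiceStatus;
  latency?: number; // milliseconds
  details?: Record<string, unknown>;
}

// Per the status logic above, database and Redis are the critical services.
const CRITICAL_SERVICES = ["database", "redis"] as const;

// Hypothetical aggregation: any critical failure wins, then any other issue degrades.
function aggregateStatus(services: Record<string, ServiceHealth>): ServiceStatus {
  const criticalDown = CRITICAL_SERVICES.some((name) => {
    const report = services[name];
    // Assumption: a critical service with no report counts as unhealthy.
    return report === undefined || report.status === "unhealthy";
  });
  if (criticalDown) return "unhealthy";

  const anyIssue = Object.values(services).some((s) => s.status !== "healthy");
  return anyIssue ? "degraded" : "healthy";
}

// Assumption: only "unhealthy" maps to 503; "degraded" still serves traffic.
function httpStatusFor(status: ServiceStatus): number {
  return status === "unhealthy" ? 503 : 200;
}
```

Under this sketch, a storage failure alone yields `degraded` with a 200, while a Redis failure yields `unhealthy` with a 503.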

Startup Probe

The startup probe is used during container initialization:

// GET /api/health/startup
// Success (200):
{
  "status": "started",
  "timestamp": "2026-01-09T12:00:00.000Z",
  "database": { "status": "healthy", "latency": 5 }
}

// Still starting (503):
{
  "status": "starting",
  "message": "Waiting for database connection",
  "database": { "status": "unhealthy", "message": "..." }
}
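The two startup responses above could be produced from a single database check result, roughly as follows. The function name `startupResponse` and the `DbCheck` type are hypothetical; only the field names and values are taken from the examples.

```typescript
interface DbCheck {
  status: "healthy" | "unhealthy";
  latency?: number; // milliseconds
  message?: string;
}

interface StartupResult {
  code: number; // HTTP status to send
  body: Record<string, unknown>;
}

// Hypothetical mapping from the database check to the startup probe response.
function startupResponse(db: DbCheck, now: Date = new Date()): StartupResult {
  if (db.status === "healthy") {
    return {
      code: 200,
      body: { status: "started", timestamp: now.toISOString(), database: db },
    };
  }
  return {
    code: 503,
    body: {
      status: "starting",
      message: "Waiting for database connection",
      database: db,
    },
  };
}
```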

Individual Service Endpoints

For detailed diagnostics:

| Endpoint | Purpose |
| --- | --- |
| /api/health/ping | Simple server responsiveness |
| /api/health/db-schema | Verify database tables exist |
| /api/health/db-pool | Database connection pool status |
| /api/health/redis | Redis connectivity |
| /api/health/storage | File storage accessibility |
| /api/health/time | Server time synchronization |

Kubernetes Configuration Example

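apiVersion: v1
kind: Pod
metadata:
  name: flyer-crawler
spec:
  containers:
    - name: flyer-crawler
      image: flyer-crawler:latest # same image name as the Docker Compose example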
      livenessProbe:
        httpGet:
          path: /api/health/live
          port: 3001
        initialDelaySeconds: 10
        periodSeconds: 15
        failureThreshold: 3

      readinessProbe:
        httpGet:
          path: /api/health/ready
          port: 3001
        initialDelaySeconds: 5
        periodSeconds: 10
        failureThreshold: 3

      startupProbe:
        httpGet:
          path: /api/health/startup
          port: 3001
        initialDelaySeconds: 0
        periodSeconds: 5
        failureThreshold: 30 # Allow up to 150 seconds for startup
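The 150-second figure in the comment comes from multiplying the probe period by the failure threshold. A small sketch of that arithmetic, with a hypothetical helper name; it is a rough upper bound that ignores `timeoutSeconds` and scheduling jitter.

```typescript
interface ProbeTiming {
  initialDelaySeconds: number;
  periodSeconds: number;
  failureThreshold: number;
}

// Approximate time a continuously failing probe is tolerated before the
// orchestrator acts on it: initial delay plus one period per allowed failure.
function failureBudgetSeconds(p: ProbeTiming): number {
  return p.initialDelaySeconds + p.periodSeconds * p.failureThreshold;
}

// Values from the manifest above.
const startupBudget = failureBudgetSeconds({
  initialDelaySeconds: 0,
  periodSeconds: 5,
  failureThreshold: 30,
}); // 150

const livenessBudget = failureBudgetSeconds({
  initialDelaySeconds: 10,
  periodSeconds: 15,
  failureThreshold: 3,
}); // 55
```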

Docker Compose Configuration Example

services:
  api:
    image: flyer-crawler:latest
    healthcheck:
      test: ['CMD', 'curl', '-f', 'http://localhost:3001/api/health/ready']
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s

PM2 Configuration Example

For non-containerized deployments using PM2:

// ecosystem.config.js
module.exports = {
  apps: [
    {
      name: 'flyer-crawler',
      script: 'dist/server.js',
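      // Caution: `health_check` does not appear among PM2's documented
      // ecosystem-file options, so PM2 may ignore this block. PM2 restarts a
      // process when it exits; polling an endpoint likely needs an external
      // monitor. The liveness endpoint is used here because a failed check
      // should trigger a restart.
      health_check: {
        url: 'http://localhost:3001/api/health/live',
        interval: 30000,
        timeout: 10000,
      },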
    },
  ],
};
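If the endpoint is polled by a separate watchdog process, the restart decision can mirror the failure-threshold behavior of the container probes. The sketch below covers only that decision logic. The names `stepWatchdog` and `WatchdogState` are hypothetical, and the restart action (for example running `pm2 restart flyer-crawler`) is left to the caller.

```typescript
interface WatchdogState {
  consecutiveFailures: number;
}

interface WatchdogStep {
  state: WatchdogState;
  restart: boolean; // true when the caller should restart the process
}

// Hypothetical decision step: restart only after `failureThreshold` failed
// probes in a row, and reset the counter on any success or after a restart.
function stepWatchdog(
  state: WatchdogState,
  probeOk: boolean,
  failureThreshold: number = 3,
): WatchdogStep {
  if (probeOk) {
    return { state: { consecutiveFailures: 0 }, restart: false };
  }
  const failures = state.consecutiveFailures + 1;
  if (failures >= failureThreshold) {
    return { state: { consecutiveFailures: 0 }, restart: true };
  }
  return { state: { consecutiveFailures: failures }, restart: false };
}
```

Requiring several consecutive failures avoids restarting the process on a single slow response.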

Key Files

  • src/routes/health.routes.ts - Health check endpoint implementations
  • server.ts - Health routes mounted at /api/health

Service Health Thresholds

| Service | Healthy | Degraded | Unhealthy |
| --- | --- | --- | --- |
| Database | Responds to SELECT 1 | > 3 waiting connections | Connection fails |
| Redis | PING returns PONG | N/A | Connection fails |
| Storage | Write access to path | N/A | Path not accessible |
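The thresholds in this table can be expressed as per-service classifiers. This is a sketch under assumptions: the function names are hypothetical, and the inputs (whether `SELECT 1` succeeded, the Redis reply, whether the path is writable) are presumed to have been gathered by the actual checks.

```typescript
type ServiceStatus = "healthy" | "degraded" | "unhealthy";

interface PoolStats {
  totalConnections: number;
  idleConnections: number;
  waitingConnections: number;
}

// From the table: more than 3 waiting connections marks the database degraded.
const MAX_WAITING_CONNECTIONS = 3;

function databaseStatus(selectOk: boolean, pool: PoolStats): ServiceStatus {
  if (!selectOk) return "unhealthy"; // connection or query failed
  return pool.waitingConnections > MAX_WAITING_CONNECTIONS ? "degraded" : "healthy";
}

// Redis has no degraded state in the table: PONG is healthy, anything else is not.
function redisStatus(pingReply: string | null): ServiceStatus {
  return pingReply === "PONG" ? "healthy" : "unhealthy";
}

// Storage likewise has only two states, based on write access to the path.
function storageStatus(writable: boolean): ServiceStatus {
  return writable ? "healthy" : "unhealthy";
}
```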