# ADR-020: Health Checks and Liveness/Readiness Probes

**Date**: 2025-12-12

**Status**: Accepted

**Implemented**: 2026-01-09

## Context

When the application is containerized (`ADR-014`), the container orchestrator (e.g., Kubernetes, Docker Swarm) needs a way to determine if the application is running correctly. Without this, it cannot manage application lifecycle events like restarts or rolling updates effectively.

## Decision

We will implement dedicated health check endpoints in the Express application.

- A **Liveness Probe** (`/api/health/live`) will return a `200 OK` to indicate the server is running. If it fails, the orchestrator should restart the container.
- A **Readiness Probe** (`/api/health/ready`) will return a `200 OK` only if the application is ready to accept traffic (e.g., database connection is established). If it fails, the orchestrator will temporarily remove the container from the load balancer.

## Consequences

- **Positive**: Enables robust, automated application lifecycle management in a containerized environment. Prevents traffic from being sent to unhealthy or uninitialized application instances.
- **Negative**: Adds a small amount of code for the health check endpoints. Requires configuration in the container orchestration layer.

## Implementation Status

### What's Implemented

- ✅ **Liveness Probe** (`/api/health/live`) - Simple process health check
- ✅ **Readiness Probe** (`/api/health/ready`) - Comprehensive dependency health check
- ✅ **Startup Probe** (`/api/health/startup`) - Initial startup verification
- ✅ **Individual Service Checks** - Database, Redis, Storage endpoints
- ✅ **Detailed Health Response** - Service latency, status, and details

## Implementation Details

### Probe Endpoints

| Endpoint              | Purpose         | Checks             | HTTP Status                   |
| --------------------- | --------------- | ------------------ | ----------------------------- |
| `/api/health/live`    | Liveness probe  | Process running    | 200 = alive                   |
| `/api/health/ready`   | Readiness probe | DB, Redis, Storage | 200 = ready, 503 = not ready  |
| `/api/health/startup` | Startup probe   | Database only      | 200 = started, 503 = starting |

### Liveness Probe

The liveness probe is intentionally simple with no external dependencies:

```typescript
// GET /api/health/live
{
  "status": "ok",
  "timestamp": "2026-01-09T12:00:00.000Z"
}
```

**Usage**: If this endpoint fails to respond, the container should be restarted.

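A liveness handler along these lines would produce the response above. This is a minimal sketch, not the code in `src/routes/health.routes.ts`: the names `buildLivenessPayload`, `livenessHandler`, and `JsonResponse` are hypothetical, and `JsonResponse` is a small stand-in for the parts of the Express response object the handler needs, so the block compiles without Express installed.

```typescript
// Hypothetical stand-in for the subset of Express's Response used here.
interface JsonResponse {
  status(code: number): JsonResponse;
  json(body: unknown): void;
}

interface LivenessPayload {
  status: 'ok';
  timestamp: string;
}

// Builds the liveness body. The clock is a parameter so the function
// can be tested with a fixed date.
function buildLivenessPayload(now: Date = new Date()): LivenessPayload {
  return { status: 'ok', timestamp: now.toISOString() };
}

// Hypothetical handler: it touches no database, Redis, or storage,
// so it only fails when the process itself cannot respond.
function livenessHandler(_req: unknown, res: JsonResponse): void {
  res.status(200).json(buildLivenessPayload());
}
```

Keeping dependency checks out of this handler matters: if the liveness probe queried the database, a database outage would cause the orchestrator to restart otherwise healthy containers.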
### Readiness Probe

The readiness probe checks all critical dependencies:

```typescript
// GET /api/health/ready
{
  "status": "healthy", // healthy | degraded | unhealthy
  "timestamp": "2026-01-09T12:00:00.000Z",
  "uptime": 3600.5,
  "services": {
    "database": {
      "status": "healthy",
      "latency": 5,
      "details": {
        "totalConnections": 10,
        "idleConnections": 8,
        "waitingConnections": 0
      }
    },
    "redis": {
      "status": "healthy",
      "latency": 2
    },
    "storage": {
      "status": "healthy",
      "latency": 1,
      "details": {
        "path": "/var/www/.../flyer-images"
      }
    }
  }
}
```

**Status Logic**:

- `healthy` - All critical services (database, Redis) are healthy
- `degraded` - Some non-critical issues (high connection wait, storage issues)
- `unhealthy` - Critical service unavailable (returns 503)

### Startup Probe

The startup probe is used during container initialization:

```typescript
// GET /api/health/startup

// Success (200):
{
  "status": "started",
  "timestamp": "2026-01-09T12:00:00.000Z",
  "database": {
    "status": "healthy",
    "latency": 5
  }
}

// Still starting (503):
{
  "status": "starting",
  "message": "Waiting for database connection",
  "database": {
    "status": "unhealthy",
    "message": "..."
  }
}
```

### Individual Service Endpoints

For detailed diagnostics:

| Endpoint                | Purpose                         |
| ----------------------- | ------------------------------- |
| `/api/health/ping`      | Simple server responsiveness    |
| `/api/health/db-schema` | Verify database tables exist    |
| `/api/health/db-pool`   | Database connection pool status |
| `/api/health/redis`     | Redis connectivity              |
| `/api/health/storage`   | File storage accessibility      |
| `/api/health/time`      | Server time synchronization     |

## Kubernetes Configuration Example

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: flyer-crawler
spec:
  containers:
    - name: flyer-crawler
      image: flyer-crawler:latest # same image name as the Docker Compose example
      livenessProbe:
        httpGet:
          path: /api/health/live
          port: 3001
        initialDelaySeconds: 10
        periodSeconds: 15
        failureThreshold: 3
      readinessProbe:
        httpGet:
          path: /api/health/ready
          port: 3001
        initialDelaySeconds: 5
        periodSeconds: 10
        failureThreshold: 3
      startupProbe:
        httpGet:
          path: /api/health/startup
          port: 3001
        initialDelaySeconds: 0
        periodSeconds: 5
        failureThreshold: 30 # Allow up to 150 seconds for startup
```

## Docker Compose Configuration Example

```yaml
services:
  api:
    image: flyer-crawler:latest
    healthcheck:
      # Requires curl to be installed in the image
      test: ['CMD', 'curl', '-f', 'http://localhost:3001/api/health/ready']
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s
```

## PM2 Configuration Example

For non-containerized deployments using PM2:

```javascript
// ecosystem.config.js
module.exports = {
  apps: [
    {
      name: 'flyer-crawler',
      script: 'dist/server.js',
      // Intended behavior: poll this endpoint and restart the process if it fails.
      // Note: `health_check` is not among PM2's documented ecosystem options,
      // so confirm that your PM2 version or tooling acts on it. Otherwise, poll
      // the endpoint from an external monitor and restart on failure.
      health_check: {
        url: 'http://localhost:3001/api/health/ready',
        interval: 30000,
        timeout: 10000,
      },
    },
  ],
};
```

## Key Files

- `src/routes/health.routes.ts` - Health check endpoint implementations
- `server.ts` - Health routes mounted at `/api/health`

## Service Health Thresholds

| Service  | Healthy                | Degraded                | Unhealthy           |
| -------- | ---------------------- | ----------------------- | ------------------- |
| Database | Responds to `SELECT 1` | > 3 waiting connections | Connection fails    |
| Redis    | `PING` returns `PONG`  | N/A                     | Connection fails    |
| Storage  | Write access to path   | N/A                     | Path not accessible |
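
The thresholds above, combined with the readiness status logic, could be expressed roughly as follows. This is a sketch under stated assumptions rather than the actual implementation: the function names and the `ServiceStatuses` type are hypothetical, and returning 200 for a `degraded` result is an assumption, since this ADR only states that `unhealthy` returns 503.

```typescript
type Status = 'healthy' | 'degraded' | 'unhealthy';

// Hypothetical shape holding the per-service results.
interface ServiceStatuses {
  database: Status;
  redis: Status;
  storage: Status;
}

// From the thresholds table: more than 3 waiting connections is degraded.
const MAX_WAITING_CONNECTIONS = 3;

// Classifies the database from reachability and pool pressure.
function classifyDatabase(reachable: boolean, waitingConnections: number): Status {
  if (!reachable) return 'unhealthy';
  return waitingConnections > MAX_WAITING_CONNECTIONS ? 'degraded' : 'healthy';
}

// Database and Redis are critical: either one being unhealthy makes the
// whole application unhealthy. Any other non-healthy result, including a
// storage failure, only degrades the overall status.
function overallStatus(s: ServiceStatuses): Status {
  if (s.database === 'unhealthy' || s.redis === 'unhealthy') return 'unhealthy';
  const allHealthy =
    s.database === 'healthy' && s.redis === 'healthy' && s.storage === 'healthy';
  return allHealthy ? 'healthy' : 'degraded';
}

// Assumption: degraded instances keep receiving traffic (HTTP 200).
function httpStatusFor(status: Status): number {
  return status === 'unhealthy' ? 503 : 200;
}
```

For example, `overallStatus({ database: 'healthy', redis: 'healthy', storage: 'unhealthy' })` yields `degraded`, which keeps the instance in rotation while still surfacing the storage problem in the response body.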