# ADR-020: Health Checks and Liveness/Readiness Probes

**Date**: 2025-12-12

**Status**: Accepted

**Implemented**: 2026-01-09

## Context

When the application is containerized (`ADR-014`), the container orchestrator (e.g., Kubernetes, Docker Swarm) needs a way to determine whether the application is running correctly. Without a health signal, it cannot reliably manage lifecycle events such as restarts or rolling updates.

## Decision

We will implement dedicated health check endpoints in the Express application.

- A **Liveness Probe** (`/api/health/live`) will return a `200 OK` to indicate the server is running. If it fails, the orchestrator should restart the container.

- A **Readiness Probe** (`/api/health/ready`) will return a `200 OK` only if the application is ready to accept traffic (e.g., database connection is established). If it fails, the orchestrator will temporarily remove the container from the load balancer.

## Consequences

- **Positive**: Enables robust, automated application lifecycle management in a containerized environment. Prevents traffic from being sent to unhealthy or uninitialized application instances.
- **Negative**: Adds a small amount of code for the health check endpoints. Requires configuration in the container orchestration layer.

## Implementation Status

### What's Implemented

- ✅ **Liveness Probe** (`/api/health/live`) - Simple process health check
- ✅ **Readiness Probe** (`/api/health/ready`) - Comprehensive dependency health check
- ✅ **Startup Probe** (`/api/health/startup`) - Initial startup verification
- ✅ **Individual Service Checks** - Database, Redis, Storage endpoints
- ✅ **Detailed Health Response** - Service latency, status, and details

## Implementation Details

### Probe Endpoints

| Endpoint              | Purpose         | Checks             | HTTP Status                   |
| --------------------- | --------------- | ------------------ | ----------------------------- |
| `/api/health/live`    | Liveness probe  | Process running    | 200 = alive                   |
| `/api/health/ready`   | Readiness probe | DB, Redis, Storage | 200 = ready, 503 = not ready  |
| `/api/health/startup` | Startup probe   | Database only      | 200 = started, 503 = starting |

### Liveness Probe

The liveness probe is intentionally simple with no external dependencies:

```jsonc
// GET /api/health/live
{
  "status": "ok",
  "timestamp": "2026-01-09T12:00:00.000Z"
}
```

**Usage**: If this endpoint fails to respond, the container should be restarted.
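
As an illustration, the liveness payload can be produced by a function with no I/O at all. This is a minimal sketch, not the code in `src/routes/health.routes.ts`; the names `LivenessPayload` and `buildLivenessPayload` are hypothetical.

```typescript
// Hypothetical helper: builds the liveness payload shown above.
interface LivenessPayload {
  status: "ok";
  timestamp: string; // ISO 8601, UTC
}

function buildLivenessPayload(now: Date = new Date()): LivenessPayload {
  // No database, Redis, or filesystem access: if the process can run
  // this function, it is considered alive.
  return { status: "ok", timestamp: now.toISOString() };
}

// Possible wiring in an Express router (assumed, not taken from the codebase):
//   router.get("/live", (_req, res) => res.status(200).json(buildLivenessPayload()));
```

Keeping the handler free of dependencies means a slow database cannot cause the orchestrator to restart an otherwise healthy process.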

### Readiness Probe

The readiness probe checks all critical dependencies:

```jsonc
// GET /api/health/ready
{
  "status": "healthy",  // healthy | degraded | unhealthy
  "timestamp": "2026-01-09T12:00:00.000Z",
  "uptime": 3600.5,
  "services": {
    "database": {
      "status": "healthy",
      "latency": 5,
      "details": {
        "totalConnections": 10,
        "idleConnections": 8,
        "waitingConnections": 0
      }
    },
    "redis": {
      "status": "healthy",
      "latency": 2
    },
    "storage": {
      "status": "healthy",
      "latency": 1,
      "details": {
        "path": "/var/www/.../flyer-images"
      }
    }
  }
}
```

**Status Logic**:

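- `healthy` - All critical services (database, Redis) are healthy and no non-critical issue is reported
- `degraded` - Critical services are up, but a non-critical issue exists (more than 3 waiting database connections, or a storage problem)
- `unhealthy` - A critical service (database or Redis) is unavailable; the endpoint returns 503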
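
The status logic above can be sketched as a pure function. This is a minimal sketch under stated assumptions, not the actual implementation: the names `summarizeReadiness`, `ReadinessInput`, and `ReadinessSummary` are hypothetical, and returning 200 for `degraded` is an assumption (the text only states that `unhealthy` returns 503).

```typescript
type ReadinessLevel = "healthy" | "degraded" | "unhealthy";

// Per-service statuses as reported by the individual checks.
interface ReadinessInput {
  database: ReadinessLevel;
  redis: ReadinessLevel;
  storage: ReadinessLevel;
}

interface ReadinessSummary {
  status: ReadinessLevel;
  httpStatus: 200 | 503;
}

function summarizeReadiness(services: ReadinessInput): ReadinessSummary {
  // Critical services: an outage of either one makes the instance unready.
  if (services.database === "unhealthy" || services.redis === "unhealthy") {
    return { status: "unhealthy", httpStatus: 503 };
  }
  // Anything else that is not fully healthy counts as a non-critical issue.
  const allHealthy = [services.database, services.redis, services.storage].every(
    (level) => level === "healthy",
  );
  // Assumption: a degraded instance still answers 200 and keeps receiving traffic.
  return { status: allHealthy ? "healthy" : "degraded", httpStatus: 200 };
}
```

With this shape, a storage outage yields `degraded` rather than `unhealthy`, so the instance stays in rotation while the problem is visible in the response body.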

### Startup Probe

The startup probe is used during container initialization:

```jsonc
// GET /api/health/startup
// Success (200):
{
  "status": "started",
  "timestamp": "2026-01-09T12:00:00.000Z",
  "database": { "status": "healthy", "latency": 5 }
}

// Still starting (503):
{
  "status": "starting",
  "message": "Waiting for database connection",
  "database": { "status": "unhealthy", "message": "..." }
}
```
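
The two startup responses above differ only in the database check result, which suggests a mapping like the following. This is a hedged sketch: `DatabaseCheck`, `StartupResult`, and `buildStartupResult` are hypothetical names, and the latency unit is assumed to be milliseconds.

```typescript
interface DatabaseCheck {
  status: "healthy" | "unhealthy";
  latency?: number; // assumed to be milliseconds
  message?: string;
}

interface StartupResult {
  httpStatus: 200 | 503;
  body: {
    status: "started" | "starting";
    timestamp?: string;
    message?: string;
    database: DatabaseCheck;
  };
}

function buildStartupResult(database: DatabaseCheck, now: Date = new Date()): StartupResult {
  if (database.status === "healthy") {
    // Database reachable: report the application as started.
    return {
      httpStatus: 200,
      body: { status: "started", timestamp: now.toISOString(), database },
    };
  }
  // Database not reachable yet: ask the orchestrator to keep waiting.
  return {
    httpStatus: 503,
    body: { status: "starting", message: "Waiting for database connection", database },
  };
}
```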

### Individual Service Endpoints

For detailed diagnostics:

| Endpoint                | Purpose                         |
| ----------------------- | ------------------------------- |
| `/api/health/ping`      | Simple server responsiveness    |
| `/api/health/db-schema` | Verify database tables exist    |
| `/api/health/db-pool`   | Database connection pool status |
| `/api/health/redis`     | Redis connectivity              |
| `/api/health/storage`   | File storage accessibility      |
| `/api/health/time`      | Server time synchronization     |

## Kubernetes Configuration Example

```yaml
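apiVersion: v1
kind: Pod
metadata:
  name: flyer-crawler
spec:
  containers:
    - name: flyer-crawler
      image: flyer-crawler:latest # same image tag as the Docker Compose example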
      livenessProbe:
        httpGet:
          path: /api/health/live
          port: 3001
        initialDelaySeconds: 10
        periodSeconds: 15
        failureThreshold: 3

      readinessProbe:
        httpGet:
          path: /api/health/ready
          port: 3001
        initialDelaySeconds: 5
        periodSeconds: 10
        failureThreshold: 3

      startupProbe:
        httpGet:
          path: /api/health/startup
          port: 3001
        initialDelaySeconds: 0
        periodSeconds: 5
        failureThreshold: 30 # Allow up to 150 seconds for startup
```
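
The time budgets implied by these settings follow from `periodSeconds * failureThreshold`. The sketch below works through the arithmetic for the three probes; the helper names are hypothetical, and the results are approximations that ignore the per-request `timeoutSeconds`.

```typescript
interface ProbeTiming {
  initialDelaySeconds: number;
  periodSeconds: number;
  failureThreshold: number;
}

// Approximate time a probe may keep failing before the threshold is reached.
function detectionWindowSeconds(probe: ProbeTiming): number {
  return probe.periodSeconds * probe.failureThreshold;
}

// Approximate time allowed from container start until the startup probe gives up.
function startupBudgetSeconds(probe: ProbeTiming): number {
  return probe.initialDelaySeconds + detectionWindowSeconds(probe);
}

const startupBudget = startupBudgetSeconds({
  initialDelaySeconds: 0,
  periodSeconds: 5,
  failureThreshold: 30,
}); // 150

const livenessWindow = detectionWindowSeconds({
  initialDelaySeconds: 10,
  periodSeconds: 15,
  failureThreshold: 3,
}); // 45

const readinessWindow = detectionWindowSeconds({
  initialDelaySeconds: 5,
  periodSeconds: 10,
  failureThreshold: 3,
}); // 30
```

Under these values, a hung process is restarted after roughly 45 seconds of failed liveness checks, and an unready instance is taken out of rotation after roughly 30 seconds.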

## Docker Compose Configuration Example

```yaml
services:
  api:
    image: flyer-crawler:latest
    healthcheck:
      # Requires curl to be installed in the image
      test: ['CMD', 'curl', '-f', 'http://localhost:3001/api/health/ready']
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s
```

## PM2 Configuration Example

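For non-containerized deployments using PM2. The documented PM2 ecosystem options do not include an HTTP health check, so PM2 alone only restarts the process when it exits or crashes. Probing the readiness endpoint requires an external monitor (for example a cron job or a small watchdog script).

```javascript
// ecosystem.config.js
module.exports = {
  apps: [
    {
      name: 'flyer-crawler',
      script: 'dist/server.js',
      // PM2 restarts the process when it exits or crashes, but it does not
      // poll HTTP endpoints. An external monitor should request
      // http://localhost:3001/api/health/ready (for example every 30 s with a
      // 10 s timeout) and run `pm2 restart flyer-crawler` on repeated failures.
      autorestart: true,
      // Illustrative values, not taken from the deployed configuration:
      max_restarts: 10,
      restart_delay: 5000,
    },
  ],
};
```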
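
Whichever component polls the endpoint, restarting on a single failed check is usually too aggressive. The sketch below counts consecutive failures, similar in spirit to `failureThreshold` in the Kubernetes example. The class name and its API are hypothetical, and the actual polling and restart command are deliberately left out.

```typescript
// Hypothetical watchdog helper: decides when repeated probe failures
// should trigger a restart.
class ConsecutiveFailureTracker {
  private failures = 0;
  private readonly threshold: number;

  constructor(threshold: number) {
    this.threshold = threshold;
  }

  // Records one probe result and returns true when a restart should be triggered.
  record(probeSucceeded: boolean): boolean {
    if (probeSucceeded) {
      this.failures = 0; // any success resets the streak
      return false;
    }
    this.failures += 1;
    if (this.failures >= this.threshold) {
      this.failures = 0; // start a fresh streak after triggering
      return true;
    }
    return false;
  }
}
```

A monitor would call `record()` once per polling interval and issue the restart only when it returns `true`.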

## Key Files

- `src/routes/health.routes.ts` - Health check endpoint implementations
- `server.ts` - Health routes mounted at `/api/health`

## Service Health Thresholds

| Service  | Healthy                | Degraded                | Unhealthy           |
| -------- | ---------------------- | ----------------------- | ------------------- |
| Database | Responds to `SELECT 1` | > 3 waiting connections | Connection fails    |
| Redis    | `PING` returns `PONG`  | N/A                     | Connection fails    |
| Storage  | Write access to path   | N/A                     | Path not accessible |
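
The database row of this table can be expressed as a small classifier. This is a minimal sketch assuming the check exposes whether `SELECT 1` succeeded and how many clients are waiting on the pool; `PoolSnapshot`, `classifyDatabase`, and `MAX_WAITING_CONNECTIONS` are hypothetical names.

```typescript
type DatabaseHealth = "healthy" | "degraded" | "unhealthy";

interface PoolSnapshot {
  queryOk: boolean; // did `SELECT 1` succeed?
  waitingConnections: number; // clients queued for a pooled connection
}

// Threshold from the table above: more than 3 waiting connections is degraded.
const MAX_WAITING_CONNECTIONS = 3;

function classifyDatabase(snapshot: PoolSnapshot): DatabaseHealth {
  if (!snapshot.queryOk) {
    return "unhealthy"; // connection or query failed
  }
  return snapshot.waitingConnections > MAX_WAITING_CONNECTIONS ? "degraded" : "healthy";
}
```

Note that exactly 3 waiting connections still counts as `healthy`, since the threshold is strictly greater than 3.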