# ADR-020: Health Checks and Liveness/Readiness Probes
**Date**: 2025-12-12

**Status**: Accepted

**Implemented**: 2026-01-09
## Context
When the application is containerized (`ADR-014`), the container orchestrator (e.g., Kubernetes, Docker Swarm) needs a way to determine if the application is running correctly. Without this, it cannot manage application lifecycle events like restarts or rolling updates effectively.
## Decision
We will implement dedicated health check endpoints in the Express application.
- A **Liveness Probe** (`/api/health/live`) will return a `200 OK` to indicate the server is running. If it fails, the orchestrator should restart the container.
- A **Readiness Probe** (`/api/health/ready`) will return a `200 OK` only if the application is ready to accept traffic (e.g., database connection is established). If it fails, the orchestrator will temporarily remove the container from the load balancer.
## Consequences
- **Positive**: Enables robust, automated application lifecycle management in a containerized environment. Prevents traffic from being sent to unhealthy or uninitialized application instances.
- **Negative**: Adds a small amount of code for the health check endpoints. Requires configuration in the container orchestration layer.
## Implementation Status
### What's Implemented
- **Liveness Probe** (`/api/health/live`) - Simple process health check
- **Readiness Probe** (`/api/health/ready`) - Comprehensive dependency health check
- **Startup Probe** (`/api/health/startup`) - Initial startup verification
- **Individual Service Checks** - Database, Redis, Storage endpoints
- **Detailed Health Response** - Service latency, status, and details
## Implementation Details
### Probe Endpoints
| Endpoint | Purpose | Checks | HTTP Status |
| --------------------- | --------------- | ------------------ | ----------------------------- |
| `/api/health/live` | Liveness probe | Process running | 200 = alive |
| `/api/health/ready` | Readiness probe | DB, Redis, Storage | 200 = ready, 503 = not ready |
| `/api/health/startup` | Startup probe | Database only | 200 = started, 503 = starting |
### Liveness Probe
The liveness probe is intentionally simple with no external dependencies:
```jsonc
// GET /api/health/live
{
  "status": "ok",
  "timestamp": "2026-01-09T12:00:00.000Z"
}
```
**Usage**: If this endpoint fails to respond, the container should be restarted.
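
As a minimal sketch of a handler body that matches the response above, the helper below builds the payload without touching the database, Redis, or storage. The name `buildLivenessBody` and the Express wiring in the trailing comment are illustrative assumptions, not code taken from `src/routes/health.routes.ts`.

```typescript
// Shape of the liveness response documented above.
interface LivenessBody {
  status: 'ok';
  timestamp: string; // ISO 8601 timestamp in UTC
}

// Hypothetical helper: builds the body from the clock only, so it cannot
// fail because an external dependency is down.
function buildLivenessBody(now: Date = new Date()): LivenessBody {
  return { status: 'ok', timestamp: now.toISOString() };
}

// Possible Express wiring (assumed):
// router.get('/live', (_req, res) => res.status(200).json(buildLivenessBody()));
```

Keeping the handler free of I/O means a slow database can never cause the orchestrator to restart an otherwise working process.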
### Readiness Probe
The readiness probe checks all critical dependencies:
```jsonc
// GET /api/health/ready
{
  "status": "healthy", // healthy | degraded | unhealthy
  "timestamp": "2026-01-09T12:00:00.000Z",
  "uptime": 3600.5,
  "services": {
    "database": {
      "status": "healthy",
      "latency": 5,
      "details": {
        "totalConnections": 10,
        "idleConnections": 8,
        "waitingConnections": 0
      }
    },
    "redis": {
      "status": "healthy",
      "latency": 2
    },
    "storage": {
      "status": "healthy",
      "latency": 1,
      "details": {
        "path": "/var/www/.../flyer-images"
      }
    }
  }
}
```
**Status Logic**:
- `healthy` - All critical services (database, Redis) are healthy
- `degraded` - A non-critical issue is present (for example, too many waiting database connections or a storage problem)
- `unhealthy` - A critical service is unavailable; the endpoint returns `503`
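
The status logic above can be sketched as a small aggregation function. This is an illustration under stated assumptions, not the project's implementation: the names `aggregateReadiness` and `CRITICAL_SERVICES` are hypothetical, storage is treated as non-critical, and a `degraded` result is assumed to still return `200`, which the text above implies but does not state.

```typescript
type ReadinessStatus = 'healthy' | 'degraded' | 'unhealthy';

interface ServiceReport {
  status: ReadinessStatus;
  latency?: number; // milliseconds
}

// Assumption: only these services can make the whole probe unhealthy.
const CRITICAL_SERVICES: string[] = ['database', 'redis'];

function aggregateReadiness(
  services: Record<string, ServiceReport>,
): { status: ReadinessStatus; httpStatus: 200 | 503 } {
  let overall: ReadinessStatus = 'healthy';
  for (const name of Object.keys(services)) {
    const report = services[name];
    if (report.status === 'healthy') continue;
    const isCritical = CRITICAL_SERVICES.indexOf(name) !== -1;
    if (isCritical && report.status === 'unhealthy') {
      // A critical dependency is down: stop receiving traffic.
      return { status: 'unhealthy', httpStatus: 503 };
    }
    // Any other problem lowers the status but keeps the instance in rotation.
    overall = 'degraded';
  }
  return { status: overall, httpStatus: 200 };
}
```

Returning early on the first critical failure keeps the rule easy to audit: only a critical outage can produce a `503`.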
### Startup Probe
The startup probe is used during container initialization:
```jsonc
// GET /api/health/startup
// Success (200):
{
  "status": "started",
  "timestamp": "2026-01-09T12:00:00.000Z",
  "database": { "status": "healthy", "latency": 5 }
}
// Still starting (503):
{
  "status": "starting",
  "message": "Waiting for database connection",
  "database": { "status": "unhealthy", "message": "..." }
}
```
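
The two response shapes above follow directly from the result of the database check. A minimal sketch of that mapping is shown below; `buildStartupResponse` and the `DbCheckResult` type are hypothetical names introduced for illustration.

```typescript
// Result of a single database check (hypothetical type).
interface DbCheckResult {
  status: 'healthy' | 'unhealthy';
  latency?: number; // milliseconds, present on success
  message?: string; // error description, present on failure
}

interface StartupBody {
  status: 'started' | 'starting';
  timestamp?: string;
  message?: string;
  database: DbCheckResult;
}

function buildStartupResponse(
  db: DbCheckResult,
  now: Date = new Date(),
): { httpStatus: 200 | 503; body: StartupBody } {
  if (db.status === 'healthy') {
    return {
      httpStatus: 200,
      body: { status: 'started', timestamp: now.toISOString(), database: db },
    };
  }
  // Database not reachable yet: report 503 so the orchestrator keeps waiting.
  return {
    httpStatus: 503,
    body: { status: 'starting', message: 'Waiting for database connection', database: db },
  };
}
```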
### Individual Service Endpoints
For detailed diagnostics:
| Endpoint | Purpose |
| ----------------------- | ------------------------------- |
| `/api/health/ping` | Simple server responsiveness |
| `/api/health/db-schema` | Verify database tables exist |
| `/api/health/db-pool` | Database connection pool status |
| `/api/health/redis` | Redis connectivity |
| `/api/health/storage` | File storage accessibility |
| `/api/health/time` | Server time synchronization |
## Kubernetes Configuration Example
```yaml
apiVersion: v1
kind: Pod
spec:
  containers:
    - name: flyer-crawler
      livenessProbe:
        httpGet:
          path: /api/health/live
          port: 3001
        initialDelaySeconds: 10
        periodSeconds: 15
        failureThreshold: 3
      readinessProbe:
        httpGet:
          path: /api/health/ready
          port: 3001
        initialDelaySeconds: 5
        periodSeconds: 10
        failureThreshold: 3
      startupProbe:
        httpGet:
          path: /api/health/startup
          port: 3001
        initialDelaySeconds: 0
        periodSeconds: 5
        failureThreshold: 30 # Allow up to 150 seconds for startup
```
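
The 150-second figure in the startup probe comment comes from multiplying the probe period by the failure threshold. The sketch below makes that arithmetic explicit so the values can be re-derived when they are tuned. The helper name `failureWindowSeconds` is hypothetical, and the result is an approximation that ignores per-request timeouts.

```typescript
// The three probe timing fields used in the manifest above.
interface ProbeTiming {
  initialDelaySeconds: number;
  periodSeconds: number;
  failureThreshold: number;
}

// Approximate time from container start until the probe is declared failed,
// assuming every attempt fails.
function failureWindowSeconds(probe: ProbeTiming): number {
  return probe.initialDelaySeconds + probe.periodSeconds * probe.failureThreshold;
}

const startupProbe: ProbeTiming = {
  initialDelaySeconds: 0,
  periodSeconds: 5,
  failureThreshold: 30,
};

// 0 + 5 * 30 = 150 seconds
console.log(failureWindowSeconds(startupProbe));
```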
## Docker Compose Configuration Example
```yaml
services:
  api:
    image: flyer-crawler:latest
    healthcheck:
      test: ['CMD', 'curl', '-f', 'http://localhost:3001/api/health/ready']
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s
```
## PM2 Configuration Example
For non-containerized deployments using PM2:
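```javascript
// ecosystem.config.js
module.exports = {
  apps: [
    {
      name: 'flyer-crawler',
      script: 'dist/server.js',
      // Intended behavior: poll this endpoint and restart the process if it fails.
      // NOTE: `health_check` is not among the ecosystem options listed in the
      // PM2 documentation, so verify that your PM2 version honors it. Otherwise
      // an external monitor must poll the URL and restart the process.
      health_check: {
        url: 'http://localhost:3001/api/health/ready',
        interval: 30000, // milliseconds between checks
        timeout: 10000, // milliseconds before a check counts as failed
      },
    },
  ],
};
```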
## Key Files
- `src/routes/health.routes.ts` - Health check endpoint implementations
- `server.ts` - Health routes mounted at `/api/health`
## Service Health Thresholds
| Service | Healthy | Degraded | Unhealthy |
| -------- | ---------------------- | ----------------------- | ------------------- |
| Database | Responds to `SELECT 1` | > 3 waiting connections | Connection fails |
| Redis | `PING` returns `PONG` | N/A | Connection fails |
| Storage | Write access to path | N/A | Path not accessible |
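
The database row of the table can be expressed as a small classification function. The sketch below is illustrative only: `classifyDatabase`, `MAX_WAITING_CONNECTIONS`, and the `PoolStats` type are hypothetical names, with the pool fields mirroring the `details` object in the readiness response.

```typescript
type DbHealthStatus = 'healthy' | 'degraded' | 'unhealthy';

// Pool counters, named after the `details` fields in the readiness response.
interface PoolStats {
  totalConnections: number;
  idleConnections: number;
  waitingConnections: number;
}

// Threshold from the table: more than 3 waiting connections is degraded.
const MAX_WAITING_CONNECTIONS = 3;

// `queryOk` is assumed to be the outcome of running `SELECT 1`.
function classifyDatabase(queryOk: boolean, pool: PoolStats): DbHealthStatus {
  if (!queryOk) return 'unhealthy';
  if (pool.waitingConnections > MAX_WAITING_CONNECTIONS) return 'degraded';
  return 'healthy';
}
```

Checking the query result first means a failed connection is always reported as `unhealthy`, regardless of what the pool counters say.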