# ADR-020: Health Checks and Liveness/Readiness Probes

**Date**: 2025-12-12

**Status**: Accepted

**Implemented**: 2026-01-09

## Context

When the application is containerized (`ADR-014`), the container orchestrator (e.g., Kubernetes, Docker Swarm) needs a way to determine whether the application is running correctly. Without that signal, it cannot reliably manage application lifecycle events such as restarts or rolling updates.

## Decision

We will implement dedicated health check endpoints in the Express application.

- A **Liveness Probe** (`/api/health/live`) will return `200 OK` to indicate that the server is running. If it fails, the orchestrator should restart the container.

- A **Readiness Probe** (`/api/health/ready`) will return `200 OK` only if the application is ready to accept traffic (e.g., the database connection is established). If it fails, the orchestrator will temporarily remove the container from the load balancer.

## Consequences

- **Positive**: Enables robust, automated application lifecycle management in a containerized environment. Prevents traffic from being sent to unhealthy or uninitialized application instances.
- **Negative**: Adds a small amount of code for the health check endpoints. Requires configuration in the container orchestration layer.

## Implementation Status

### What's Implemented

- ✅ **Liveness Probe** (`/api/health/live`) - Simple process health check
- ✅ **Readiness Probe** (`/api/health/ready`) - Comprehensive dependency health check
- ✅ **Startup Probe** (`/api/health/startup`) - Initial startup verification
- ✅ **Individual Service Checks** - Database, Redis, Storage endpoints
- ✅ **Detailed Health Response** - Service latency, status, and details

## Implementation Details

### Probe Endpoints

| Endpoint              | Purpose         | Checks             | HTTP Status                   |
| --------------------- | --------------- | ------------------ | ----------------------------- |
| `/api/health/live`    | Liveness probe  | Process running    | 200 = alive                   |
| `/api/health/ready`   | Readiness probe | DB, Redis, Storage | 200 = ready, 503 = not ready  |
| `/api/health/startup` | Startup probe   | Database only      | 200 = started, 503 = starting |

### Liveness Probe

The liveness probe is intentionally simple and has no external dependencies:

```typescript
// GET /api/health/live
{
  "status": "ok",
  "timestamp": "2026-01-09T12:00:00.000Z"
}
```

**Usage**: If this endpoint fails to respond, the container should be restarted.

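
As a minimal sketch of a handler with this shape (not the actual contents of `src/routes/health.routes.ts`), the payload above could be produced as follows. The names `JsonResponse`, `livenessBody`, and `liveHandler` are hypothetical; `JsonResponse` is a structural stand-in for the two Express response methods used, so the snippet compiles without the `express` package.

```typescript
// Hypothetical stand-in for the subset of Express's Response used here.
interface JsonResponse {
  status(code: number): JsonResponse;
  json(body: unknown): unknown;
}

interface LivenessBody {
  status: "ok";
  timestamp: string;
}

// Builds the liveness payload; `now` is injectable so tests are deterministic.
function livenessBody(now: Date = new Date()): LivenessBody {
  return { status: "ok", timestamp: now.toISOString() };
}

// Touches no database, cache, or disk, so it only fails when the process
// itself can no longer serve requests.
function liveHandler(_req: unknown, res: JsonResponse): void {
  res.status(200).json(livenessBody());
}
```

In an Express router, a handler like this would typically be registered with something along the lines of `router.get('/live', liveHandler)`. Keeping dependency checks out of this path avoids restarting a healthy process just because the database is briefly unreachable.
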
### Readiness Probe

The readiness probe checks all critical dependencies:

```typescript
// GET /api/health/ready
{
  "status": "healthy", // healthy | degraded | unhealthy
  "timestamp": "2026-01-09T12:00:00.000Z",
  "uptime": 3600.5,
  "services": {
    "database": {
      "status": "healthy",
      "latency": 5,
      "details": {
        "totalConnections": 10,
        "idleConnections": 8,
        "waitingConnections": 0
      }
    },
    "redis": {
      "status": "healthy",
      "latency": 2
    },
    "storage": {
      "status": "healthy",
      "latency": 1,
      "details": {
        "path": "/var/www/.../flyer-images"
      }
    }
  }
}
```

**Status Logic**:

- `healthy` - All critical services (database, Redis) are healthy
- `degraded` - Non-critical issues are present (high connection wait, storage issues)
- `unhealthy` - A critical service is unavailable (returns 503)

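
The status logic above can be sketched as a pure function over the per-service results. This is an illustration under stated assumptions, not the project's implementation: the type and function names are hypothetical, database and Redis are treated as the only critical services (as the list suggests), and `degraded` is assumed to still return 200 because the source only states that `unhealthy` returns 503.

```typescript
type ServiceStatus = "healthy" | "degraded" | "unhealthy";

interface ServiceHealth {
  status: ServiceStatus;
  latency?: number;
  message?: string;
  details?: Record<string, unknown>;
}

interface ReadinessServices {
  database: ServiceHealth;
  redis: ServiceHealth;
  storage: ServiceHealth;
}

// Assumption taken from the Status Logic list: database and Redis are critical.
const CRITICAL: ReadonlyArray<keyof ReadinessServices> = ["database", "redis"];

function overallStatus(services: ReadinessServices): ServiceStatus {
  // Any critical service down makes the whole instance unhealthy.
  if (CRITICAL.some((name) => services[name].status === "unhealthy")) {
    return "unhealthy";
  }
  // Anything else that is not fully healthy (e.g. storage) only degrades.
  const all = [services.database, services.redis, services.storage];
  return all.every((s) => s.status === "healthy") ? "healthy" : "degraded";
}

// Assumption: only `unhealthy` maps to 503; the source does not specify
// the HTTP status for `degraded`.
function httpStatusFor(status: ServiceStatus): 200 | 503 {
  return status === "unhealthy" ? 503 : 200;
}
```

Separating the aggregation from the individual checks keeps the rule easy to unit-test without a live database or Redis instance.
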
### Startup Probe

The startup probe is used during container initialization:

```typescript
// GET /api/health/startup
// Success (200):
{
  "status": "started",
  "timestamp": "2026-01-09T12:00:00.000Z",
  "database": { "status": "healthy", "latency": 5 }
}

// Still starting (503):
{
  "status": "starting",
  "message": "Waiting for database connection",
  "database": { "status": "unhealthy", "message": "..." }
}
```

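
The two startup responses above can be derived from a single database check result. The following is a hedged sketch only: `DbCheck`, `StartupResult`, and `startupResponse` are hypothetical names, and the field layout simply mirrors the example payloads.

```typescript
interface DbCheck {
  status: "healthy" | "unhealthy";
  latency?: number;
  message?: string;
}

interface StartupResult {
  httpStatus: 200 | 503;
  body: {
    status: "started" | "starting";
    timestamp?: string;
    message?: string;
    database: DbCheck;
  };
}

// Maps the database check to the HTTP status and body of the startup probe.
function startupResponse(db: DbCheck, now: Date = new Date()): StartupResult {
  if (db.status === "healthy") {
    return {
      httpStatus: 200,
      body: { status: "started", timestamp: now.toISOString(), database: db },
    };
  }
  return {
    httpStatus: 503,
    body: {
      status: "starting",
      message: "Waiting for database connection",
      database: db,
    },
  };
}
```
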
### Individual Service Endpoints

For detailed diagnostics:

| Endpoint                | Purpose                         |
| ----------------------- | ------------------------------- |
| `/api/health/ping`      | Simple server responsiveness    |
| `/api/health/db-schema` | Verify database tables exist    |
| `/api/health/db-pool`   | Database connection pool status |
| `/api/health/redis`     | Redis connectivity              |
| `/api/health/storage`   | File storage accessibility      |
| `/api/health/time`      | Server time synchronization     |

## Kubernetes Configuration Example

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: flyer-crawler # added so the manifest is complete; adjust as needed
spec:
  containers:
    - name: flyer-crawler
      image: flyer-crawler:latest # image name taken from the Docker Compose example
      livenessProbe:
        httpGet:
          path: /api/health/live
          port: 3001
        initialDelaySeconds: 10
        periodSeconds: 15
        failureThreshold: 3

      readinessProbe:
        httpGet:
          path: /api/health/ready
          port: 3001
        initialDelaySeconds: 5
        periodSeconds: 10
        failureThreshold: 3

      startupProbe:
        httpGet:
          path: /api/health/startup
          port: 3001
        initialDelaySeconds: 0
        periodSeconds: 5
        failureThreshold: 30 # Allow up to 150 seconds for startup
```

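
The "150 seconds" figure follows from `periodSeconds × failureThreshold` (5 × 30) plus the initial delay. A small sketch of that arithmetic, useful when tuning the values, is shown below. The function and type names are hypothetical, and the results are approximate bounds that ignore per-probe timeouts and scheduling jitter.

```typescript
interface ProbeTiming {
  initialDelaySeconds: number;
  periodSeconds: number;
  failureThreshold: number;
}

// Approximate time a container is given before the probe is declared failed,
// counted from container start.
function startupBudgetSeconds(p: ProbeTiming): number {
  return p.initialDelaySeconds + p.periodSeconds * p.failureThreshold;
}

// Approximate worst-case time from a failure beginning (just after a
// successful probe) until the failure threshold is reached.
function detectionWindowSeconds(p: ProbeTiming): number {
  return p.periodSeconds * p.failureThreshold;
}

const startupProbe: ProbeTiming = { initialDelaySeconds: 0, periodSeconds: 5, failureThreshold: 30 };
const livenessProbe: ProbeTiming = { initialDelaySeconds: 10, periodSeconds: 15, failureThreshold: 3 };
const readinessProbe: ProbeTiming = { initialDelaySeconds: 5, periodSeconds: 10, failureThreshold: 3 };
```

With the values from the manifest, `startupBudgetSeconds(startupProbe)` gives 150, while `detectionWindowSeconds` gives roughly 45 seconds for liveness and 30 seconds for readiness.
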
## Docker Compose Configuration Example

```yaml
services:
  api:
    image: flyer-crawler:latest
    healthcheck:
      test: ['CMD', 'curl', '-f', 'http://localhost:3001/api/health/ready']
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s
```

## PM2 Configuration Example

For non-containerized deployments using PM2:

```javascript
// ecosystem.config.js
module.exports = {
  apps: [
    {
      name: 'flyer-crawler',
      script: 'dist/server.js',
      // Intended behavior: PM2 checks this endpoint and restarts the app if it fails.
      // Caution: `health_check` does not appear among PM2's documented ecosystem
      // options; confirm that your PM2 version or a plugin honors it before relying on it.
      health_check: {
        url: 'http://localhost:3001/api/health/ready',
        interval: 30000, // presumably milliseconds
        timeout: 10000, // presumably milliseconds
      },
    },
  ],
};
```

## Key Files

- `src/routes/health.routes.ts` - Health check endpoint implementations
- `server.ts` - Health routes mounted at `/api/health`

## Service Health Thresholds

The thresholds below apply to each service individually; the overall readiness status is derived from them as described under Status Logic.

| Service  | Healthy                | Degraded                | Unhealthy           |
| -------- | ---------------------- | ----------------------- | ------------------- |
| Database | Responds to `SELECT 1` | > 3 waiting connections | Connection fails    |
| Redis    | `PING` returns `PONG`  | N/A                     | Connection fails    |
| Storage  | Write access to path   | N/A                     | Path not accessible |

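
The per-service thresholds in the table can be expressed as small classification functions. This is a sketch assuming the raw check results are already available as plain values; the names `HealthLevel`, `classifyDatabase`, `classifyRedis`, and `classifyStorage` are hypothetical and not taken from the codebase.

```typescript
type HealthLevel = "healthy" | "degraded" | "unhealthy";

// Threshold from the table: more than 3 waiting connections is degraded.
const MAX_WAITING_CONNECTIONS = 3;

// `selectOneSucceeded` is the outcome of running `SELECT 1` against the pool.
function classifyDatabase(selectOneSucceeded: boolean, waitingConnections: number): HealthLevel {
  if (!selectOneSucceeded) return "unhealthy";
  return waitingConnections > MAX_WAITING_CONNECTIONS ? "degraded" : "healthy";
}

// `pingReply` is the reply to `PING`, or null if the connection failed.
function classifyRedis(pingReply: string | null): HealthLevel {
  return pingReply === "PONG" ? "healthy" : "unhealthy";
}

// `writable` is whether a write-access check on the storage path succeeded.
function classifyStorage(writable: boolean): HealthLevel {
  return writable ? "healthy" : "unhealthy";
}
```
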