# ADR-053: Worker Health Checks and Stalled Job Monitoring
**Date**: 2026-01-11
**Status**: Accepted (Fully Implemented)
**Implementation Status**:
- ✅ BullMQ worker stall configuration (complete)
- ✅ Basic health endpoints (/live, /ready, /redis, etc.)
- ✅ /health/queues endpoint (complete)
- ✅ Worker heartbeat mechanism (complete)
## Context
Our application relies heavily on background workers (BullMQ) for flyer processing, analytics, and emails. If a worker process crashes (e.g., Out of Memory) or hangs, jobs may remain in the 'active' state indefinitely ("stalled") until BullMQ's fail-safe triggers.
Currently, we lack:
1. Visibility into queue depths and worker status via HTTP endpoints (for uptime monitors).
2. A mechanism to detect whether the worker process itself is alive, beyond queue statistics alone.
3. Explicit configuration to ensure stalled jobs are recovered quickly.
## Decision
We will implement a multi-layered health check strategy for background workers:
1. **Queue Metrics Endpoint**: Expose a protected endpoint `GET /health/queues` that returns the counts (waiting, active, failed) for all critical queues.
2. **Stalled Job Configuration**: Explicitly configure BullMQ workers with aggressive stall detection settings to recover quickly from crashes.
3. **Worker Heartbeats**: Workers will periodically update a "heartbeat" key in Redis. The health endpoint will check if this timestamp is recent.
## Implementation
### 1. BullMQ Worker Settings
Workers must be initialized with specific options to handle stalls:
```typescript
const workerOptions = {
  // Check for stalled jobs every 30 seconds.
  stalledInterval: 30000,
  // Fail a job after 3 stalls (prevents a repeatedly stalling job from retrying forever).
  maxStalledCount: 3,
  // Duration of the job lock in milliseconds.
  // If the worker does not renew it (e.g. after a crash), the job is marked stalled.
  lockDuration: 30000,
};
```
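For illustration, a minimal sketch of how these options might be spread into a worker's constructor; the queue name, processor body, and Redis connection details below are assumptions, not the project's actual values:
```typescript
import { Worker, Job } from 'bullmq';

// Hypothetical queue name and processor, shown only to illustrate how the options are applied.
const flyerWorker = new Worker(
  'flyer-processing',
  async (job: Job) => {
    // ...process the flyer job...
  },
  {
    connection: { host: process.env.REDIS_HOST ?? 'localhost', port: 6379 },
    ...workerOptions,
  },
);
```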
### 2. Health Endpoint Logic
The `/health/queues` endpoint will (a minimal sketch follows this list):
1. Iterate through all defined queues (`flyerQueue`, `emailQueue`, etc.).
2. Fetch job counts (`waiting`, `active`, `failed`, `delayed`).
3. Return a 200 OK if queues are accessible, or 503 if Redis is unreachable.
4. (Future) Return 500 if the `waiting` count exceeds a critical threshold for too long.
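A minimal Express-style sketch of this logic, assuming the queues are exported as a map; the `allQueues` registry and response field names are illustrative, not the actual code in `src/routes/health.routes.ts`:
```typescript
import { Router, Request, Response } from 'express';
import type { Queue } from 'bullmq';

// Hypothetical registry of queues; the project's real map lives in src/services/queues.server.ts.
declare const allQueues: Record<string, Queue>;

const router = Router();

router.get('/health/queues', async (_req: Request, res: Response) => {
  try {
    const queues: Record<string, Record<string, number>> = {};
    for (const [name, queue] of Object.entries(allQueues)) {
      // getJobCounts returns the counts keyed by state name.
      queues[name] = await queue.getJobCounts('waiting', 'active', 'failed', 'delayed');
    }
    res.status(200).json({ status: 'healthy', timestamp: new Date().toISOString(), queues });
  } catch (err) {
    // Redis unreachable (or another queue error): surface a 503 for uptime monitors.
    res.status(503).json({ status: 'unhealthy', error: (err as Error).message });
  }
});

export default router;
```
Surfacing Redis failures as 503 keeps the contract simple for monitors: any non-200 response means the pipeline needs attention.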
## Consequences
**Positive**:
- Early detection of stuck processing pipelines.
- Automatic recovery of stalled jobs via BullMQ configuration.
- Metrics available for external monitoring tools (e.g., UptimeRobot, Datadog).
**Negative**:
- Requires configuring external monitoring to poll the new endpoint.
## Implementation Notes
### Completed (2026-01-11)
1. **BullMQ Stall Configuration** - `src/config/workerOptions.ts`
- All workers use `defaultWorkerOptions` with:
- `stalledInterval: 30000` (30s)
- `maxStalledCount: 3`
- `lockDuration: 30000` (30s)
- Applied to all 9 workers: flyer, email, analytics, cleanup, weekly-analytics, token-cleanup, receipt, expiry-alert, barcode
2. **Basic Health Endpoints** - `src/routes/health.routes.ts`
- `/health/live` - Liveness probe
- `/health/ready` - Readiness probe (checks DB, Redis, storage)
- `/health/startup` - Startup probe
- `/health/redis` - Redis connectivity
- `/health/db-pool` - Database connection pool status
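A rough sketch of the liveness/readiness pattern these endpoints follow, assuming an Express router; the `checkDatabase`/`checkRedis`/`checkStorage` helpers are placeholders for the project's real checks:
```typescript
import { Router, Request, Response } from 'express';

// Placeholder checks; the real implementations verify the DB pool, Redis, and storage.
declare function checkDatabase(): Promise<boolean>;
declare function checkRedis(): Promise<boolean>;
declare function checkStorage(): Promise<boolean>;

const router = Router();

// Liveness: the process is up and the event loop can answer.
router.get('/health/live', (_req: Request, res: Response) => {
  res.status(200).json({ status: 'ok' });
});

// Readiness: every hard dependency is reachable; otherwise report 503.
router.get('/health/ready', async (_req: Request, res: Response) => {
  const [db, redis, storage] = await Promise.all([checkDatabase(), checkRedis(), checkStorage()]);
  const ready = db && redis && storage;
  res.status(ready ? 200 : 503).json({ status: ready ? 'ready' : 'not_ready', db, redis, storage });
});

export default router;
```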
### Implementation Completed (2026-01-26)
1. **`/health/queues` Endpoint** ✅
- Added route to `src/routes/health.routes.ts:511-674`
- Iterates through all 9 queues from `src/services/queues.server.ts`
- Fetches job counts using BullMQ Queue API: `getJobCounts()`
- Returns structured response including both queue metrics and worker heartbeats:
```typescript
{
  status: 'healthy' | 'unhealthy',
  timestamp: string,
  queues: {
    [queueName: string]: {
      waiting: number,
      active: number,
      failed: number,
      delayed: number
    }
  },
  workers: {
    [workerName: string]: {
      alive: boolean,
      lastSeen?: string,
      pid?: number,
      host?: string
    }
  }
}
```
- Returns 200 OK if all healthy, 503 if any queue/worker unavailable
- Full OpenAPI documentation included
2. **Worker Heartbeat Mechanism** ✅ (a sketch follows this list)
- Added `updateWorkerHeartbeat()` and `startWorkerHeartbeat()` in `src/services/workers.server.ts:100-149`
- Key pattern: `worker:heartbeat:<worker-name>`
- Stores: `{ timestamp: ISO8601, pid: number, host: string }`
- Updates every 30s with 90s TTL
- Integrated with `/health/queues` endpoint (checks if heartbeat < 60s old)
- Heartbeat intervals properly cleaned up in `closeWorkers()` and `gracefulShutdown()`
3. **Comprehensive Tests** ✅
- Added 5 test cases in `src/routes/health.routes.test.ts:623-858`
- Tests cover: healthy state, queue failures, stale heartbeats, missing heartbeats, Redis errors
- All tests follow existing patterns with proper mocking
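A minimal sketch of the heartbeat helpers described in item 2 above, assuming an ioredis v5 client; the function signatures, constant names, and parameters are illustrative (the real code lives in `src/services/workers.server.ts`):
```typescript
import { Redis } from 'ioredis';
import os from 'node:os';

const HEARTBEAT_INTERVAL_MS = 30_000; // refresh every 30 seconds
const HEARTBEAT_TTL_SECONDS = 90; // key expires if three refreshes in a row are missed

// Write (or refresh) the heartbeat key for one worker.
export async function updateWorkerHeartbeat(redis: Redis, workerName: string): Promise<void> {
  const payload = JSON.stringify({
    timestamp: new Date().toISOString(),
    pid: process.pid,
    host: os.hostname(),
  });
  // EX sets a TTL so a dead worker's key disappears on its own.
  await redis.set(`worker:heartbeat:${workerName}`, payload, 'EX', HEARTBEAT_TTL_SECONDS);
}

// Start the periodic refresh; the caller must clearInterval() the returned handle on shutdown.
export function startWorkerHeartbeat(redis: Redis, workerName: string): NodeJS.Timeout {
  void updateWorkerHeartbeat(redis, workerName);
  return setInterval(() => void updateWorkerHeartbeat(redis, workerName), HEARTBEAT_INTERVAL_MS);
}
```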
### Future Enhancements (Not Implemented)
1. **Queue Depth Alerting** (Low Priority)
- Add configurable thresholds per queue type
- Return 500 if the `waiting` count exceeds its threshold for an extended period
- Consider using Redis for storing threshold breach timestamps
- **Estimate**: 1-2 hours
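Because this item is not implemented, the sketch below is only one possible shape for the threshold check; the queue names and limits are invented:
```typescript
// Illustrative per-queue limits; real values would come from configuration.
const QUEUE_DEPTH_THRESHOLDS: Record<string, number> = {
  flyer: 100,
  email: 500,
};

// Names of queues whose waiting count exceeds their configured threshold.
function findBreachedQueues(counts: Record<string, { waiting: number }>): string[] {
  return Object.entries(counts)
    .filter(([name, c]) => c.waiting > (QUEUE_DEPTH_THRESHOLDS[name] ?? Number.POSITIVE_INFINITY))
    .map(([name]) => name);
}
```
Tracking how long a breach has persisted (the "extended period" condition) would additionally need a first-breach timestamp per queue, e.g. stored in Redis as suggested above.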