# ADR-053: Worker Health Checks and Stalled Job Monitoring

**Date**: 2026-01-11

**Status**: Accepted (Fully Implemented)

**Implementation Status**:

- ✅ BullMQ worker stall configuration (complete)
- ✅ Basic health endpoints (`/live`, `/ready`, `/redis`, etc.)
- ✅ `/health/queues` endpoint (complete)
- ✅ Worker heartbeat mechanism (complete)

## Context

Our application relies heavily on BullMQ background workers for flyer processing, analytics, and email delivery. If a worker process crashes (e.g., out of memory) or hangs, its jobs may remain in the `active` state indefinitely ("stalled") until BullMQ's fail-safe triggers.

Currently, we lack:

1. Visibility into queue depths and worker status via HTTP endpoints (for uptime monitors).
2. A mechanism to detect whether the worker process itself is alive, beyond queue statistics.
3. Explicit configuration to ensure stalled jobs are recovered quickly.

## Decision

We will implement a multi-layered health check strategy for background workers:

1. **Queue Metrics Endpoint**: Expose a protected endpoint `GET /health/queues` that returns the counts (waiting, active, failed) for all critical queues.
2. **Stalled Job Configuration**: Explicitly configure BullMQ workers with aggressive stall detection settings to recover quickly from crashes.
3. **Worker Heartbeats**: Workers will periodically update a "heartbeat" key in Redis. The health endpoint will check that this timestamp is recent.

## Implementation

### 1. BullMQ Worker Settings

Workers must be initialized with specific options to handle stalls:

```typescript
const workerOptions = {
  // Check for stalled jobs every 30 seconds
  stalledInterval: 30000,
  // Fail the job after 3 stalls (prevents a repeatedly crashing job from retrying forever)
  maxStalledCount: 3,
  // Duration of the job lock in milliseconds.
  // If the worker doesn't renew the lock (e.g., it crashed), the job is considered stalled.
  lockDuration: 30000,
};
```
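
A minimal sketch of how these options might be applied when constructing a worker. The queue name, Redis connection, and processor body are illustrative; the actual wiring lives in `src/config/workerOptions.ts` and `src/services/workers.server.ts`:

```typescript
import { Worker } from 'bullmq';
import IORedis from 'ioredis';

// BullMQ workers need a blocking-capable connection (maxRetriesPerRequest: null).
const connection = new IORedis(process.env.REDIS_URL ?? 'redis://localhost:6379', {
  maxRetriesPerRequest: null,
});

// Illustrative flyer worker using the stall settings above.
const flyerWorker = new Worker(
  'flyer',
  async (job) => {
    // ...process the flyer job...
  },
  {
    connection,
    stalledInterval: 30000, // scan for stalled jobs every 30s
    maxStalledCount: 3,     // fail the job after it has stalled 3 times
    lockDuration: 30000,    // lock must be renewed within 30s, or the job is considered stalled
  },
);

// Log whenever a job is detected as stalled and returned to the wait list.
flyerWorker.on('stalled', (jobId) => {
  console.warn(`Job ${jobId} stalled and was moved back to waiting`);
});
```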

### 2. Health Endpoint Logic

The `/health/queues` endpoint will:

1. Iterate through all defined queues (`flyerQueue`, `emailQueue`, etc.).
2. Fetch job counts (`waiting`, `active`, `failed`, `delayed`).
3. Return 200 OK if the queues are accessible, or 503 if Redis is unreachable.
4. (Future) Return 500 if the `waiting` count exceeds a critical threshold for too long.
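
A minimal sketch of this logic, assuming an Express-style router and a `queues` map exported from `src/services/queues.server.ts` (names and structure are assumptions, not the actual implementation):

```typescript
import { Router } from 'express';
import type { Queue } from 'bullmq';
import { queues } from '../services/queues.server'; // assumed shape: Record<string, Queue>

const router = Router();

router.get('/health/queues', async (_req, res) => {
  try {
    const counts: Record<string, Record<string, number>> = {};

    // Collect waiting/active/failed/delayed counts for every registered queue.
    for (const [name, queue] of Object.entries<Queue>(queues)) {
      counts[name] = await queue.getJobCounts('waiting', 'active', 'failed', 'delayed');
    }

    res.status(200).json({
      status: 'healthy',
      timestamp: new Date().toISOString(),
      queues: counts,
    });
  } catch (err) {
    // Redis unreachable (or another queue error): report the service as unavailable.
    res.status(503).json({ status: 'unhealthy', error: (err as Error).message });
  }
});

export default router;
```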

## Consequences

**Positive**:

- Early detection of stuck processing pipelines.
- Automatic recovery of stalled jobs via BullMQ configuration.
- Metrics available for external monitoring tools (e.g., UptimeRobot, Datadog).

**Negative**:

- Requires configuring external monitoring to poll the new endpoint.

## Implementation Notes

### Completed (2026-01-11)

1. **BullMQ Stall Configuration** - `src/config/workerOptions.ts`
   - All workers use `defaultWorkerOptions` with:
     - `stalledInterval: 30000` (30s)
     - `maxStalledCount: 3`
     - `lockDuration: 30000` (30s)
   - Applied to all 9 workers: flyer, email, analytics, cleanup, weekly-analytics, token-cleanup, receipt, expiry-alert, barcode

2. **Basic Health Endpoints** - `src/routes/health.routes.ts`
   - `/health/live` - Liveness probe
   - `/health/ready` - Readiness probe (checks DB, Redis, storage)
   - `/health/startup` - Startup probe
   - `/health/redis` - Redis connectivity
   - `/health/db-pool` - Database connection pool status

### Implementation Completed (2026-01-26)

1. **`/health/queues` Endpoint** ✅
   - Added route to `src/routes/health.routes.ts:511-674`
   - Iterates through all 9 queues from `src/services/queues.server.ts`
   - Fetches job counts via the BullMQ Queue API: `getJobCounts()`
   - Returns a structured response including both queue metrics and worker heartbeats:

   ```typescript
   {
     status: 'healthy' | 'unhealthy',
     timestamp: string,
     queues: {
       [queueName]: {
         waiting: number,
         active: number,
         failed: number,
         delayed: number
       }
     },
     workers: {
       [workerName]: {
         alive: boolean,
         lastSeen?: string,
         pid?: number,
         host?: string
       }
     }
   }
   ```

   - Returns 200 OK if all healthy, 503 if any queue/worker unavailable
   - Full OpenAPI documentation included

2. **Worker Heartbeat Mechanism** ✅ (a sketch of the heartbeat loop follows this list)
   - Added `updateWorkerHeartbeat()` and `startWorkerHeartbeat()` in `src/services/workers.server.ts:100-149`
   - Key pattern: `worker:heartbeat:<worker-name>`
   - Stores: `{ timestamp: ISO8601, pid: number, host: string }`
   - Updates every 30s with a 90s TTL
   - Integrated with the `/health/queues` endpoint (checks that the heartbeat is < 60s old)
   - Heartbeat intervals are properly cleaned up in `closeWorkers()` and `gracefulShutdown()`

3. **Comprehensive Tests** ✅
   - Added 5 test cases in `src/routes/health.routes.test.ts:623-858`
   - Tests cover: healthy state, queue failures, stale heartbeats, missing heartbeats, Redis errors
   - All tests follow existing patterns with proper mocking
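
A minimal sketch of what the heartbeat helpers might look like, assuming an ioredis client; the function names come from `src/services/workers.server.ts`, but the signatures and internals shown here are assumptions:

```typescript
import os from 'node:os';
import IORedis from 'ioredis';

const redis = new IORedis(process.env.REDIS_URL ?? 'redis://localhost:6379');

const HEARTBEAT_INTERVAL_MS = 30_000; // update every 30s
const HEARTBEAT_TTL_SECONDS = 90;     // key expires if three consecutive updates are missed

// Write the heartbeat key for one worker: worker:heartbeat:<worker-name>
export async function updateWorkerHeartbeat(workerName: string): Promise<void> {
  const payload = JSON.stringify({
    timestamp: new Date().toISOString(),
    pid: process.pid,
    host: os.hostname(),
  });
  await redis.set(`worker:heartbeat:${workerName}`, payload, 'EX', HEARTBEAT_TTL_SECONDS);
}

// Start the periodic heartbeat; the returned interval must be cleared on shutdown
// (e.g., from closeWorkers() or gracefulShutdown()).
export function startWorkerHeartbeat(workerName: string): NodeJS.Timeout {
  void updateWorkerHeartbeat(workerName); // write immediately on startup
  return setInterval(() => void updateWorkerHeartbeat(workerName), HEARTBEAT_INTERVAL_MS);
}
```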

### Future Enhancements (Not Implemented)

1. **Queue Depth Alerting** (Low Priority)
   - Add configurable thresholds per queue type
   - Return 500 if the `waiting` count exceeds its threshold for an extended period
   - Consider using Redis to store threshold breach timestamps (a rough sketch of this check follows below)
   - **Estimate**: 1-2 hours
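
If this is implemented, the check could look roughly like the following. The threshold values, the breach-key pattern, and the function name are all hypothetical:

```typescript
import type { Queue } from 'bullmq';
import IORedis from 'ioredis';

const redis = new IORedis(process.env.REDIS_URL ?? 'redis://localhost:6379');

// Hypothetical per-queue thresholds and how long a breach may persist before we alert.
const WAITING_THRESHOLDS: Record<string, number> = { flyer: 500, email: 1000 };
const BREACH_GRACE_MS = 10 * 60 * 1000; // 10 minutes

// True if the queue's waiting count has exceeded its threshold for longer than the grace period.
export async function isQueueBacklogged(name: string, queue: Queue): Promise<boolean> {
  const { waiting } = await queue.getJobCounts('waiting');
  const breachKey = `queue:breach:${name}`;

  if (waiting <= (WAITING_THRESHOLDS[name] ?? Infinity)) {
    await redis.del(breachKey); // back under the threshold: clear any recorded breach
    return false;
  }

  // Record when the breach started; NX keeps the original timestamp on later checks.
  await redis.set(breachKey, Date.now().toString(), 'NX');
  const since = Number(await redis.get(breachKey));
  return Date.now() - since > BREACH_GRACE_MS;
}
```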