ADR-053: Worker Health Checks and Stalled Job Monitoring
Date: 2026-01-11
Status: Accepted (Fully Implemented)
Implementation Status:
- ✅ BullMQ worker stall configuration (complete)
- ✅ Basic health endpoints (/live, /ready, /redis, etc.)
- ✅ /health/queues endpoint (complete)
- ✅ Worker heartbeat mechanism (complete)
Context
Our application relies heavily on background workers (BullMQ) for flyer processing, analytics, and emails. If a worker process crashes (e.g., Out of Memory) or hangs, jobs may remain in the 'active' state indefinitely ("stalled") until BullMQ's fail-safe triggers.
Currently, we lack:
- Visibility into queue depths and worker status via HTTP endpoints (for uptime monitors).
- A mechanism to detect if the worker process itself is alive, beyond just queue statistics.
- Explicit configuration to ensure stalled jobs are recovered quickly.
Decision
We will implement a multi-layered health check strategy for background workers:
- Queue Metrics Endpoint: Expose a protected endpoint GET /health/queues that returns the counts (waiting, active, failed) for all critical queues.
- Stalled Job Configuration: Explicitly configure BullMQ workers with aggressive stall detection settings to recover quickly from crashes.
- Worker Heartbeats: Workers will periodically update a "heartbeat" key in Redis. The health endpoint will check if this timestamp is recent.
Implementation
1. BullMQ Worker Settings
Workers must be initialized with specific options to handle stalls:
const workerOptions = {
// Check for stalled jobs every 30 seconds
stalledInterval: 30000,
// Fail job after 3 stalls (prevents infinite loops causing infinite retries)
maxStalledCount: 3,
// Duration of the lock for the job in milliseconds.
// If the worker doesn't renew this (e.g. crash), the job stalls.
lockDuration: 30000,
};
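For illustration, a minimal sketch of how these options might be spread into a BullMQ Worker, reusing the workerOptions object above; the queue name, processor body, and Redis connection shown here are assumptions rather than the project's actual wiring:

import { Worker } from 'bullmq';
import IORedis from 'ioredis';

// BullMQ workers require maxRetriesPerRequest: null on the blocking connection they use.
const connection = new IORedis({ maxRetriesPerRequest: null });

// Hypothetical worker for the flyer queue, reusing the stall settings defined above.
const flyerWorker = new Worker(
  'flyer',
  async (job) => {
    // ... process the flyer job ...
  },
  { connection, ...workerOptions },
);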
2. Health Endpoint Logic
The /health/queues endpoint will (a minimal handler sketch follows this list):
- Iterate through all defined queues (flyerQueue, emailQueue, etc.).
- Fetch job counts (waiting, active, failed, delayed).
- Return 200 OK if queues are accessible, or 503 if Redis is unreachable.
- (Future) Return 500 if the waiting count exceeds a critical threshold for too long.
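A minimal sketch of that handler, assuming an Express-style router; the allQueues registry import and the commented-out auth middleware are placeholders for the project's actual modules:

import { Router, type Request, type Response } from 'express';
import type { Queue } from 'bullmq';
// Hypothetical registry of the critical queues, keyed by queue name.
import { allQueues } from '../services/queues.server';

const router = Router();

router.get('/health/queues', /* requireAuth, */ async (_req: Request, res: Response) => {
  try {
    const queues: Record<string, unknown> = {};
    for (const [name, queue] of Object.entries(allQueues as Record<string, Queue>)) {
      // getJobCounts() hits Redis and throws if Redis is unreachable.
      queues[name] = await queue.getJobCounts('waiting', 'active', 'failed', 'delayed');
    }
    res.status(200).json({ status: 'healthy', timestamp: new Date().toISOString(), queues });
  } catch (err) {
    res.status(503).json({ status: 'unhealthy', error: (err as Error).message });
  }
});

export default router;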
Consequences
Positive:
- Early detection of stuck processing pipelines.
- Automatic recovery of stalled jobs via BullMQ configuration.
- Metrics available for external monitoring tools (e.g., UptimeRobot, Datadog).
Negative:
- Requires configuring external monitoring to poll the new endpoint.
Implementation Notes
Completed (2026-01-11)
- BullMQ Stall Configuration - src/config/workerOptions.ts
  - All workers use defaultWorkerOptions with: stalledInterval: 30000 (30s), maxStalledCount: 3, lockDuration: 30000 (30s)
  - Applied to all 9 workers: flyer, email, analytics, cleanup, weekly-analytics, token-cleanup, receipt, expiry-alert, barcode
- Basic Health Endpoints - src/routes/health.routes.ts (a readiness-probe sketch follows this list)
  - /health/live - Liveness probe
  - /health/ready - Readiness probe (checks DB, Redis, storage)
  - /health/startup - Startup probe
  - /health/redis - Redis connectivity
  - /health/db-pool - Database connection pool status
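As a rough sketch of the readiness probe's shape, the individual check helpers below are assumptions; the real implementations live in the project's service modules:

import { Router } from 'express';

const router = Router();

// Hypothetical dependency checks; each resolves if its dependency responds.
declare function checkDatabase(): Promise<void>;
declare function checkRedis(): Promise<void>;
declare function checkStorage(): Promise<void>;

router.get('/health/ready', async (_req, res) => {
  const results = await Promise.allSettled([checkDatabase(), checkRedis(), checkStorage()]);
  const failures = results.filter((r) => r.status === 'rejected').length;

  if (failures === 0) {
    res.status(200).json({ status: 'ready' });
  } else {
    // Not ready: one or more dependencies are unreachable.
    res.status(503).json({ status: 'not ready', failures });
  }
});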
Implementation Completed (2026-01-26)
- /health/queues Endpoint ✅
  - Added route to src/routes/health.routes.ts:511-674
  - Iterates through all 9 queues from src/services/queues.server.ts
  - Fetches job counts using the BullMQ Queue API: getJobCounts()
  - Returns a structured response including both queue metrics and worker heartbeats:
    {
      status: 'healthy' | 'unhealthy',
      timestamp: string,
      queues: { [queueName]: { waiting: number, active: number, failed: number, delayed: number } },
      workers: { [workerName]: { alive: boolean, lastSeen?: string, pid?: number, host?: string } }
    }
  - Returns 200 OK if all healthy, 503 if any queue or worker is unavailable
  - Full OpenAPI documentation included
- Worker Heartbeat Mechanism ✅ (a combined sketch of the writer and the freshness check follows this list)
  - Added updateWorkerHeartbeat() and startWorkerHeartbeat() in src/services/workers.server.ts:100-149
  - Key pattern: worker:heartbeat:<worker-name>
  - Stores: { timestamp: ISO8601, pid: number, host: string }
  - Updates every 30s with a 90s TTL
  - Integrated with the /health/queues endpoint (checks if heartbeat < 60s old)
  - Heartbeat intervals properly cleaned up in closeWorkers() and gracefulShutdown()
- Comprehensive Tests ✅
  - Added 5 test cases in src/routes/health.routes.test.ts:623-858
  - Tests cover: healthy state, queue failures, stale heartbeats, missing heartbeats, Redis errors
  - All tests follow existing patterns with proper mocking
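To make the two heartbeat halves concrete, here is a minimal sketch of both the writer (per the notes above: write worker:heartbeat:<name> every 30s with a 90s TTL) and the freshness check performed by /health/queues (alive only if the heartbeat is under 60s old). The Redis client wiring and exact function signatures are assumptions, not the project's actual code:

import os from 'node:os';
import Redis from 'ioredis';

const redis = new Redis(); // placeholder; the project shares a configured client

// Writer: store { timestamp, pid, host } under worker:heartbeat:<name> with a 90s TTL.
async function updateWorkerHeartbeat(workerName: string): Promise<void> {
  const payload = JSON.stringify({
    timestamp: new Date().toISOString(),
    pid: process.pid,
    host: os.hostname(),
  });
  await redis.set(`worker:heartbeat:${workerName}`, payload, 'EX', 90);
}

// Refresh every 30s; the returned interval is cleared in closeWorkers()/gracefulShutdown().
function startWorkerHeartbeat(workerName: string): NodeJS.Timeout {
  void updateWorkerHeartbeat(workerName);
  return setInterval(() => void updateWorkerHeartbeat(workerName), 30_000);
}

// Reader used by /health/queues: a heartbeat older than 60s (or missing) means not alive.
async function getWorkerStatus(workerName: string) {
  const raw = await redis.get(`worker:heartbeat:${workerName}`);
  if (!raw) return { alive: false };
  const { timestamp, pid, host } = JSON.parse(raw);
  const ageMs = Date.now() - new Date(timestamp).getTime();
  return { alive: ageMs < 60_000, lastSeen: timestamp, pid, host };
}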
Future Enhancements (Not Implemented)
- Queue Depth Alerting (Low Priority) (see the sketch below)
  - Add configurable thresholds per queue type
  - Return 500 if the waiting count exceeds the threshold for an extended period
  - Consider using Redis for storing threshold breach timestamps
  - Estimate: 1-2 hours
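A possible shape for that future check, sketched under the assumption that breach timestamps live in Redis; the threshold values and key names below are illustrative only:

import type { Queue } from 'bullmq';
import Redis from 'ioredis';

const redis = new Redis(); // placeholder client

// Illustrative per-queue threshold and how long a breach may persist before alerting.
const WAITING_THRESHOLD = 500;
const BREACH_GRACE_MS = 10 * 60_000; // 10 minutes

// Returns true if the queue's waiting count has exceeded the threshold for too long.
async function isQueueBacklogged(queue: Queue): Promise<boolean> {
  const { waiting } = await queue.getJobCounts('waiting');
  const breachKey = `queue:breach:${queue.name}`;

  if (waiting <= WAITING_THRESHOLD) {
    await redis.del(breachKey); // backlog cleared; forget any earlier breach
    return false;
  }

  // Record when the breach started; NX keeps the original timestamp on later calls.
  await redis.set(breachKey, Date.now().toString(), 'NX');
  const since = Number(await redis.get(breachKey));
  return Date.now() - since > BREACH_GRACE_MS;
}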