# ADR-053: Worker Health Checks and Stalled Job Monitoring

**Date**: 2026-01-11
**Status**: Proposed

## Context

Our application relies heavily on background workers (BullMQ) for flyer processing, analytics, and emails. If a worker process crashes (e.g., out of memory) or hangs, jobs can remain in the `active` state indefinitely ("stalled") until BullMQ's fail-safe triggers.

Currently, we lack:

1. Visibility into queue depths and worker status via HTTP endpoints (for uptime monitors).
2. A mechanism to detect whether the worker process itself is alive, beyond queue statistics alone.
3. Explicit configuration to ensure stalled jobs are recovered quickly.

## Decision

We will implement a multi-layered health check strategy for background workers:

1. **Queue Metrics Endpoint**: Expose a protected endpoint `GET /health/queues` that returns the counts (waiting, active, failed) for all critical queues.
2. **Stalled Job Configuration**: Explicitly configure BullMQ workers with aggressive stall-detection settings to recover quickly from crashes.
3. **Worker Heartbeats**: Workers will periodically update a "heartbeat" key in Redis. The health endpoint will check that this timestamp is recent.

## Implementation

### 1. BullMQ Worker Settings

Workers must be initialized with specific options to handle stalls:

```typescript
import { WorkerOptions } from 'bullmq';

const workerOptions: Partial<WorkerOptions> = {
  // Check for stalled jobs every 30 seconds
  stalledInterval: 30000,
  // Fail the job after 3 stalls (prevents a job that repeatedly
  // crashes the worker, e.g. via an infinite loop, from being retried forever)
  maxStalledCount: 3,
  // Duration of the job lock in milliseconds. If the worker does not
  // renew the lock (e.g. after a crash), the job is considered stalled.
  lockDuration: 30000,
};
```
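For context, a minimal sketch of how these options would be attached to a worker; the queue name, processor body, and connection details here are illustrative assumptions, not our actual configuration:

```typescript
import { Worker } from 'bullmq';

// Illustrative connection; real settings come from our environment config.
const connection = { host: 'localhost', port: 6379 };

// Hypothetical processor for the flyer queue.
const flyerWorker = new Worker(
  'flyerQueue',
  async (job) => {
    // ... process the flyer job ...
  },
  // Spread in the stall-handling options defined above.
  { connection, ...workerOptions },
);
```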
### 2. Health Endpoint Logic

The `/health/queues` endpoint will:

1. Iterate through all defined queues (`flyerQueue`, `emailQueue`, etc.).
2. Fetch job counts (`waiting`, `active`, `failed`, `delayed`).
3. Return 200 OK if the queues are accessible, or 503 if Redis is unreachable.
4. (Future) Return 500 if the `waiting` count exceeds a critical threshold for too long.
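A minimal sketch of such a handler, assuming Express and BullMQ's `getJobCounts`; the queue registry and connection details are illustrative, and the real implementation would reuse our existing `Queue` instances rather than constructing new ones:

```typescript
import { Router } from 'express';
import { Queue } from 'bullmq';

// Illustrative queue registry; names match the queues mentioned above.
const connection = { host: 'localhost', port: 6379 };
const queues: Record<string, Queue> = {
  flyerQueue: new Queue('flyerQueue', { connection }),
  emailQueue: new Queue('emailQueue', { connection }),
};

export const healthRouter = Router();

healthRouter.get('/health/queues', async (_req, res) => {
  try {
    const counts: Record<string, Record<string, number>> = {};
    for (const [name, queue] of Object.entries(queues)) {
      // getJobCounts returns an object such as { waiting, active, failed, delayed }
      counts[name] = await queue.getJobCounts(
        'waiting',
        'active',
        'failed',
        'delayed',
      );
    }
    res.status(200).json({ status: 'ok', queues: counts });
  } catch (err) {
    // Redis unreachable (or another queue error): report unhealthy
    res.status(503).json({ status: 'unavailable' });
  }
});
```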
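### 3. Worker Heartbeats

The third layer named in the decision could look like the following sketch, assuming `ioredis`; the key name, refresh interval, and TTL are illustrative assumptions, not settled values:

```typescript
import { Redis } from 'ioredis';

const redis = new Redis(); // illustrative: connects to localhost:6379

// Hypothetical heartbeat key; one key per worker type.
const HEARTBEAT_KEY = 'worker:flyer:heartbeat';

// Refresh the heartbeat every 15 seconds. The 60-second TTL means the
// key disappears on its own if the worker dies and stops refreshing it.
setInterval(() => {
  redis
    .set(HEARTBEAT_KEY, Date.now().toString(), 'EX', 60)
    .catch((err) => console.error('Failed to write heartbeat', err));
}, 15_000);
```

The health endpoint can then compare the stored timestamp against the current time and report the worker as down when the heartbeat goes stale or the key is missing.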
## Consequences

**Positive**:

- Early detection of stuck processing pipelines.
- Automatic recovery of stalled jobs via BullMQ configuration.
- Metrics available for external monitoring tools (e.g., UptimeRobot, Datadog).

**Negative**:

- Requires configuring external monitoring to poll the new endpoint.