# ADR-053: Worker Health Checks and Stalled Job Monitoring

**Date**: 2026-01-11
**Status**: Accepted (Fully Implemented)

**Implementation Status**:
- ✅ BullMQ worker stall configuration (complete)
- ✅ Basic health endpoints (/live, /ready, /redis, etc.)
- ✅ /health/queues endpoint (complete)
- ✅ Worker heartbeat mechanism (complete)

## Context

Our application relies heavily on background workers (BullMQ) for flyer processing, analytics, and emails. If a worker process crashes (e.g., out of memory) or hangs, jobs may remain in the `active` state indefinitely ("stalled") until BullMQ's fail-safe triggers.

Currently, we lack:

1. Visibility into queue depths and worker status via HTTP endpoints (for uptime monitors).
2. A mechanism to detect whether the worker process itself is alive, beyond queue statistics.
3. Explicit configuration to ensure stalled jobs are recovered quickly.

## Decision

We will implement a multi-layered health check strategy for background workers:

1. **Queue Metrics Endpoint**: Expose a protected endpoint `GET /health/queues` that returns the counts (waiting, active, failed) for all critical queues.
2. **Stalled Job Configuration**: Explicitly configure BullMQ workers with aggressive stall detection settings to recover quickly from crashes.
3. **Worker Heartbeats**: Workers will periodically update a "heartbeat" key in Redis. The health endpoint will check whether this timestamp is recent.

## Implementation

### 1. BullMQ Worker Settings

Workers must be initialized with specific options to handle stalls:

```typescript
const workerOptions = {
  // Check for stalled jobs every 30 seconds.
  stalledInterval: 30000,
  // Fail a job after 3 stalls (prevents a repeatedly stalling job from retrying forever).
  maxStalledCount: 3,
  // Duration of the job lock in milliseconds.
  // If the worker doesn't renew the lock (e.g. after a crash), the job is considered stalled.
  lockDuration: 30000,
};
```

### 2. Health Endpoint Logic

The `/health/queues` endpoint will:

1. Iterate through all defined queues (`flyerQueue`, `emailQueue`, etc.).
2. Fetch job counts (`waiting`, `active`, `failed`, `delayed`).
3. Return 200 OK if queues are accessible, or 503 if Redis is unreachable.
4. (Future) Return 500 if the `waiting` count exceeds a critical threshold for too long.
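As a rough illustration of this logic, the sketch below wires the queue counts and the heartbeat check into a single Express handler. It assumes `bullmq` and `ioredis`; the queue names, heartbeat key pattern, and freshness window are placeholders (auth middleware omitted), not the actual code in `src/routes/health.routes.ts`.

```typescript
// Sketch only: queue names, key pattern, and thresholds are illustrative assumptions.
import { Router } from 'express';
import { Queue } from 'bullmq';
import Redis from 'ioredis';

const connection = new Redis(process.env.REDIS_URL ?? 'redis://localhost:6379', {
  maxRetriesPerRequest: null, // required for BullMQ's blocking commands
});

// Placeholder queue registry; the real one lives in src/services/queues.server.ts.
const queues = ['flyer', 'email', 'analytics'].map(
  (name) => new Queue(name, { connection }),
);

const HEARTBEAT_MAX_AGE_MS = 60_000; // a heartbeat older than this marks the worker as down

export const healthRouter = Router();

healthRouter.get('/health/queues', async (_req, res) => {
  try {
    // Steps 1-2: iterate the queues and fetch the job counts we report on.
    const queueCounts: Record<string, Record<string, number>> = {};
    for (const queue of queues) {
      queueCounts[queue.name] = await queue.getJobCounts(
        'waiting', 'active', 'failed', 'delayed',
      );
    }

    // Heartbeats are written by the workers themselves (Decision item 3).
    const workers: Record<string, { alive: boolean; lastSeen?: string }> = {};
    for (const queue of queues) {
      const raw = await connection.get(`worker:heartbeat:${queue.name}`);
      const lastSeen = raw ? (JSON.parse(raw).timestamp as string) : undefined;
      workers[queue.name] = {
        alive: !!lastSeen && Date.now() - Date.parse(lastSeen) < HEARTBEAT_MAX_AGE_MS,
        lastSeen,
      };
    }

    // Step 3: queues were reachable; overall status also reflects worker liveness.
    const healthy = Object.values(workers).every((w) => w.alive);
    res.status(healthy ? 200 : 503).json({
      status: healthy ? 'healthy' : 'unhealthy',
      timestamp: new Date().toISOString(),
      queues: queueCounts,
      workers,
    });
  } catch (err) {
    // Redis unreachable or another infrastructure failure.
    res.status(503).json({ status: 'unhealthy', error: (err as Error).message });
  }
});
```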
## Consequences

**Positive**:
- Early detection of stuck processing pipelines.
- Automatic recovery of stalled jobs via BullMQ configuration.
- Metrics available for external monitoring tools (e.g., UptimeRobot, Datadog).

**Negative**:
- Requires configuring external monitoring to poll the new endpoint.

## Implementation Notes

### Completed (2026-01-11)

1. **BullMQ Stall Configuration** - `src/config/workerOptions.ts`
   - All workers use `defaultWorkerOptions` with:
     - `stalledInterval: 30000` (30s)
     - `maxStalledCount: 3`
     - `lockDuration: 30000` (30s)
   - Applied to all 9 workers: flyer, email, analytics, cleanup, weekly-analytics, token-cleanup, receipt, expiry-alert, barcode

2. **Basic Health Endpoints** - `src/routes/health.routes.ts`
   - `/health/live` - Liveness probe
   - `/health/ready` - Readiness probe (checks DB, Redis, storage)
   - `/health/startup` - Startup probe
   - `/health/redis` - Redis connectivity
   - `/health/db-pool` - Database connection pool status

### Implementation Completed (2026-01-26)

1. **`/health/queues` Endpoint** ✅
   - Added route to `src/routes/health.routes.ts:511-674`
   - Iterates through all 9 queues from `src/services/queues.server.ts`
   - Fetches job counts using the BullMQ Queue API: `getJobCounts()`
   - Returns a structured response including both queue metrics and worker heartbeats:

     ```typescript
     {
       status: 'healthy' | 'unhealthy',
       timestamp: string,
       queues: {
         [queueName]: { waiting: number, active: number, failed: number, delayed: number }
       },
       workers: {
         [workerName]: { alive: boolean, lastSeen?: string, pid?: number, host?: string }
       }
     }
     ```

   - Returns 200 OK if all healthy, 503 if any queue/worker is unavailable
   - Full OpenAPI documentation included

2. **Worker Heartbeat Mechanism** ✅
   - Added `updateWorkerHeartbeat()` and `startWorkerHeartbeat()` in `src/services/workers.server.ts:100-149`
   - Key pattern: `worker:heartbeat:`
   - Stores: `{ timestamp: ISO8601, pid: number, host: string }`
   - Updates every 30s with a 90s TTL
   - Integrated with the `/health/queues` endpoint (checks whether the heartbeat is < 60s old)
   - Heartbeat intervals are properly cleaned up in `closeWorkers()` and `gracefulShutdown()`

3. **Comprehensive Tests** ✅
   - Added 5 test cases in `src/routes/health.routes.test.ts:623-858`
   - Tests cover: healthy state, queue failures, stale heartbeats, missing heartbeats, Redis errors
   - All tests follow existing patterns with proper mocking

### Future Enhancements (Not Implemented)

1. **Queue Depth Alerting** (Low Priority)
   - Add configurable thresholds per queue type
   - Return 500 if the `waiting` count exceeds its threshold for an extended period
   - Consider using Redis for storing threshold breach timestamps (see the sketch after this list)
   - **Estimate**: 1-2 hours
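One possible shape for that threshold check, reusing the `waiting` counts the endpoint already fetches; the per-queue limits, grace period, and Redis key prefix below are hypothetical and not part of the current codebase.

```typescript
import type { Redis } from 'ioredis';

// Hypothetical per-queue limits; real values would come from configuration.
const QUEUE_DEPTH_LIMITS: Record<string, number> = {
  flyer: 500,
  email: 1_000,
};
const BREACH_GRACE_PERIOD_MS = 10 * 60 * 1000; // only escalate after 10 minutes over the limit

// Returns true when `waiting` has exceeded the queue's limit for longer than the grace period.
// Breach start times are stored in Redis under an assumed `queue:breach:` key prefix.
export async function isQueueBacklogged(
  redis: Redis,
  queueName: string,
  waiting: number,
): Promise<boolean> {
  const limit = QUEUE_DEPTH_LIMITS[queueName];
  const breachKey = `queue:breach:${queueName}`;

  if (limit === undefined || waiting <= limit) {
    await redis.del(breachKey); // back under the limit: clear any breach marker
    return false;
  }

  // Record when the breach started; NX keeps the original timestamp on later checks.
  await redis.set(breachKey, Date.now().toString(), 'EX', 24 * 60 * 60, 'NX');
  const since = await redis.get(breachKey);
  return since !== null && Date.now() - Number(since) > BREACH_GRACE_PERIOD_MS;
}
```

The health handler could then return 500 for any backlogged queue, keeping 503 reserved for infrastructure failures such as an unreachable Redis, consistent with the status codes described above.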