ADR-053: Worker Health Checks and Stalled Job Monitoring

Date: 2026-01-11

Status: Accepted (Fully Implemented)

Implementation Status:

  • BullMQ worker stall configuration (complete)
  • Basic health endpoints (/live, /ready, /redis, etc.) (complete)
  • /health/queues endpoint (complete)
  • Worker heartbeat mechanism (complete)

Context

Our application relies heavily on BullMQ background workers for flyer processing, analytics, and email. If a worker process crashes (e.g., out of memory) or hangs, its jobs can sit in the 'active' state ("stalled") until BullMQ's stall-detection fail-safe recovers them.

Currently, we lack:

  1. Visibility into queue depths and worker status via HTTP endpoints (for uptime monitors).
  2. A mechanism to detect if the worker process itself is alive, beyond just queue statistics.
  3. Explicit configuration to ensure stalled jobs are recovered quickly.

Decision

We will implement a multi-layered health check strategy for background workers:

  1. Queue Metrics Endpoint: Expose a protected endpoint GET /health/queues that returns the counts (waiting, active, failed) for all critical queues.
  2. Stalled Job Configuration: Explicitly configure BullMQ workers with aggressive stall detection settings to recover quickly from crashes.
  3. Worker Heartbeats: Workers will periodically update a "heartbeat" key in Redis. The health endpoint will check if this timestamp is recent.

Implementation

1. BullMQ Worker Settings

Workers must be initialized with specific options to handle stalls:

import type { WorkerOptions } from 'bullmq';

// Stall-handling settings shared by all workers
const workerOptions: Partial<WorkerOptions> = {
  // Check for stalled jobs every 30 seconds
  stalledInterval: 30000,
  // Fail a job after it has stalled 3 times (prevents a crash-looping job from retrying forever)
  maxStalledCount: 3,
  // Duration of the job lock in milliseconds.
  // If the worker does not renew it (e.g. the process crashed), the job is marked as stalled.
  lockDuration: 30000,
};
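
For reference, a worker would pull these options in roughly as follows. This is a minimal sketch: the 'flyer' queue name, the inline processor, and the dedicated IORedis connection are illustrative, not the actual wiring in src/config/workerOptions.ts.

import { Worker } from 'bullmq';
import IORedis from 'ioredis';

// BullMQ workers need a blocking-capable connection with maxRetriesPerRequest disabled
const connection = new IORedis({ maxRetriesPerRequest: null });

const flyerWorker = new Worker(
  'flyer',
  async (job) => {
    // ... process the flyer job ...
  },
  { connection, ...workerOptions },
);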

2. Health Endpoint Logic

The /health/queues endpoint will:

  1. Iterate through all defined queues (flyerQueue, emailQueue, etc.).
  2. Fetch job counts (waiting, active, failed, delayed).
  3. Return a 200 OK if queues are accessible, or 503 if Redis is unreachable.
  4. (Future) Return 500 if the waiting count exceeds a critical threshold for too long.
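
A condensed sketch of that handler, assuming an Express-style router and a queue registry exported from src/services/queues.server.ts. The allQueues name, import path, and route wiring are illustrative, and the worker heartbeat check added later is omitted here for brevity.

import { Router } from 'express';
import { allQueues } from '../services/queues.server'; // assumed: a name -> Queue record

const healthRouter = Router();

healthRouter.get('/health/queues', async (_req, res) => {
  try {
    const queues: Record<string, Record<string, number>> = {};
    for (const [name, queue] of Object.entries(allQueues)) {
      // getJobCounts() returns counts for the requested job states
      queues[name] = await queue.getJobCounts('waiting', 'active', 'failed', 'delayed');
    }
    res.status(200).json({ status: 'healthy', timestamp: new Date().toISOString(), queues });
  } catch (err) {
    // Redis unreachable or a queue call failed
    res.status(503).json({ status: 'unhealthy', error: (err as Error).message });
  }
});

export default healthRouter;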

Consequences

Positive:

  • Early detection of stuck processing pipelines.
  • Automatic recovery of stalled jobs via BullMQ configuration.
  • Metrics available for external monitoring tools (e.g., UptimeRobot, Datadog).

Negative:

  • Requires configuring external monitoring to poll the new endpoint.

Implementation Notes

Completed (2026-01-11)

  1. BullMQ Stall Configuration - src/config/workerOptions.ts

    • All workers use defaultWorkerOptions with:
      • stalledInterval: 30000 (30s)
      • maxStalledCount: 3
      • lockDuration: 30000 (30s)
    • Applied to all 9 workers: flyer, email, analytics, cleanup, weekly-analytics, token-cleanup, receipt, expiry-alert, barcode
  2. Basic Health Endpoints - src/routes/health.routes.ts

    • /health/live - Liveness probe
    • /health/ready - Readiness probe (checks DB, Redis, storage)
    • /health/startup - Startup probe
    • /health/redis - Redis connectivity
    • /health/db-pool - Database connection pool status

Implementation Completed (2026-01-26)

  1. /health/queues Endpoint

    • Added route to src/routes/health.routes.ts:511-674

    • Iterates through all 9 queues from src/services/queues.server.ts

    • Fetches job counts using BullMQ Queue API: getJobCounts()

    • Returns a structured response including both queue metrics and worker heartbeats:

      {
        status: 'healthy' | 'unhealthy',
        timestamp: string,
        queues: {
          [queueName]: {
            waiting: number,
            active: number,
            failed: number,
            delayed: number
          }
        },
        workers: {
          [workerName]: {
            alive: boolean,
            lastSeen?: string,
            pid?: number,
            host?: string
          }
        }
      }
      
    • Returns 200 OK if all healthy, 503 if any queue/worker unavailable

    • Full OpenAPI documentation included

  2. Worker Heartbeat Mechanism

    • Added updateWorkerHeartbeat() and startWorkerHeartbeat() in src/services/workers.server.ts:100-149
    • Key pattern: worker:heartbeat:<worker-name>
    • Stores: { timestamp: ISO8601, pid: number, host: string }
    • Updates every 30s with 90s TTL
    • Integrated with /health/queues endpoint (checks if heartbeat < 60s old)
    • Heartbeat intervals are cleaned up in closeWorkers() and gracefulShutdown() (a sketch of the heartbeat pattern follows this list)
  3. Comprehensive Tests

    • Added 5 test cases in src/routes/health.routes.test.ts:623-858
    • Tests cover: healthy state, queue failures, stale heartbeats, missing heartbeats, Redis errors
    • All tests follow existing patterns with proper mocking
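
The heartbeat mechanism from item 2 above is a simple set-with-TTL pattern. The sketch below assumes a shared ioredis client; the redis parameter and hostname handling are illustrative, while the function names, key pattern, payload fields, and 30s/90s timings match the notes above.

import os from 'node:os';
import type Redis from 'ioredis';

const HEARTBEAT_INTERVAL_MS = 30_000; // refresh every 30s
const HEARTBEAT_TTL_SECONDS = 90;     // key expires if the worker stops refreshing it

async function updateWorkerHeartbeat(redis: Redis, workerName: string): Promise<void> {
  const payload = JSON.stringify({
    timestamp: new Date().toISOString(),
    pid: process.pid,
    host: os.hostname(),
  });
  // The TTL ensures a crashed worker's heartbeat disappears instead of going stale forever
  await redis.set(`worker:heartbeat:${workerName}`, payload, 'EX', HEARTBEAT_TTL_SECONDS);
}

function startWorkerHeartbeat(redis: Redis, workerName: string): NodeJS.Timeout {
  // Write immediately, then keep refreshing; the caller clears the interval on shutdown
  void updateWorkerHeartbeat(redis, workerName);
  return setInterval(() => void updateWorkerHeartbeat(redis, workerName), HEARTBEAT_INTERVAL_MS);
}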

Future Enhancements (Not Implemented)

  1. Queue Depth Alerting (Low Priority)
    • Add configurable thresholds per queue type
    • Return 500 if the waiting count exceeds the threshold for an extended period
    • Consider using Redis for storing threshold breach timestamps
    • Estimate: 1-2 hours
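
Since this enhancement is not implemented, the following is only one possible shape for the check; the threshold values, breach-marker key, and grace period are placeholders.

import type { Queue } from 'bullmq';
import type Redis from 'ioredis';

// Placeholder thresholds per queue; real values would live in configuration
const WAITING_THRESHOLDS: Record<string, number> = { flyer: 500, email: 1000 };
const BREACH_GRACE_MS = 10 * 60 * 1000; // only alert after 10 minutes above threshold

async function isQueueBacklogged(redis: Redis, queue: Queue): Promise<boolean> {
  const { waiting } = await queue.getJobCounts('waiting');
  const threshold = WAITING_THRESHOLDS[queue.name];
  const breachKey = `queue:breach:${queue.name}`;

  if (threshold === undefined || waiting < threshold) {
    await redis.del(breachKey); // back under threshold: clear any breach marker
    return false;
  }
  // Record when the breach started; only report unhealthy once it has persisted
  const startedAt = Number(await redis.get(breachKey));
  if (!startedAt) {
    await redis.set(breachKey, Date.now().toString());
    return false;
  }
  return Date.now() - startedAt > BREACH_GRACE_MS;
}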