# ADR-053: Worker Health Checks and Stalled Job Monitoring

Date: 2026-01-11

Status: Proposed

## Context

Our application relies heavily on background workers (BullMQ) for flyer processing, analytics, and email delivery. If a worker process crashes (e.g., out of memory) or hangs, jobs may remain in the `active` state indefinitely ("stalled") until BullMQ's stall-detection fail-safe triggers.

Currently, we lack:

  1. Visibility into queue depths and worker status via HTTP endpoints (for uptime monitors).
  2. A mechanism to detect if the worker process itself is alive, beyond just queue statistics.
  3. Explicit configuration to ensure stalled jobs are recovered quickly.

## Decision

We will implement a multi-layered health check strategy for background workers:

  1. **Queue Metrics Endpoint:** Expose a protected endpoint, `GET /health/queues`, that returns job counts (waiting, active, failed, delayed) for all critical queues.
  2. **Stalled Job Configuration:** Explicitly configure BullMQ workers with aggressive stall-detection settings so jobs are recovered quickly after a crash.
  3. **Worker Heartbeats:** Workers will periodically update a "heartbeat" key in Redis, and the health endpoint will check whether that timestamp is recent (see the sketch after this list).
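
The heartbeat mechanism is not specified further in this ADR; the following is a minimal sketch, assuming ioredis as the Redis client and a hypothetical key name and interval (`worker:heartbeat:flyer-worker`, 15 s), just to illustrate the shape of the approach:

```typescript
import IORedis from 'ioredis';

// Hypothetical key name and interval; neither is settled in this ADR.
const HEARTBEAT_KEY = 'worker:heartbeat:flyer-worker';
const HEARTBEAT_INTERVAL_MS = 15_000;

const redis = new IORedis(process.env.REDIS_URL ?? 'redis://localhost:6379');

// The worker process refreshes its heartbeat on a timer. The PX expiry means the
// key disappears on its own shortly after the process crashes or hangs.
setInterval(() => {
  redis
    .set(HEARTBEAT_KEY, Date.now().toString(), 'PX', HEARTBEAT_INTERVAL_MS * 3)
    .catch((err) => console.error('Heartbeat update failed:', err));
}, HEARTBEAT_INTERVAL_MS);

// The health endpoint can treat a missing or stale timestamp as "worker down".
export async function isWorkerAlive(maxAgeMs = HEARTBEAT_INTERVAL_MS * 3): Promise<boolean> {
  const value = await redis.get(HEARTBEAT_KEY);
  return value !== null && Date.now() - Number(value) < maxAgeMs;
}
```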

## Implementation

### 1. BullMQ Worker Settings

Workers must be initialized with specific options to handle stalls:

```typescript
const workerOptions = {
  // Check for stalled jobs every 30 seconds.
  stalledInterval: 30000,
  // Mark the job as failed after it has stalled 3 times
  // (prevents a job that keeps crashing the worker from retrying forever).
  maxStalledCount: 3,
  // Duration of the job lock in milliseconds.
  // If the worker does not renew the lock (e.g. because it crashed), the job is considered stalled.
  lockDuration: 30000,
};
```
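
For context, here is a sketch of how these options might be wired into a worker, spreading the `workerOptions` object above. The queue name (`flyer-processing`), processor body, and connection setup are placeholders, not the app's actual values:

```typescript
import { Worker } from 'bullmq';
import IORedis from 'ioredis';

// Placeholder connection; the real app presumably shares one configured elsewhere.
const connection = new IORedis(process.env.REDIS_URL ?? 'redis://localhost:6379', {
  maxRetriesPerRequest: null, // required by BullMQ workers
});

const flyerWorker = new Worker(
  'flyer-processing',
  async (job) => {
    // ...actual flyer processing would go here
  },
  { connection, ...workerOptions },
);

// Jobs that exceed maxStalledCount surface here as failures.
flyerWorker.on('failed', (job, err) => {
  console.error(`Job ${job?.id} failed:`, err.message);
});
```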

### 2. Health Endpoint Logic

The `/health/queues` endpoint will (see the sketch after this list):

  1. Iterate through all defined queues (`flyerQueue`, `emailQueue`, etc.).
  2. Fetch job counts (waiting, active, failed, delayed).
  3. Return 200 OK if the queues are accessible, or 503 if Redis is unreachable.
  4. (Future) Return 500 if the waiting count exceeds a critical threshold for too long.
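
A minimal sketch of this handler, assuming Express and BullMQ's `getJobCounts()`. The queue registry, queue names, and response shape are illustrative, and the auth middleware that protects the route is omitted here:

```typescript
import { Router } from 'express';
import { Queue } from 'bullmq';

// Placeholder queue registry; the real app presumably exports its queues from a shared module.
const queues: Record<string, Queue> = {
  flyer: new Queue('flyer-processing'),
  email: new Queue('email'),
};

export const healthRouter = Router();

healthRouter.get('/health/queues', async (_req, res) => {
  try {
    const counts: Record<string, Record<string, number>> = {};
    for (const [name, queue] of Object.entries(queues)) {
      // getJobCounts returns counts keyed by state, e.g. { waiting, active, failed, delayed }.
      counts[name] = await queue.getJobCounts('waiting', 'active', 'failed', 'delayed');
    }
    res.status(200).json({ status: 'ok', queues: counts });
  } catch (err) {
    // Redis unreachable (or another queue error): report unhealthy.
    res.status(503).json({ status: 'unavailable', error: (err as Error).message });
  }
});
```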

## Consequences

**Positive:**

  - Early detection of stuck processing pipelines.
  - Automatic recovery of stalled jobs via BullMQ configuration.
  - Metrics available for external monitoring tools (e.g., UptimeRobot, Datadog).

**Negative:**

  - Requires configuring external monitoring to poll the new endpoint.