# ADR-053: Worker Health Checks and Stalled Job Monitoring

Date: 2026-01-11

Status: Proposed

## Context

Our application relies heavily on background workers (BullMQ) for flyer processing, analytics, and email delivery. If a worker process crashes (e.g., out of memory) or hangs, jobs may remain in the `active` state indefinitely ("stalled") until BullMQ's stall-detection fail-safe triggers.

Currently, we lack:

  1. Visibility into queue depths and worker status via HTTP endpoints (for uptime monitors).
  2. A mechanism to detect if the worker process itself is alive, beyond just queue statistics.
  3. Explicit configuration to ensure stalled jobs are recovered quickly.

## Decision

We will implement a multi-layered health check strategy for background workers:

  1. **Queue Metrics Endpoint:** Expose a protected endpoint, `GET /health/queues`, that returns job counts (waiting, active, failed, delayed) for all critical queues.
  2. **Stalled Job Configuration:** Explicitly configure BullMQ workers with aggressive stall-detection settings so jobs are recovered quickly after a crash.
  3. **Worker Heartbeats:** Workers will periodically update a "heartbeat" key in Redis, and the health endpoint will check whether that timestamp is recent (see the sketch after this list).
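
The heartbeat mechanism is not specified further in this ADR; the following is a minimal sketch, assuming ioredis as the Redis client and a hypothetical key name and interval (`worker:heartbeat:flyer-worker`, 15 s), just to illustrate the shape of the approach:

```typescript
import IORedis from 'ioredis';

// Hypothetical key name and interval; neither is settled in this ADR.
const HEARTBEAT_KEY = 'worker:heartbeat:flyer-worker';
const HEARTBEAT_INTERVAL_MS = 15_000;

const redis = new IORedis(process.env.REDIS_URL ?? 'redis://localhost:6379');

// The worker process refreshes its heartbeat on a timer. The PX expiry means the
// key disappears on its own shortly after the process crashes or hangs.
setInterval(() => {
  redis
    .set(HEARTBEAT_KEY, Date.now().toString(), 'PX', HEARTBEAT_INTERVAL_MS * 3)
    .catch((err) => console.error('Heartbeat update failed:', err));
}, HEARTBEAT_INTERVAL_MS);

// The health endpoint can treat a missing or stale timestamp as "worker down".
export async function isWorkerAlive(maxAgeMs = HEARTBEAT_INTERVAL_MS * 3): Promise<boolean> {
  const value = await redis.get(HEARTBEAT_KEY);
  return value !== null && Date.now() - Number(value) < maxAgeMs;
}
```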

## Implementation

### 1. BullMQ Worker Settings

Workers must be initialized with specific options to handle stalls:

```typescript
const workerOptions = {
  // Check for stalled jobs every 30 seconds.
  stalledInterval: 30000,
  // Mark the job as failed after it has stalled 3 times
  // (prevents a job that keeps crashing the worker from retrying forever).
  maxStalledCount: 3,
  // Duration of the job lock in milliseconds.
  // If the worker does not renew the lock (e.g. because it crashed), the job is considered stalled.
  lockDuration: 30000,
};
```
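
For context, here is a sketch of how these options might be wired into a worker, spreading the `workerOptions` object above. The queue name (`flyer-processing`), processor body, and connection setup are placeholders, not the app's actual values:

```typescript
import { Worker } from 'bullmq';
import IORedis from 'ioredis';

// Placeholder connection; the real app presumably shares one configured elsewhere.
const connection = new IORedis(process.env.REDIS_URL ?? 'redis://localhost:6379', {
  maxRetriesPerRequest: null, // required by BullMQ workers
});

const flyerWorker = new Worker(
  'flyer-processing',
  async (job) => {
    // ...actual flyer processing would go here
  },
  { connection, ...workerOptions },
);

// Jobs that exceed maxStalledCount surface here as failures.
flyerWorker.on('failed', (job, err) => {
  console.error(`Job ${job?.id} failed:`, err.message);
});
```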

### 2. Health Endpoint Logic

The `/health/queues` endpoint will (see the sketch after this list):

  1. Iterate through all defined queues (`flyerQueue`, `emailQueue`, etc.).
  2. Fetch job counts (waiting, active, failed, delayed).
  3. Return 200 OK if the queues are accessible, or 503 if Redis is unreachable.
  4. (Future) Return 500 if the waiting count exceeds a critical threshold for too long.
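
A minimal sketch of this handler, assuming Express and BullMQ's `getJobCounts()`. The queue registry, queue names, and response shape are illustrative, and the auth middleware that protects the route is omitted here:

```typescript
import { Router } from 'express';
import { Queue } from 'bullmq';

// Placeholder queue registry; the real app presumably exports its queues from a shared module.
const queues: Record<string, Queue> = {
  flyer: new Queue('flyer-processing'),
  email: new Queue('email'),
};

export const healthRouter = Router();

healthRouter.get('/health/queues', async (_req, res) => {
  try {
    const counts: Record<string, Record<string, number>> = {};
    for (const [name, queue] of Object.entries(queues)) {
      // getJobCounts returns counts keyed by state, e.g. { waiting, active, failed, delayed }.
      counts[name] = await queue.getJobCounts('waiting', 'active', 'failed', 'delayed');
    }
    res.status(200).json({ status: 'ok', queues: counts });
  } catch (err) {
    // Redis unreachable (or another queue error): report unhealthy.
    res.status(503).json({ status: 'unavailable', error: (err as Error).message });
  }
});
```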

## Consequences

**Positive:**

  - Early detection of stuck processing pipelines.
  - Automatic recovery of stalled jobs via BullMQ configuration.
  - Metrics available for external monitoring tools (e.g., UptimeRobot, Datadog).

**Negative:**

  - Requires configuring external monitoring to poll the new endpoint.