# ADR-053: Worker Health Checks and Stalled Job Monitoring

**Date**: 2026-01-11

**Status**: Proposed

## Context

Our application relies heavily on background workers (BullMQ) for flyer processing, analytics, and emails. If a worker process crashes (e.g., an out-of-memory kill) or hangs, its jobs can remain in the `active` state indefinitely ("stalled") until BullMQ's stalled-job detection intervenes.

Currently, we lack:

1. Visibility into queue depths and worker status via HTTP endpoints (for uptime monitors).
2. A mechanism to detect whether the worker process itself is alive, beyond queue statistics alone.
3. Explicit configuration to ensure stalled jobs are recovered quickly.

## Decision

We will implement a multi-layered health check strategy for background workers:

1. **Queue Metrics Endpoint**: Expose a protected endpoint `GET /health/queues` that returns job counts (waiting, active, failed, delayed) for all critical queues.
2. **Stalled Job Configuration**: Explicitly configure BullMQ workers with aggressive stall detection settings to recover quickly from crashes.
3. **Worker Heartbeats**: Workers will periodically update a "heartbeat" key in Redis. The health endpoint will check that this timestamp is recent (see section 3 under Implementation below).

## Implementation

### 1. BullMQ Worker Settings

Workers must be initialized with specific options to handle stalls:

```typescript
import { WorkerOptions } from 'bullmq';

const workerOptions: Partial<WorkerOptions> = {
  // Check for stalled jobs every 30 seconds
  stalledInterval: 30000,
  // Fail the job after 3 stalls (prevents a crash-looping job from retrying forever)
  maxStalledCount: 3,
  // Duration of the job lock in milliseconds. If the worker doesn't renew it
  // in time (e.g., after a crash), the job is considered stalled.
  lockDuration: 30000,
};
```
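
For orientation, a minimal sketch of how these options would be passed when constructing a worker; the queue name `flyers`, the `processFlyer` handler, and the inline Redis connection are placeholders, not settled names:

```typescript
import { Job, Worker } from 'bullmq';

// Hypothetical processor; the real job handlers live elsewhere in the codebase.
async function processFlyer(job: Job): Promise<void> {
  // ... download and parse the flyer referenced by job.data ...
}

const flyerWorker = new Worker('flyers', processFlyer, {
  connection: { host: 'localhost', port: 6379 }, // assumed Redis connection
  ...workerOptions,
});
```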

### 2. Health Endpoint Logic

The `/health/queues` endpoint will:

1. Iterate through all defined queues (`flyerQueue`, `emailQueue`, etc.).
2. Fetch job counts (`waiting`, `active`, `failed`, `delayed`).
3. Return 200 OK if the queues are accessible, or 503 Service Unavailable if Redis is unreachable.
4. (Future) Return 500 if the `waiting` count exceeds a critical threshold for too long.
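
A minimal sketch of this route, assuming an Express app; the `./queues` module path is a placeholder, and the auth protection mentioned in the decision is omitted for brevity:

```typescript
import { Router } from 'express';
import type { Queue } from 'bullmq';

// Placeholder import; the actual queue instances live elsewhere in the codebase.
import { flyerQueue, emailQueue } from './queues';

const queues: Queue[] = [flyerQueue, emailQueue];

export const healthRouter = Router();

healthRouter.get('/health/queues', async (_req, res) => {
  try {
    const stats = await Promise.all(
      queues.map(async (queue) => ({
        name: queue.name,
        counts: await queue.getJobCounts('waiting', 'active', 'failed', 'delayed'),
      })),
    );
    res.status(200).json({ status: 'ok', queues: stats });
  } catch {
    // Redis unreachable (or another queue error): report the service as unavailable.
    res.status(503).json({ status: 'unavailable' });
  }
});
```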
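
### 3. Worker Heartbeats

To cover decision point 3, each worker periodically writes a timestamp to Redis, and the health check compares it against a staleness threshold. A minimal sketch, assuming an ioredis connection; the key name, interval, and thresholds are placeholders, not settled values:

```typescript
import IORedis from 'ioredis';

// Assumed connection; in practice, reuse the application's Redis configuration.
const redis = new IORedis();

const HEARTBEAT_KEY = 'worker:heartbeat'; // hypothetical key name
const HEARTBEAT_INTERVAL_MS = 15_000;     // placeholder interval

// Refresh the heartbeat periodically. The 60-second TTL means a crashed
// worker's key expires on its own, so a missing key also reads as "down".
setInterval(() => {
  void redis.set(HEARTBEAT_KEY, Date.now().toString(), 'EX', 60);
}, HEARTBEAT_INTERVAL_MS);

// Used by the health endpoint: the worker counts as alive if the heartbeat is recent.
export async function isWorkerAlive(maxAgeMs = 60_000): Promise<boolean> {
  const ts = await redis.get(HEARTBEAT_KEY);
  return ts !== null && Date.now() - Number(ts) < maxAgeMs;
}
```

The health endpoint can fold `isWorkerAlive()` into its response, or expose it as a sibling check.
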
## Consequences

**Positive**:

- Early detection of stuck processing pipelines.
- Automatic recovery of stalled jobs via BullMQ configuration.
- Metrics available for external monitoring tools (e.g., UptimeRobot, Datadog).

**Negative**:

- Requires configuring external monitoring to poll the new endpoint.