# ADR-038: Graceful Shutdown Pattern
**Date**: 2026-01-09
**Status**: Accepted
**Implemented**: 2026-01-09
## Context
When deploying or restarting the application, abrupt termination can cause:
1. **Lost Jobs**: BullMQ jobs in progress may be marked as failed or stalled.
2. **Connection Leaks**: Database and Redis connections may not be properly closed.
3. **Incomplete Requests**: HTTP requests in flight may receive no response.
4. **Data Corruption**: Transactions may be left in an inconsistent state.

Kubernetes and PM2 send termination signals (SIGTERM, SIGINT) to processes before forcefully killing them. The application must handle these signals to shut down gracefully.
## Decision
We will implement a coordinated graceful shutdown pattern that:
1. **Stops Accepting New Work**: Closes HTTP server, pauses job queues.
2. **Completes In-Flight Work**: Waits for active requests and jobs to finish.
3. **Releases Resources**: Closes database pools, Redis connections, and queues.
4. **Logs Shutdown Progress**: Provides visibility into the shutdown process.
### Signal Handling
| Signal | Source | Behavior |
| ------- | ------------------ | --------------------------------------- |
| SIGTERM | Kubernetes, PM2 | Graceful shutdown with resource cleanup |
| SIGINT | Ctrl+C in terminal | Same as SIGTERM |
| SIGKILL | Force kill | Cannot be caught; immediate termination |
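The catchability column can be checked in-process. A stand-alone sketch (separate from the application code) that installs a SIGTERM handler and signals itself:

```typescript
// Demo: SIGTERM is catchable; a handler runs before the process decides
// how to exit. (SIGKILL, by contrast, cannot have a handler installed.)
async function demoSigterm(): Promise<string[]> {
  const events: string[] = [];
  process.once('SIGTERM', () => events.push('cleanup-done'));
  process.kill(process.pid, 'SIGTERM'); // signal ourselves
  // Signal delivery is asynchronous; give the event loop a moment.
  await new Promise((resolve) => setTimeout(resolve, 50));
  return events;
}
```

Without the handler, the default SIGTERM disposition terminates the process; with it, cleanup code gets a chance to run first.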
## Implementation Details
### Queue and Worker Shutdown
Located in `src/services/queueService.server.ts`:
```typescript
import { logger } from './logger.server';

export const gracefulShutdown = async (signal: string): Promise<void> => {
  logger.info(`[Shutdown] Received ${signal}. Closing all queues and workers...`);
  // Close the queues first; they still need the shared Redis connection.
  const resources = [
    { name: 'flyerQueue', close: () => flyerQueue.close() },
    { name: 'emailQueue', close: () => emailQueue.close() },
    { name: 'analyticsQueue', close: () => analyticsQueue.close() },
    { name: 'weeklyAnalyticsQueue', close: () => weeklyAnalyticsQueue.close() },
    { name: 'cleanupQueue', close: () => cleanupQueue.close() },
    { name: 'tokenCleanupQueue', close: () => tokenCleanupQueue.close() },
  ];
  const results = await Promise.allSettled(
    resources.map(async (resource) => {
      try {
        await resource.close();
        logger.info(`[Shutdown] ${resource.name} closed successfully.`);
      } catch (error) {
        logger.error({ err: error }, `[Shutdown] Error closing ${resource.name}`);
        throw error; // surfaced below via allSettled
      }
    }),
  );
  const failures = results.filter((r) => r.status === 'rejected');
  if (failures.length > 0) {
    logger.error(`[Shutdown] ${failures.length} resource(s) failed to close.`);
  }
  // Quit Redis only after every queue has released it.
  try {
    await connection.quit();
    logger.info('[Shutdown] redisConnection closed successfully.');
  } catch (error) {
    logger.error({ err: error }, '[Shutdown] Error closing redisConnection');
  }
  logger.info('[Shutdown] Queue shutdown complete. Process can now exit.');
};
// Signal handlers are registered once, in server.ts; registering them here as
// well would run the shutdown sequence twice on a single SIGTERM.
```
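One caveat with the sequence above: a single hung `close()` stalls shutdown indefinitely. A small timeout guard can bound each step; `withTimeout` below is a hypothetical helper sketch, not existing project code:

```typescript
// Hypothetical helper: reject if a shutdown step exceeds its time budget.
// (The losing timer simply fires into an already-settled race.)
function withTimeout<T>(promise: Promise<T>, ms: number, label: string): Promise<T> {
  return Promise.race([
    promise,
    new Promise<never>((_, reject) =>
      setTimeout(() => reject(new Error(`${label} timed out after ${ms}ms`)), ms),
    ),
  ]);
}
```

Inside the resource loop this becomes `await withTimeout(resource.close(), 5000, resource.name)`, so one stuck queue fails loudly instead of blocking the rest.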
### HTTP Server Shutdown
Located in `server.ts`:
```typescript
import { gracefulShutdown as shutdownQueues } from './src/services/queueService.server';
import { closePool } from './src/services/db/connection.db';

const server = app.listen(PORT, () => {
  logger.info(`Server listening on port ${PORT}`);
  // With PM2's wait_ready enabled, signal readiness once listening.
  if (typeof process.send === 'function') process.send('ready');
});

const gracefulShutdown = async (signal: string): Promise<void> => {
  logger.info(`[Shutdown] Received ${signal}. Starting graceful shutdown...`);
  try {
    // 1. Stop accepting new connections. The close callback fires only
    //    after in-flight requests have completed.
    // 2. Race it against a 5-second grace period so lingering
    //    connections cannot stall the shutdown.
    await new Promise<void>((resolve) => {
      server.close((err) => {
        if (err) {
          logger.error({ err }, '[Shutdown] Error closing HTTP server');
        } else {
          logger.info('[Shutdown] HTTP server closed.');
        }
        resolve();
      });
      setTimeout(resolve, 5000).unref();
    });
    // 3. Close queues, workers, and the Redis connection
    await shutdownQueues(signal);
    // 4. Close database pool
    await closePool();
    logger.info('[Shutdown] Database pool closed.');
    // 5. Exit process
    process.exit(0);
  } catch (error) {
    logger.error({ err: error }, '[Shutdown] Shutdown failed; exiting with error.');
    process.exit(1);
  }
};

process.on('SIGTERM', () => void gracefulShutdown('SIGTERM'));
process.on('SIGINT', () => void gracefulShutdown('SIGINT'));
```
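Worth noting: `server.close()` itself waits for in-flight connections; its callback fires only once every open connection has ended, while new ones are refused. A self-contained sketch against a toy `http` server:

```typescript
import http from 'node:http';
import type { AddressInfo } from 'node:net';

// Toy demonstration of server.close() semantics; not application code.
async function demoClose(): Promise<string[]> {
  const events: string[] = [];
  const server = http.createServer((_req, res) => res.end('ok'));
  await new Promise<void>((resolve) => server.listen(0, resolve));
  const { port } = server.address() as AddressInfo;
  const body = await new Promise<string>((resolve) => {
    // agent: false → one-off connection, so the server can drain fully.
    http.get({ port, agent: false }, (res) => {
      let data = '';
      res.on('data', (chunk) => (data += chunk));
      res.on('end', () => resolve(data));
    });
  });
  events.push(`response:${body}`);
  // close() invokes its callback only after all connections have ended.
  await new Promise<void>((resolve, reject) =>
    server.close((err) => (err ? reject(err) : resolve())),
  );
  events.push('closed');
  return events;
}
```

Keep-alive connections can delay the drain, which is why a grace-period race is still a sensible backstop.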
### Database Pool Shutdown
Located in `src/services/db/connection.db.ts`:
```typescript
import { Pool } from 'pg';
import { logger } from '../logger.server';

let pool: Pool | null = null;

export function getPool(): Pool {
  if (!pool) {
    pool = new Pool({
      max: 20,
      idleTimeoutMillis: 30000,
      connectionTimeoutMillis: 2000,
    });
  }
  return pool;
}

export async function closePool(): Promise<void> {
  if (pool) {
    await pool.end();
    pool = null;
    logger.info('[Database] Connection pool closed.');
  }
}

export function getPoolStatus(): { totalCount: number; idleCount: number; waitingCount: number } {
  const p = getPool();
  return {
    totalCount: p.totalCount,
    idleCount: p.idleCount,
    waitingCount: p.waitingCount,
  };
}
```
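The shape above (lazy singleton plus idempotent `closePool`) generalizes to any closable resource. A self-contained sketch with a stand-in resource (`FakeResource` is illustrative only, not project code):

```typescript
// Stand-in for pg.Pool: anything with an async end() fits this pattern.
class FakeResource {
  closed = false;
  async end(): Promise<void> {
    this.closed = true;
  }
}

let resource: FakeResource | null = null;

function getResource(): FakeResource {
  if (!resource) resource = new FakeResource(); // lazy creation on first use
  return resource;
}

async function closeResource(): Promise<void> {
  if (resource) {
    await resource.end();
    resource = null; // a later getResource() starts a fresh instance
  }
}
```

Resetting to `null` makes the close idempotent and lets a later call recreate the resource, which matters for test suites that open and close the pool repeatedly.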
### PM2 Ecosystem Configuration
Located in `ecosystem.config.cjs`:
```javascript
module.exports = {
  apps: [
    {
      name: 'flyer-crawler-api',
      script: 'server.ts',
      interpreter: 'tsx',
      // Graceful shutdown settings
      kill_timeout: 10000, // 10 seconds to cleanup before SIGKILL
      wait_ready: true, // Wait for 'ready' signal before considering app started
      listen_timeout: 10000, // Timeout for ready signal
      // Cluster mode for zero-downtime reloads
      instances: 1,
      exec_mode: 'fork',
      // Environment variables
      env_production: {
        NODE_ENV: 'production',
        PORT: 3000,
      },
      env_test: {
        NODE_ENV: 'test',
        PORT: 3001,
      },
    },
  ],
};
```
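One wiring detail: `wait_ready: true` only works if the app actually emits the signal. Under PM2 the process has an IPC channel, and the server should call `process.send('ready')` once it is listening; `notifyReady` below is a hypothetical guarded helper:

```typescript
// process.send exists only when the process was spawned with an IPC
// channel (as PM2 does); guard so `node server.ts` still runs standalone.
export function notifyReady(): boolean {
  if (typeof process.send === 'function') {
    process.send('ready'); // PM2 treats this as the wait_ready signal
    return true;
  }
  return false; // no IPC channel: not running under PM2
}
```

Call it from the `app.listen` callback; if the signal never arrives, PM2 waits up to `listen_timeout` before giving up on it.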
### Worker Graceful Shutdown
BullMQ workers can be configured to wait for active jobs:
```typescript
import { Worker } from 'bullmq';

const worker = new Worker('flyerQueue', processor, {
  connection,
  // Graceful shutdown tuning (top-level WorkerOptions in BullMQ)
  lockDuration: 30000, // Time before an active job is considered stalled
  stalledInterval: 5000, // Check for stalled jobs every 5s
});

// worker.close() waits for active jobs to finish before resolving; call it
// explicitly during shutdown rather than relying on the Redis connection
// being torn down underneath the worker.
worker.on('closing', () => {
  logger.info('[Worker] flyerQueue worker is closing...');
});
worker.on('closed', () => {
  logger.info('[Worker] flyerQueue worker closed.');
});
```
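The wait-for-active-jobs behavior that `worker.close()` provides can be illustrated with a minimal stand-in; `DrainableWorker` below is a sketch of the idea, not BullMQ's actual implementation:

```typescript
// Minimal sketch: track active jobs, refuse new work while draining,
// and resolve close() only once the last active job finishes.
class DrainableWorker {
  private active = 0;
  private closing = false;

  async run(job: () => Promise<void>): Promise<boolean> {
    if (this.closing) return false; // no new work once draining starts
    this.active++;
    try {
      await job();
    } finally {
      this.active--;
    }
    return true;
  }

  async close(pollMs = 10): Promise<void> {
    this.closing = true;
    while (this.active > 0) {
      await new Promise((resolve) => setTimeout(resolve, pollMs));
    }
  }
}
```

The same two-phase shape (stop intake, then drain) is what the HTTP server and the queues implement with their own primitives.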
### Shutdown Sequence Diagram
```text
SIGTERM Received
          │
          ▼
┌──────────────────────┐
│ Stop HTTP Server     │ ← No new connections accepted
│ (server.close())     │
└──────────────────────┘
          │
          ▼
┌──────────────────────┐
│ Wait for In-Flight   │ ← 5-second grace period
│ Requests             │
└──────────────────────┘
          │
          ▼
┌──────────────────────┐
│ Close BullMQ Queues  │ ← Stop processing new jobs
│ and Workers          │
└──────────────────────┘
          │
          ▼
┌──────────────────────┐
│ Close Redis          │ ← Disconnect from Redis
│ Connection           │
└──────────────────────┘
          │
          ▼
┌──────────────────────┐
│ Close Database Pool  │ ← Release all DB connections
│ (pool.end())         │
└──────────────────────┘
          │
          ▼
┌──────────────────────┐
│ process.exit(0)      │ ← Clean exit
└──────────────────────┘
```
## Consequences
### Positive
- **Minimal Lost Work**: In-flight requests and jobs are given time to complete before shutdown, bounded by the grace period.
- **Clean Resource Cleanup**: All connections are properly closed.
- **Zero-Downtime Deploys**: PM2 can reload without dropping requests.
- **Observability**: Shutdown progress is logged for debugging.
### Negative
- **Shutdown Delay**: A full shutdown takes roughly 5-15 seconds.
- **Complexity**: Multiple shutdown handlers must be coordinated.
- **Edge Cases**: Very long-running jobs may be killed if they exceed the grace period.
## Key Files
- `server.ts` - HTTP server shutdown and signal handling
- `src/services/queueService.server.ts` - Queue shutdown (`gracefulShutdown`)
- `src/services/db/connection.db.ts` - Database pool shutdown (`closePool`)
- `ecosystem.config.cjs` - PM2 configuration with `kill_timeout`
## Related ADRs
- [ADR-006](./0006-background-job-processing-and-task-queues.md) - Background Job Processing
- [ADR-020](./0020-health-checks-and-liveness-readiness-probes.md) - Health Checks
- [ADR-014](./0014-containerization-and-deployment-strategy.md) - Containerization