# ADR-038: Graceful Shutdown Pattern

**Date**: 2026-01-09

**Status**: Accepted

**Implemented**: 2026-01-09

## Context

When deploying or restarting the application, abrupt termination can cause:

1. **Lost Jobs**: BullMQ jobs in progress may be marked as failed or stalled.
2. **Connection Leaks**: Database and Redis connections may not be properly closed.
3. **Incomplete Requests**: HTTP requests in flight may receive no response.
4. **Data Corruption**: Transactions may be left in an inconsistent state.

Kubernetes and PM2 send termination signals (SIGTERM, SIGINT) to processes before forcefully killing them. The application must handle these signals to shut down gracefully.

## Decision

We will implement a coordinated graceful shutdown pattern that:

1. **Stops Accepting New Work**: Closes HTTP server, pauses job queues.
2. **Completes In-Flight Work**: Waits for active requests and jobs to finish.
3. **Releases Resources**: Closes database pools, Redis connections, and queues.
4. **Logs Shutdown Progress**: Provides visibility into the shutdown process.

### Signal Handling

| Signal  | Source                                        | Behavior                                |
| ------- | --------------------------------------------- | --------------------------------------- |
| SIGTERM | Kubernetes                                    | Graceful shutdown with resource cleanup |
| SIGINT  | PM2 (default stop signal), Ctrl+C in terminal | Same as SIGTERM                         |
| SIGKILL | Force kill                                    | Cannot be caught; immediate termination |

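Because SIGTERM and SIGINT share one behavior, and a process manager may deliver a signal more than once, it can help to make the handler idempotent. The following is a minimal sketch under that assumption; `runOnce` is a hypothetical helper and is not part of the files described in this ADR.

```typescript
type ShutdownHandler = (signal: string) => Promise<void>;

// Wraps a shutdown handler so that repeated signals reuse the first run
// instead of starting a second, overlapping shutdown.
export function runOnce(handler: ShutdownHandler): ShutdownHandler {
  let inFlight: Promise<void> | null = null;
  return (signal: string): Promise<void> => {
    if (!inFlight) {
      inFlight = handler(signal);
    }
    return inFlight;
  };
}

// Hypothetical wiring; `gracefulShutdown` stands for the handler in server.ts.
// const shutdownOnce = runOnce(gracefulShutdown);
// process.on('SIGTERM', () => void shutdownOnce('SIGTERM'));
// process.on('SIGINT', () => void shutdownOnce('SIGINT'));
```

With this guard, a second Ctrl+C during shutdown returns the promise of the shutdown already in progress rather than closing the same resources again.
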
## Implementation Details

### Queue and Worker Shutdown

Located in `src/services/queueService.server.ts`:

```typescript
import { logger } from './logger.server';

// The queues and the shared Redis `connection` are defined earlier in this file.

export const gracefulShutdown = async (signal: string): Promise<void> => {
  logger.info(`[Shutdown] Received ${signal}. Closing all queues and workers...`);

  const resources = [
    { name: 'flyerQueue', close: () => flyerQueue.close() },
    { name: 'emailQueue', close: () => emailQueue.close() },
    { name: 'analyticsQueue', close: () => analyticsQueue.close() },
    { name: 'weeklyAnalyticsQueue', close: () => weeklyAnalyticsQueue.close() },
    { name: 'cleanupQueue', close: () => cleanupQueue.close() },
    { name: 'tokenCleanupQueue', close: () => tokenCleanupQueue.close() },
    { name: 'redisConnection', close: () => connection.quit() },
  ];

  // allSettled lets every resource attempt to close even if one of them fails.
  const results = await Promise.allSettled(
    resources.map(async (resource) => {
      try {
        await resource.close();
        logger.info(`[Shutdown] ${resource.name} closed successfully.`);
      } catch (error) {
        logger.error({ err: error }, `[Shutdown] Error closing ${resource.name}`);
        throw error;
      }
    }),
  );

  // Report success only when nothing failed to close.
  const failures = results.filter((r) => r.status === 'rejected');
  if (failures.length > 0) {
    logger.error(`[Shutdown] ${failures.length} resources failed to close.`);
  } else {
    logger.info('[Shutdown] All resources closed. Process can now exit.');
  }
};

// Register signal handlers
process.on('SIGTERM', () => gracefulShutdown('SIGTERM'));
process.on('SIGINT', () => gracefulShutdown('SIGINT'));
```

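`Promise.allSettled` waits for every `close()` call, so a single resource that never settles would stall the shutdown until the process is force-killed. One way to bound each step is sketched below; `withTimeout` is a hypothetical helper, and the 3000 ms value in the usage comment is an assumption, not a value taken from the codebase.

```typescript
// Rejects if `work` has not settled within `ms` milliseconds.
export function withTimeout<T>(work: Promise<T>, ms: number, label: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} did not close within ${ms} ms`)), ms);
  });
  // Whichever settles first wins; the timer is always cleared so it cannot
  // keep the event loop alive after the resource has closed.
  return Promise.race([work, timeout]).finally(() => clearTimeout(timer));
}

// Hypothetical usage inside the `resources` array:
// { name: 'flyerQueue', close: () => withTimeout(flyerQueue.close(), 3000, 'flyerQueue') },
```

A timed-out resource then shows up as one of the rejected results and is logged by the existing error path.
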
### HTTP Server Shutdown

Located in `server.ts`:

```typescript
import { gracefulShutdown as shutdownQueues } from './src/services/queueService.server';
import { closePool } from './src/services/db/connection.db';
import { logger } from './src/services/logger.server';

// `app` and `PORT` are defined earlier in this file.
const server = app.listen(PORT, () => {
  logger.info(`Server listening on port ${PORT}`);
});

const gracefulShutdown = async (signal: string): Promise<void> => {
  logger.info(`[Shutdown] Received ${signal}. Starting graceful shutdown...`);

  // 1. Stop accepting new connections
  server.close((err) => {
    if (err) {
      logger.error({ err }, '[Shutdown] Error closing HTTP server');
    } else {
      logger.info('[Shutdown] HTTP server closed.');
    }
  });

  // 2. Give in-flight requests a fixed 5-second grace period
  await new Promise((resolve) => setTimeout(resolve, 5000));

  // 3. Close queues and workers
  await shutdownQueues(signal);

  // 4. Close database pool
  await closePool();
  logger.info('[Shutdown] Database pool closed.');

  // 5. Exit process
  process.exit(0);
};

process.on('SIGTERM', () => gracefulShutdown('SIGTERM'));
process.on('SIGINT', () => gracefulShutdown('SIGINT'));
```

Note: `queueService.server.ts` registers its own SIGTERM/SIGINT handlers at import time, so with both registrations in place the queue shutdown runs twice per signal: once directly and once via `shutdownQueues`. Keeping only the registration in `server.ts` avoids the duplicate close calls.

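The handler above always sleeps for the full 5 seconds, even when no requests are in flight. A possible refinement, sketched here under the assumption that the 5-second value should act as an upper bound rather than a fixed delay, is to wait on `server.close()` itself. `closeHttpServer` is a hypothetical helper name.

```typescript
import type { Server } from 'node:http';

// Resolves to 'drained' when server.close() finishes, or to 'timeout' if
// connections are still open after `graceMs` milliseconds.
export function closeHttpServer(server: Server, graceMs: number): Promise<'drained' | 'timeout'> {
  return new Promise((resolve, reject) => {
    const timer = setTimeout(() => resolve('timeout'), graceMs);
    server.close((err) => {
      clearTimeout(timer);
      if (err) reject(err);
      else resolve('drained');
    });
  });
}

// Hypothetical usage in place of steps 1 and 2:
// const outcome = await closeHttpServer(server, 5000);
// logger.info(`[Shutdown] HTTP server ${outcome}.`);
```

An idle server then shuts down almost immediately, while a busy one still gets the same 5-second ceiling.
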
### Database Pool Shutdown

Located in `src/services/db/connection.db.ts`:

```typescript
// Assumes node-postgres (`pg`), whose Pool exposes the options and counters used below.
import { Pool } from 'pg';
import { logger } from '../logger.server';

let pool: Pool | null = null;

// Lazily creates the shared pool on first use.
export function getPool(): Pool {
  if (!pool) {
    pool = new Pool({
      max: 20,
      idleTimeoutMillis: 30000,
      connectionTimeoutMillis: 2000,
    });
  }
  return pool;
}

// Ends the pool if one exists; safe to call more than once.
export async function closePool(): Promise<void> {
  if (pool) {
    await pool.end();
    pool = null;
    logger.info('[Database] Connection pool closed.');
  }
}

export function getPoolStatus(): { totalCount: number; idleCount: number; waitingCount: number } {
  const p = getPool();
  return {
    totalCount: p.totalCount,
    idleCount: p.idleCount,
    waitingCount: p.waitingCount,
  };
}
```

### PM2 Ecosystem Configuration

Located in `ecosystem.config.cjs`:

```javascript
module.exports = {
  apps: [
    {
      name: 'flyer-crawler-api',
      script: 'server.ts',
      interpreter: 'tsx',

      // Graceful shutdown settings
      kill_timeout: 10000, // 10 seconds to clean up before SIGKILL
      wait_ready: true, // Wait for process.send('ready') before considering the app started
      listen_timeout: 10000, // Timeout for the ready signal

      // Single instance in fork mode (zero-downtime reloads require cluster mode)
      instances: 1,
      exec_mode: 'fork',

      // Environment variables
      env_production: {
        NODE_ENV: 'production',
        PORT: 3000,
      },
      env_test: {
        NODE_ENV: 'test',
        PORT: 3001,
      },
    },
  ],
};
```

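With `wait_ready: true`, PM2 waits for the application to send a `'ready'` message over the IPC channel, and the `server.ts` excerpt in this ADR does not show that call. A minimal sketch of how it might be sent follows; `notifyReady` is a hypothetical helper, and passing `send` as a parameter is only a choice made here to keep the function easy to test.

```typescript
type Send = (message: string) => unknown;

// Sends PM2's 'ready' message when an IPC channel exists.
// Returns false when started without one (for example plain `tsx server.ts`),
// where process.send is undefined.
export function notifyReady(send: Send | undefined): boolean {
  if (!send) return false;
  send('ready');
  return true;
}

// Hypothetical wiring in server.ts:
// const server = app.listen(PORT, () => {
//   notifyReady(process.send?.bind(process));
// });
```

Without this message, PM2 falls back to `listen_timeout` before treating the app as started.
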
### Worker Graceful Shutdown

BullMQ's `worker.close()` waits for active jobs to finish. The lock and stalled-job options control when an unfinished job is treated as stalled:

```typescript
import { Worker } from 'bullmq';

// `processor` and `connection` are defined elsewhere in the service.
const worker = new Worker('flyerQueue', processor, {
  connection,
  // Stalled-job detection (top-level worker options in BullMQ)
  lockDuration: 30000, // Time before job is considered stalled
  stalledInterval: 5000, // Check for stalled jobs every 5s
});

// Lifecycle events emitted while worker.close() runs
worker.on('closing', () => {
  logger.info('[Worker] flyerQueue worker is closing...');
});

worker.on('closed', () => {
  logger.info('[Worker] flyerQueue worker closed.');
});
```

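Since workers wait for their active jobs when closed, one reasonable ordering is to close workers first and queues second. The sketch below uses a structural `Closable` interface so it runs without Redis; BullMQ's `Worker` and `Queue` both expose a promise-returning `close()` and would fit it. `closeWorkersThenQueues` is a hypothetical helper, not a function from the codebase.

```typescript
interface Closable {
  close(): Promise<void>;
}

// Closes all workers (letting active jobs finish), then all queues.
// Returns the number of resources that failed to close.
export async function closeWorkersThenQueues(
  workers: Closable[],
  queues: Closable[],
): Promise<number> {
  const workerResults = await Promise.allSettled(workers.map(async (w) => w.close()));
  const queueResults = await Promise.allSettled(queues.map(async (q) => q.close()));
  return [...workerResults, ...queueResults].filter((r) => r.status === 'rejected').length;
}
```

The shared Redis connection would then be closed only after this function returns, so that neither workers nor queues lose their connection mid-close.
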
### Shutdown Sequence Diagram

```text
    SIGTERM Received
           │
           ▼
┌──────────────────────┐
│ Stop HTTP Server     │ ← No new connections accepted
│ (server.close())     │
└──────────────────────┘
           │
           ▼
┌──────────────────────┐
│ Wait for In-Flight   │ ← 5-second grace period
│ Requests             │
└──────────────────────┘
           │
           ▼
┌──────────────────────┐
│ Close BullMQ Queues  │ ← Stop processing new jobs
│ and Workers          │
└──────────────────────┘
           │
           ▼
┌──────────────────────┐
│ Close Redis          │ ← Disconnect from Redis
│ Connection           │
└──────────────────────┘
           │
           ▼
┌──────────────────────┐
│ Close Database Pool  │ ← Release all DB connections
│ (pool.end())         │
└──────────────────────┘
           │
           ▼
┌──────────────────────┐
│ process.exit(0)      │ ← Clean exit
└──────────────────────┘
```

## Consequences

### Positive

- **Zero Lost Work**: In-flight requests and jobs that finish within the grace period complete before shutdown.
- **Clean Resource Cleanup**: All connections are properly closed.
- **Zero-Downtime Deploys**: PM2 restarts no longer cut off in-flight requests; fully zero-downtime reloads additionally require cluster mode with more than one instance.
- **Observability**: Shutdown progress is logged for debugging.

### Negative

- **Shutdown Delay**: Takes 5-15 seconds to fully shut down; under PM2, `kill_timeout` caps this at 10 seconds before SIGKILL.
- **Complexity**: Multiple shutdown handlers must be coordinated.
- **Edge Cases**: Very long-running jobs may be killed if they exceed the grace period.

## Key Files

- `server.ts` - HTTP server shutdown and signal handling
- `src/services/queueService.server.ts` - Queue shutdown (`gracefulShutdown`)
- `src/services/db/connection.db.ts` - Database pool shutdown (`closePool`)
- `ecosystem.config.cjs` - PM2 configuration with `kill_timeout`

## Related ADRs

- [ADR-006](./0006-background-job-processing-and-task-queues.md) - Background Job Processing
- [ADR-020](./0020-health-checks-and-liveness-readiness-probes.md) - Health Checks
- [ADR-014](./0014-containerization-and-deployment-strategy.md) - Containerization