# ADR-061: PM2 Process Isolation Safeguards

## Status

Accepted

## Context

On 2026-02-17, during the v0.15.0 production deployment, ALL PM2 processes on the production server were terminated, not just the flyer-crawler processes. This critical incident caused unplanned downtime for multiple applications, including `stock-alert.projectium.com`.

### Problem Statement

Production and test environments share the same PM2 daemon on the server. This creates a risk where deployment scripts that operate on PM2 processes can accidentally affect processes belonging to other applications or environments.

### Pre-existing Controls

Prior to the incident, PM2 process isolation controls were already in place (commit `b6a62a0`):

- Production workflows used whitelist-based filtering with explicit process names
- Test workflows filtered by `-test` suffix pattern
- CLAUDE.md documented the prohibition of `pm2 stop all`, `pm2 delete all`, and `pm2 restart all`

Despite these controls being present in the codebase and included in v0.15.0, the incident still occurred. The leading hypothesis is that the Gitea runner executed a cached/older version of the workflow file.
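The two filtering styles listed above can be sketched as follows. This is a minimal illustration, not the workflows' actual code: the whitelist entries and the `stock-alert` process name are hypothetical, and the input is assumed to have the shape of `pm2 jlist` output (an array of objects with a `name` field).

```javascript
// Hypothetical whitelist; the real production process names may differ.
const PROD_WHITELIST = [
  'flyer-crawler-api',
  'flyer-crawler-worker',
  'flyer-crawler-analytics-worker',
];

// Production style: match only explicitly whitelisted names.
function selectProdTargets(list) {
  return list.filter((p) => PROD_WHITELIST.includes(p.name));
}

// Test style: match only names that carry the `-test` suffix.
function selectTestTargets(list) {
  return list.filter((p) => typeof p.name === 'string' && p.name.endsWith('-test'));
}

// Sample data (hypothetical names) standing in for `pm2 jlist` output.
const sample = [
  { name: 'flyer-crawler-api' },
  { name: 'flyer-crawler-api-test' },
  { name: 'stock-alert' },
];
console.log(selectProdTargets(sample).map((p) => p.name));
console.log(selectTestTargets(sample).map((p) => p.name));
```

Neither selector in this sketch matches the unrelated `stock-alert` entry, which is the property the pre-existing controls were meant to guarantee.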

### Requirements

1. Prevent accidental deletion of processes from other applications or environments
2. Provide audit trail for forensic analysis when incidents occur
3. Enable automatic abort when dangerous conditions are detected
4. Maintain visibility into PM2 operations during deployment
5. Work correctly even if the filtering logic itself is bypassed

## Decision

Implement a defense-in-depth strategy with 6 layers of safeguards in all deployment workflows that interact with PM2 processes.

### Safeguard Layers

#### Layer 1: Workflow Metadata Logging

Log workflow execution metadata at the start of each deployment:

```bash
echo "=== WORKFLOW METADATA ==="
echo "Workflow file: deploy-to-prod.yml"
echo "Workflow file hash: $(sha256sum .gitea/workflows/deploy-to-prod.yml | cut -d' ' -f1)"
echo "Git commit: $(git rev-parse HEAD)"
echo "Git branch: $(git rev-parse --abbrev-ref HEAD)"
echo "Timestamp: $(date -u '+%Y-%m-%d %H:%M:%S UTC')"
echo "Actor: ${{ gitea.actor }}"
echo "=== END METADATA ==="
```

**Purpose**: Enables verification of which workflow version was actually executed.
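One way to use the logged hash after the fact is to recompute the SHA-256 of the workflow file as committed and compare it with the value in the log. The sketch below is an assumption about how such a check could look; the function names are hypothetical and not part of the workflows.

```javascript
const crypto = require('node:crypto');

// Produces the same lowercase hex digest that `sha256sum <file> | cut -d' ' -f1` prints.
function sha256Hex(content) {
  return crypto.createHash('sha256').update(content).digest('hex');
}

// Returns true when the hash found in the workflow log matches the hash
// of the workflow file content as committed in the repository.
function workflowMatchesCommit(loggedHash, committedContent) {
  return sha256Hex(committedContent) === loggedHash.trim().toLowerCase();
}
```

The committed content could be obtained with `git show <commit>:.gitea/workflows/deploy-to-prod.yml`, using the commit printed in the same metadata block. A mismatch would support the hypothesis that the runner executed a different version of the file.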

#### Layer 2: Pre-Cleanup PM2 State Logging

Capture full PM2 process list before any modifications:

```bash
echo "=== PRE-CLEANUP PM2 STATE ==="
pm2 jlist
echo "=== END PRE-CLEANUP STATE ==="
```

**Purpose**: Provides forensic evidence of system state before cleanup.

#### Layer 3: Process Count Validation (SAFETY ABORT)

Abort deployment if the filter would delete ALL processes and there are more than 3 processes total:

```javascript
const totalProcesses = list.length;
if (targetProcesses.length === totalProcesses && totalProcesses > 3) {
  console.error('SAFETY ABORT: Filter would delete ALL processes!');
  console.error(
    'Total processes: ' + totalProcesses + ', Target processes: ' + targetProcesses.length,
  );
  process.exit(1);
}
```

**Purpose**: Catches filter bugs or unexpected conditions automatically.

**Threshold Rationale**: A threshold of 3 allows normal operation when only the expected processes exist (API, Worker, Analytics Worker) while catching anomalies when the server hosts additional applications.
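The abort rule can be isolated into a pure function to make its boundary cases easy to check. This is a sketch under the assumptions stated in this ADR (threshold of 3, inputs shaped like `pm2 jlist` output); `shouldAbort` and the helper `procs` are hypothetical names.

```javascript
// Threshold from this ADR: the three expected processes (API, Worker, Analytics Worker).
const SAFE_TOTAL_THRESHOLD = 3;

// Returns true when the deployment should abort: the filter matched every
// process on the daemon AND the daemon hosts more than the expected number.
function shouldAbort(list, targetProcesses, threshold = SAFE_TOTAL_THRESHOLD) {
  const totalProcesses = list.length;
  return targetProcesses.length === totalProcesses && totalProcesses > threshold;
}

// Helper that fabricates n placeholder process entries for the examples below.
const procs = (n) => Array.from({ length: n }, (_, i) => ({ name: 'proc-' + i }));

console.log(shouldAbort(procs(3), procs(3))); // false: only the expected 3 exist
console.log(shouldAbort(procs(5), procs(5))); // true: all 5 of 5 targeted
console.log(shouldAbort(procs(5), procs(3))); // false: 3 of 5 targeted
```

The rule only fires when every process is targeted. A filter that wrongly matches most but not all processes passes this check, so it still relies on the name logging in Layer 4 to be noticed.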

#### Layer 4: Explicit Name Verification

Log the exact name, status, and PM2 ID of each process that will be affected:

```javascript
console.log('Found ' + targetProcesses.length + ' PRODUCTION processes to clean:');
targetProcesses.forEach((p) => {
  console.log(
    ' - ' + p.name + ' (status: ' + p.pm2_env.status + ', pm_id: ' + p.pm2_env.pm_id + ')',
  );
});
```

**Purpose**: Provides clear visibility into cleanup operations.

#### Layer 5: Post-Cleanup Verification

After cleanup, verify environment isolation was maintained:

```bash
echo "=== POST-CLEANUP VERIFICATION ==="
pm2 jlist | node -e "
  const list = JSON.parse(require('fs').readFileSync(0, 'utf-8'));
  const prodProcesses = list.filter(p => p.name && p.name.startsWith('flyer-crawler-') && !p.name.endsWith('-test'));
  console.log('Production processes after cleanup: ' + prodProcesses.length);
"
echo "=== END POST-CLEANUP VERIFICATION ==="
```

**Purpose**: Immediately identifies cross-environment contamination.
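The snippet above reports a count. One possible extension, shown here only as a sketch, is to compare the pre-cleanup and post-cleanup `pm2 jlist` snapshots and flag any process that disappeared without being an intended target. The function name `findCollateralDeletions` and the sample process names are hypothetical.

```javascript
// Returns the names of processes that were present before cleanup, are absent
// afterwards, and were not in the list of intended targets.
function findCollateralDeletions(before, after, targetNames) {
  const remaining = new Set(after.map((p) => p.name));
  const intended = new Set(targetNames);
  return before
    .map((p) => p.name)
    .filter((name) => !remaining.has(name) && !intended.has(name));
}

// Hypothetical snapshots standing in for parsed `pm2 jlist` output.
const before = [
  { name: 'flyer-crawler-api' },
  { name: 'flyer-crawler-worker' },
  { name: 'flyer-crawler-api-test' },
  { name: 'stock-alert' },
];
const afterBad = [{ name: 'flyer-crawler-api-test' }];
const targets = ['flyer-crawler-api', 'flyer-crawler-worker'];

console.log(findCollateralDeletions(before, afterBad, targets));
```

A non-empty result could be treated as a deployment failure. The comparison is keyed by process name, so it assumes names are unique on the daemon; several instances sharing one name would need a comparison by count as well.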

#### Layer 6: PM2 Process List Persistence

**CRITICAL**: Save the PM2 process list after every state-changing operation:

```javascript
// After any pm2 start/stop/restart/delete operation, run `pm2 save`.

// Example: after the cleanup loop completes.
// execSync keeps the deletions and the final save strictly ordered.
const { execSync } = require('child_process');

targetProcesses.forEach((p) => {
  execSync('pm2 delete ' + p.pm2_env.pm_id);
});
execSync('pm2 save'); // Persist all deletions
```

**Purpose**: Ensures PM2 process state persists across daemon restarts, server reboots, and internal reconciliation events.

**Why This Matters**: PM2 maintains an in-memory process list. Without `pm2 save`, processes become ephemeral:

- Daemon restart → All unsaved processes disappear
- Server reboot → Process list reverts to last saved state
- PM2 internal reconciliation → Unsaved processes may be lost

**Pattern**: Every `pm2 start`, `pm2 restart`, `pm2 stop`, or `pm2 delete` MUST be followed by `pm2 save`.
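One way to make this pattern hard to forget is to route every state-changing command through a small wrapper that always saves afterwards. The following is a sketch, not the workflows' implementation; `pm2WithSave` is a hypothetical name, and the `run` parameter exists only so the ordering can be checked without PM2 installed.

```javascript
const { execSync } = require('node:child_process');

// Runs one state-changing PM2 command, then persists the process list.
// `run` defaults to a synchronous shell call and can be replaced in tests.
function pm2WithSave(args, run = (cmd) => execSync(cmd, { stdio: 'inherit' })) {
  run('pm2 ' + args);
  run('pm2 save');
}

// Dry run: record the commands instead of executing them.
const issued = [];
pm2WithSave('delete 4', (cmd) => issued.push(cmd));
console.log(issued);
```

Saving once after a whole cleanup loop, as in the example above, is cheaper than saving after each command; the wrapper trades a few extra `pm2 save` calls for not depending on the caller to remember the final one.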

## Consequences

### Positive

1. **Automatic Prevention**: Layer 3 (process count validation) can prevent catastrophic process deletion automatically, without human intervention.

2. **Forensic Capability**: Layers 1 and 2 provide the data needed to determine root cause after an incident.

3. **Visibility**: Layers 4 and 5 make PM2 operations transparent in workflow logs.

4. **Fail-Safe Design**: Even if individual layers fail, other layers provide backup protection.

5. **Non-Breaking**: Safeguards are additive and do not change the existing filtering logic.

### Negative

1. **Increased Log Volume**: Additional logging increases workflow output size.

2. **Minor Performance Impact**: Extra PM2 commands add a few seconds to deployment time.

3. **Threshold Tuning**: The threshold of 3 may need adjustment if the expected process count changes.

### Neutral

1. **Root Cause Still Unknown**: These safeguards mitigate the risk but do not definitively explain why the original incident occurred.

2. **No Structural Changes**: The underlying architecture (shared PM2 daemon) remains unchanged.

## Alternatives Considered

### PM2 Namespaces

PM2 supports namespaces to isolate groups of processes. This would provide complete isolation but requires:

- Changes to ecosystem config files
- Changes to all PM2 commands in workflows
- Potential breaking changes to monitoring and log aggregation

**Decision**: Deferred for future consideration. Current safeguards provide adequate protection.
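For reference, the ecosystem config change mentioned above might look roughly like the fragment below. It is a hypothetical sketch: the app names, script paths, and namespace values are placeholders, and the per-app `namespace` field should be confirmed against the documentation of the installed PM2 version before relying on it.

```javascript
// Hypothetical ecosystem.config.js fragment; names and paths are placeholders.
const apps = [
  {
    name: 'flyer-crawler-api',
    script: './dist/server.js',
    namespace: 'flyer-crawler-prod', // groups production processes
  },
  {
    name: 'flyer-crawler-api-test',
    script: './dist/server.js',
    namespace: 'flyer-crawler-test', // groups test processes
  },
];

module.exports = { apps };
```

With processes grouped this way, workflow commands would target a namespace instead of filtering by name, which is why every PM2 command in the workflows would need to change.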

### Separate PM2 Daemons

Running a separate PM2 daemon per application would eliminate cross-application risk entirely.

**Decision**: Not implemented due to increased operational complexity and the current safeguards being sufficient.
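As a rough illustration of what this alternative involves: PM2 keeps its daemon socket and state under the directory named by the `PM2_HOME` environment variable (by default `~/.pm2`), so giving each application its own `PM2_HOME` yields a separate daemon. The sketch below only builds the environment for such a call; the base directory and the function name `pm2EnvFor` are assumptions.

```javascript
const path = require('node:path');

// Builds the environment for invoking the pm2 CLI against an app-specific daemon.
// baseDir is a hypothetical location; any directory writable by the deploy user works.
function pm2EnvFor(appName, baseDir = '/var/lib/pm2-homes', baseEnv = process.env) {
  return { ...baseEnv, PM2_HOME: path.join(baseDir, appName) };
}

console.log(pm2EnvFor('flyer-crawler', '/var/lib/pm2-homes', {}).PM2_HOME);
```

Every workflow step, startup script, and monitoring hook would have to pass the matching environment, which is part of the operational complexity cited in the decision.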

### Deployment Locks

Implementing mutex-style locks to prevent concurrent deployments could prevent race conditions.

**Decision**: Not implemented as the current safeguards address the identified risk. May be reconsidered if concurrent deployment issues are observed.
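If this option is revisited, a mutex-style lock could be approximated with an exclusively created lock file. This is a minimal sketch under that assumption; `tryAcquireLock` and the lock path are hypothetical, and it does not handle stale locks left behind by a crashed deployment.

```javascript
const fs = require('node:fs');

// Tries to take the lock by creating the file exclusively ('wx' fails if it exists).
// Returns a release function on success, or null if the lock is already held.
function tryAcquireLock(lockPath) {
  let fd;
  try {
    fd = fs.openSync(lockPath, 'wx');
  } catch (err) {
    if (err.code === 'EEXIST') return null;
    throw err;
  }
  fs.writeSync(fd, String(process.pid)); // record the holder for debugging
  fs.closeSync(fd);
  return () => fs.unlinkSync(lockPath);
}
```

A deployment would call `tryAcquireLock` before touching PM2, abort if it returns `null`, and call the returned function once cleanup and `pm2 save` have finished.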

## Implementation

### Files Modified

| File                                       | Changes                |
| ------------------------------------------ | ---------------------- |
| `.gitea/workflows/deploy-to-prod.yml`      | All 5 safeguard layers |
| `.gitea/workflows/deploy-to-test.yml`      | All 5 safeguard layers |
| `.gitea/workflows/manual-deploy-major.yml` | All 5 safeguard layers |

### Validation

A standalone test file validates the safeguard logic:

- **File**: `tests/qa/test-pm2-safeguard-logic.js`
- **Coverage**: 11 scenarios covering normal operations and dangerous edge cases
- **Result**: All tests pass

## Related Documentation

- [Incident Report: 2026-02-17](../operations/INCIDENT-2026-02-17-PM2-PROCESS-KILL.md)
- [PM2 Incident Response Runbook](../operations/PM2-INCIDENT-RESPONSE.md)
- [Session Summary](../archive/sessions/PM2_SAFEGUARDS_SESSION_2026-02-17.md)
- [CLAUDE.md - PM2 Process Isolation](../../CLAUDE.md#pm2-process-isolation-productiontest-servers)
- [ADR-014: Containerization and Deployment Strategy](0014-containerization-and-deployment-strategy.md)

## References

- PM2 Documentation: https://pm2.keymetrics.io/docs/usage/application-declaration/
- Defense in Depth: https://en.wikipedia.org/wiki/Defense_in_depth_(computing)