PM2 Process Isolation
All checks were successful
Deploy to Test Environment / deploy-to-test (push) Successful in 30m15s
199
docs/adr/0061-pm2-process-isolation-safeguards.md
Normal file
@@ -0,0 +1,199 @@
# ADR-061: PM2 Process Isolation Safeguards

## Status

Accepted

## Context

On 2026-02-17, a critical incident occurred during the v0.15.0 production deployment: ALL PM2 processes on the production server were terminated, not just the flyer-crawler processes. This caused unplanned downtime for multiple applications, including `stock-alert.projectium.com`.

### Problem Statement

Production and test environments share the same PM2 daemon on the server. This creates a risk where deployment scripts that operate on PM2 processes can accidentally affect processes belonging to other applications or environments.

### Pre-existing Controls

Prior to the incident, PM2 process isolation controls were already in place (commit `b6a62a0`):

- Production workflows used whitelist-based filtering with explicit process names
- Test workflows filtered by `-test` suffix pattern
- CLAUDE.md documented the prohibition of `pm2 stop all`, `pm2 delete all`, and `pm2 restart all`

Despite these controls being present in the codebase and included in v0.15.0, the incident still occurred. The leading hypothesis is that the Gitea runner executed a cached/older version of the workflow file.

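The two pre-existing filtering strategies can be illustrated with a minimal sketch. This is not the workflows' actual code: the whitelist entries are hypothetical names, the function names are invented for illustration, and requiring the `flyer-crawler-` prefix in the test filter is an assumption (the controls above only mention the `-test` suffix).

```javascript
// Hypothetical whitelist; the real production process names live in the workflows.
const PROD_WHITELIST = [
  'flyer-crawler-api',
  'flyer-crawler-worker',
  'flyer-crawler-analytics-worker',
];

// Production: select only processes whose names appear in the explicit whitelist.
function selectProdTargets(list) {
  return list.filter((p) => PROD_WHITELIST.includes(p.name));
}

// Test: select processes whose names end in "-test".
// Assumption: the flyer-crawler- prefix is also required, so other apps' test processes are skipped.
function selectTestTargets(list) {
  return list.filter(
    (p) =>
      typeof p.name === 'string' &&
      p.name.startsWith('flyer-crawler-') &&
      p.name.endsWith('-test'),
  );
}

// Example input shaped like `pm2 jlist` output (only `name` is used here).
const sample = [
  { name: 'flyer-crawler-api' },
  { name: 'flyer-crawler-api-test' },
  { name: 'stock-alert' },
];

console.log(selectProdTargets(sample).map((p) => p.name));
console.log(selectTestTargets(sample).map((p) => p.name));
```

In this sketch neither filter ever selects `stock-alert`, which is the property the controls were meant to guarantee.
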
### Requirements

1. Prevent accidental deletion of processes from other applications or environments
2. Provide audit trail for forensic analysis when incidents occur
3. Enable automatic abort when dangerous conditions are detected
4. Maintain visibility into PM2 operations during deployment
5. Work correctly even if the filtering logic itself is bypassed

## Decision

Implement a defense-in-depth strategy with 5 layers of safeguards in all deployment workflows that interact with PM2 processes.

### Safeguard Layers

#### Layer 1: Workflow Metadata Logging

Log workflow execution metadata at the start of each deployment:

```bash
echo "=== WORKFLOW METADATA ==="
echo "Workflow file: deploy-to-prod.yml"
echo "Workflow file hash: $(sha256sum .gitea/workflows/deploy-to-prod.yml | cut -d' ' -f1)"
echo "Git commit: $(git rev-parse HEAD)"
echo "Git branch: $(git rev-parse --abbrev-ref HEAD)"
echo "Timestamp: $(date -u '+%Y-%m-%d %H:%M:%S UTC')"
echo "Actor: ${{ gitea.actor }}"
echo "=== END METADATA ==="
```

**Purpose**: Enables verification of which workflow version was actually executed.

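To act on the logged hash after an incident, one could compare it with the hash of the workflow file at the commit that was expected to run. The following is a minimal Node.js sketch under that assumption; `sha256Hex` and `matchesLoggedHash` are hypothetical helper names, not part of the workflows.

```javascript
const crypto = require('crypto');

// Hex-encoded SHA-256 of a string or Buffer (the same digest format `sha256sum` prints).
function sha256Hex(content) {
  return crypto.createHash('sha256').update(content).digest('hex');
}

// True when the content hashes to the value recorded in the workflow log.
function matchesLoggedHash(content, loggedHash) {
  return sha256Hex(content) === loggedHash.trim().toLowerCase();
}

// Usage sketch (the logged hash below is a placeholder, not real data):
// const fs = require('fs');
// const file = fs.readFileSync('.gitea/workflows/deploy-to-prod.yml');
// console.log(matchesLoggedHash(file, '<hash copied from the workflow log>'));
```

A mismatch would support the hypothesis that the runner executed a different version of the workflow file than the one in the deployed commit.
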
#### Layer 2: Pre-Cleanup PM2 State Logging

Capture full PM2 process list before any modifications:

```bash
echo "=== PRE-CLEANUP PM2 STATE ==="
pm2 jlist
echo "=== END PRE-CLEANUP STATE ==="
```

**Purpose**: Provides forensic evidence of system state before cleanup.

#### Layer 3: Process Count Validation (SAFETY ABORT)

Abort deployment if the filter would delete ALL processes and there are more than 3 processes total:

```javascript
// `list` is the parsed `pm2 jlist` output; `targetProcesses` is the subset selected by the cleanup filter.
const totalProcesses = list.length;
if (targetProcesses.length === totalProcesses && totalProcesses > 3) {
  console.error('SAFETY ABORT: Filter would delete ALL processes!');
  console.error(
    'Total processes: ' + totalProcesses + ', Target processes: ' + targetProcesses.length,
  );
  process.exit(1);
}
```

**Purpose**: Catches filter bugs or unexpected conditions automatically.

**Threshold Rationale**: A threshold of 3 allows normal operation when only the expected processes exist (API, Worker, Analytics Worker) while catching anomalies when the server hosts additional applications.

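The abort rule is easier to reason about when wrapped in a pure function and run against a few scenarios. The sketch below restates the same condition; `shouldAbort` and `SAFETY_THRESHOLD` are hypothetical names introduced for illustration, and the scenarios are made up rather than taken from the incident.

```javascript
// Same rule as the workflow snippet, expressed as a testable function.
const SAFETY_THRESHOLD = 3;

function shouldAbort(totalProcesses, targetCount, threshold = SAFETY_THRESHOLD) {
  return targetCount === totalProcesses && totalProcesses > threshold;
}

const scenarios = [
  { total: 3, targets: 3 }, // only the three expected processes exist: proceed
  { total: 8, targets: 3 }, // shared server, filter selects only its own processes: proceed
  { total: 8, targets: 8 }, // filter matched everything on a shared server: abort
  { total: 0, targets: 0 }, // empty daemon: proceed
];

for (const s of scenarios) {
  console.log(`total=${s.total} targets=${s.targets} abort=${shouldAbort(s.total, s.targets)}`);
}
```

Note that the rule only fires when every process is targeted; a filter that wrongly matched 7 of 8 processes would pass this check, which is one reason the other layers exist.
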
#### Layer 4: Explicit Name Verification

Log the exact name, status, and PM2 ID of each process that will be affected:

```javascript
console.log('Found ' + targetProcesses.length + ' PRODUCTION processes to clean:');
targetProcesses.forEach((p) => {
  console.log(
    '  - ' + p.name + ' (status: ' + p.pm2_env.status + ', pm_id: ' + p.pm2_env.pm_id + ')',
  );
});
```

**Purpose**: Provides clear visibility into cleanup operations.

#### Layer 5: Post-Cleanup Verification

After cleanup, verify environment isolation was maintained:

```bash
echo "=== POST-CLEANUP VERIFICATION ==="
pm2 jlist | node -e "
  const list = JSON.parse(require('fs').readFileSync(0, 'utf-8'));
  const prodProcesses = list.filter(p => p.name && p.name.startsWith('flyer-crawler-') && !p.name.endsWith('-test'));
  console.log('Production processes after cleanup: ' + prodProcesses.length);
"
echo "=== END POST-CLEANUP VERIFICATION ==="
```

**Purpose**: Immediately identifies cross-environment contamination.

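One way to turn the pre-cleanup and post-cleanup snapshots into an explicit contamination check is to diff them by process name. The sketch below is an illustration only: `findUnexpectedRemovals` is a hypothetical helper, the sample names are invented, and it assumes process names are unique within the PM2 list.

```javascript
// Names present before cleanup, absent afterwards, and not in the intended target set.
function findUnexpectedRemovals(before, after, targetNames) {
  const remaining = new Set(after.map((p) => p.name));
  const targets = new Set(targetNames);
  return before
    .map((p) => p.name)
    .filter((name) => !remaining.has(name) && !targets.has(name));
}

// Snapshots shaped like `pm2 jlist` output (only `name` is used here).
const before = [
  { name: 'flyer-crawler-api' },
  { name: 'flyer-crawler-worker' },
  { name: 'stock-alert' },
];
const after = []; // everything was removed, as in the incident scenario
const intendedTargets = ['flyer-crawler-api', 'flyer-crawler-worker'];

console.log(findUnexpectedRemovals(before, after, intendedTargets));
```

An empty result would indicate that only the intended processes were removed; any other name is a process that disappeared without being targeted.
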
## Consequences

### Positive

1. **Automatic Prevention**: Layer 3 (process count validation) can prevent catastrophic process deletion automatically, without human intervention.

2. **Forensic Capability**: Layers 1 and 2 provide the data needed to determine root cause after an incident.

3. **Visibility**: Layers 4 and 5 make PM2 operations transparent in workflow logs.

4. **Fail-Safe Design**: Even if individual layers fail, other layers provide backup protection.

5. **Non-Breaking**: Safeguards are additive and do not change the existing filtering logic.

### Negative

1. **Increased Log Volume**: Additional logging increases workflow output size.

2. **Minor Performance Impact**: Extra PM2 commands add a few seconds to deployment time.

3. **Threshold Tuning**: The threshold of 3 may need adjustment if the expected process count changes.

### Neutral

1. **Root Cause Still Unknown**: These safeguards mitigate the risk but do not definitively explain why the original incident occurred.

2. **No Structural Changes**: The underlying architecture (shared PM2 daemon) remains unchanged.

## Alternatives Considered

### PM2 Namespaces

PM2 supports namespaces to isolate groups of processes. This would provide complete isolation but requires:

- Changes to ecosystem config files
- Changes to all PM2 commands in workflows
- Potential breaking changes to monitoring and log aggregation

**Decision**: Deferred for future consideration. Current safeguards provide adequate protection.

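For reference, the ecosystem config change mentioned above might look roughly like the sketch below. It assumes the installed PM2 version accepts a `namespace` field per app in the ecosystem file; the app names, namespace names, and script paths are all hypothetical and are not taken from this repository.

```javascript
// ecosystem.config.js (sketch only; names and paths are hypothetical).
const ecosystem = {
  apps: [
    {
      name: 'flyer-crawler-api',
      namespace: 'flyer-crawler-prod', // assumption: per-app namespace field
      script: './dist/server.js',
    },
    {
      name: 'flyer-crawler-api-test',
      namespace: 'flyer-crawler-test',
      script: './dist/server.js',
    },
  ],
};

module.exports = ecosystem;
```

With a layout like this, cleanup commands could in principle target a namespace rather than a name pattern; the exact CLI invocation should be checked against the PM2 documentation for the installed version before adoption.
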
### Separate PM2 Daemons

Running a separate PM2 daemon per application would eliminate cross-application risk entirely.

**Decision**: Not implemented due to increased operational complexity and the current safeguards being sufficient.

### Deployment Locks

Implementing mutex-style locks to prevent concurrent deployments could prevent race conditions.

**Decision**: Not implemented as the current safeguards address the identified risk. May be reconsidered if concurrent deployment issues are observed.

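If deployment locks are reconsidered, a mutex-style lock can be approximated with an exclusively created lock file. The sketch below is one possible approach, not something the workflows currently do; `acquireLock`, `releaseLock`, and the lock path are hypothetical, and it does not handle stale locks left behind by a crashed deployment.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Try to take the lock; returns true on success, false if the lock file already exists.
function acquireLock(lockPath) {
  try {
    // 'wx' creates the file and fails with EEXIST if it is already present.
    const fd = fs.openSync(lockPath, 'wx');
    fs.writeSync(fd, String(process.pid)); // record the holder for debugging
    fs.closeSync(fd);
    return true;
  } catch (err) {
    if (err.code === 'EEXIST') return false;
    throw err;
  }
}

// Remove the lock file; does nothing if it is already gone.
function releaseLock(lockPath) {
  fs.rmSync(lockPath, { force: true });
}

// Demo with a hypothetical lock location in the temp directory.
const lockPath = path.join(os.tmpdir(), `deploy-lock-demo-${process.pid}.lock`);
releaseLock(lockPath); // start from a clean state
console.log(acquireLock(lockPath)); // first deployment takes the lock
console.log(acquireLock(lockPath)); // a concurrent deployment is refused
releaseLock(lockPath);
```

A deployment that fails to acquire the lock would exit early instead of running PM2 cleanup concurrently with another deployment.
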
## Implementation

### Files Modified

| File                                       | Changes                |
| ------------------------------------------ | ---------------------- |
| `.gitea/workflows/deploy-to-prod.yml`      | All 5 safeguard layers |
| `.gitea/workflows/deploy-to-test.yml`      | All 5 safeguard layers |
| `.gitea/workflows/manual-deploy-major.yml` | All 5 safeguard layers |

### Validation

A standalone test file validates the safeguard logic:

- **File**: `tests/qa/test-pm2-safeguard-logic.js`
- **Coverage**: 11 scenarios covering normal operations and dangerous edge cases
- **Result**: All tests pass

## Related Documentation

- [Incident Report: 2026-02-17](../operations/INCIDENT-2026-02-17-PM2-PROCESS-KILL.md)
- [PM2 Incident Response Runbook](../operations/PM2-INCIDENT-RESPONSE.md)
- [Session Summary](../archive/sessions/PM2_SAFEGUARDS_SESSION_2026-02-17.md)
- [CLAUDE.md - PM2 Process Isolation](../../CLAUDE.md#pm2-process-isolation-productiontest-servers)
- [ADR-014: Containerization and Deployment Strategy](0014-containerization-and-deployment-strategy.md)

## References

- PM2 Documentation: https://pm2.keymetrics.io/docs/usage/application-declaration/
- Defense in Depth: https://en.wikipedia.org/wiki/Defense_in_depth_(computing)

@@ -56,6 +56,7 @@ This directory contains a log of the architectural decisions made for the Flyer
**[ADR-038](./0038-graceful-shutdown-pattern.md)**: Graceful Shutdown Pattern (Accepted)
**[ADR-053](./0053-worker-health-checks.md)**: Worker Health Checks and Stalled Job Monitoring (Accepted)
**[ADR-054](./0054-bugsink-gitea-issue-sync.md)**: Bugsink to Gitea Issue Synchronization (Proposed)
**[ADR-061](./0061-pm2-process-isolation-safeguards.md)**: PM2 Process Isolation Safeguards (Accepted)

## 7. Frontend / User Interface
