PM2 Process Isolation
All checks were successful
Deploy to Test Environment / deploy-to-test (push) Successful in 30m15s

2026-02-17 20:46:28 -08:00
parent 93ad624658
commit c059b30201
11 changed files with 2228 additions and 7 deletions


@@ -0,0 +1,199 @@
# ADR-061: PM2 Process Isolation Safeguards
## Status
Accepted
## Context
On 2026-02-17, a critical incident occurred during the v0.15.0 production deployment: all PM2 processes on the production server were terminated, not only the flyer-crawler processes. This caused unplanned downtime for multiple applications, including `stock-alert.projectium.com`.
### Problem Statement
Production and test environments share the same PM2 daemon on the server. As a result, a deployment script that operates on PM2 processes can accidentally affect processes belonging to other applications or environments.
### Pre-existing Controls
Prior to the incident, PM2 process isolation controls were already in place (commit `b6a62a0`):
- Production workflows used whitelist-based filtering with explicit process names
- Test workflows filtered by `-test` suffix pattern
- CLAUDE.md documented the prohibition of `pm2 stop all`, `pm2 delete all`, and `pm2 restart all`
Although these controls were present in the codebase and included in v0.15.0, the incident still occurred. The leading hypothesis is that the Gitea runner executed a cached or older version of the workflow file.
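As a rough illustration of the two filtering styles listed above, the sketch below contrasts whitelist-based and suffix-based selection. The helper names, the entries in `PRODUCTION_WHITELIST`, and the `stock-alert` process name are hypothetical and chosen only for illustration; the actual workflow code is not reproduced here. Each entry in `list` is assumed to have the shape returned by `pm2 jlist`.

```javascript
// Hypothetical whitelist; the real production process names may differ.
const PRODUCTION_WHITELIST = [
  'flyer-crawler-api',
  'flyer-crawler-worker',
  'flyer-crawler-analytics-worker',
];

// Whitelist-based filtering: only exact, explicitly listed names are selected.
function selectProductionTargets(list, whitelist = PRODUCTION_WHITELIST) {
  return list.filter((p) => typeof p.name === 'string' && whitelist.includes(p.name));
}

// Suffix-based filtering: only names ending in "-test" are selected.
function selectTestTargets(list) {
  return list.filter((p) => typeof p.name === 'string' && p.name.endsWith('-test'));
}

// Example input shaped like a trimmed-down `pm2 jlist` result (names are made up).
const sample = [
  { name: 'flyer-crawler-api' },
  { name: 'flyer-crawler-api-test' },
  { name: 'stock-alert' },
];

console.log(selectProductionTargets(sample).map((p) => p.name));
console.log(selectTestTargets(sample).map((p) => p.name));
```

In this sketch neither filter can select `stock-alert`, which is the property the pre-existing controls were meant to guarantee.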
### Requirements
1. Prevent accidental deletion of processes from other applications or environments
2. Provide audit trail for forensic analysis when incidents occur
3. Enable automatic abort when dangerous conditions are detected
4. Maintain visibility into PM2 operations during deployment
5. Work correctly even if the filtering logic itself is bypassed
## Decision
Implement a defense-in-depth strategy with 5 layers of safeguards in all deployment workflows that interact with PM2 processes.
### Safeguard Layers
#### Layer 1: Workflow Metadata Logging
Log workflow execution metadata at the start of each deployment:
```bash
echo "=== WORKFLOW METADATA ==="
echo "Workflow file: deploy-to-prod.yml"
echo "Workflow file hash: $(sha256sum .gitea/workflows/deploy-to-prod.yml | cut -d' ' -f1)"
echo "Git commit: $(git rev-parse HEAD)"
echo "Git branch: $(git rev-parse --abbrev-ref HEAD)"
echo "Timestamp: $(date -u '+%Y-%m-%d %H:%M:%S UTC')"
echo "Actor: ${{ gitea.actor }}"
echo "=== END METADATA ==="
```
**Purpose**: Enables verification of which workflow version was actually executed.
#### Layer 2: Pre-Cleanup PM2 State Logging
Capture full PM2 process list before any modifications:
```bash
echo "=== PRE-CLEANUP PM2 STATE ==="
pm2 jlist
echo "=== END PRE-CLEANUP STATE ==="
```
**Purpose**: Provides forensic evidence of system state before cleanup.
#### Layer 3: Process Count Validation (SAFETY ABORT)
Abort the deployment if the filter would delete every process on the daemon and more than 3 processes exist in total:
```javascript
// Assumes `list` is the parsed `pm2 jlist` output and `targetProcesses`
// is the filtered subset selected for deletion.
const totalProcesses = list.length;
if (targetProcesses.length === totalProcesses && totalProcesses > 3) {
  console.error('SAFETY ABORT: Filter would delete ALL processes!');
  console.error(
    'Total processes: ' + totalProcesses + ', Target processes: ' + targetProcesses.length,
  );
  process.exit(1);
}
```
**Purpose**: Catches filter bugs or unexpected conditions automatically.
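**Threshold Rationale**: With a threshold of 3, cleanup proceeds normally when the daemon holds only the three expected processes (API, Worker, Analytics Worker), even though the filter then matches all of them. A match on every process when more than 3 exist suggests that the filter is also catching processes from other applications, so the deployment aborts.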
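To make the abort condition concrete, the following sketch isolates it as a pure function and runs it against a few process counts. The function name `shouldAbortCleanup` and the scenario values are illustrative assumptions, not taken from the workflow files.

```javascript
// Returns true when cleanup should be aborted: the filter matched every
// process on the daemon AND the daemon holds more than `threshold` processes.
function shouldAbortCleanup(totalCount, targetCount, threshold = 3) {
  return targetCount === totalCount && totalCount > threshold;
}

// Illustrative scenarios (counts are made up for the example).
const scenarios = [
  { total: 3, targets: 3 }, // only the three expected processes exist: proceed
  { total: 8, targets: 3 }, // shared daemon, filter matched a subset: proceed
  { total: 8, targets: 8 }, // filter matched everything on a shared daemon: abort
  { total: 0, targets: 0 }, // empty daemon, nothing to delete: proceed
];

for (const s of scenarios) {
  const verdict = shouldAbortCleanup(s.total, s.targets) ? 'ABORT' : 'proceed';
  console.log(`total=${s.total} targets=${s.targets} -> ${verdict}`);
}
```

The check fires only when every process is targeted. A filter that wrongly matches most, but not all, processes would pass this layer, which is one reason the logging layers remain useful alongside it.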
#### Layer 4: Explicit Name Verification
Log the exact name, status, and PM2 ID of each process that will be affected:
```javascript
// Print one line per process that the cleanup step is about to remove.
console.log('Found ' + targetProcesses.length + ' PRODUCTION processes to clean:');
targetProcesses.forEach((p) => {
  console.log(
    '  - ' + p.name + ' (status: ' + p.pm2_env.status + ', pm_id: ' + p.pm2_env.pm_id + ')',
  );
});
```
**Purpose**: Provides clear visibility into cleanup operations.
#### Layer 5: Post-Cleanup Verification
After cleanup, verify environment isolation was maintained:
```bash
echo "=== POST-CLEANUP VERIFICATION ==="
pm2 jlist | node -e "
const list = JSON.parse(require('fs').readFileSync(0, 'utf-8'));
const prodProcesses = list.filter(p => p.name && p.name.startsWith('flyer-crawler-') && !p.name.endsWith('-test'));
console.log('Production processes after cleanup: ' + prodProcesses.length);
"
echo "=== END POST-CLEANUP VERIFICATION ==="
```
**Purpose**: Immediately identifies cross-environment contamination.
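One way to turn the before and after snapshots into an explicit contamination check is to compare them. The sketch below is an assumption about how that could be done, not the verification the workflows actually perform; `findCollateralRemovals` and the sample process names are hypothetical, and each snapshot is assumed to be a parsed `pm2 jlist` array.

```javascript
// Returns the names that were present before cleanup, are missing afterwards,
// and were NOT in the intended target set (i.e. collateral removals).
function findCollateralRemovals(before, after, targetNames) {
  const remaining = new Set(after.map((p) => p.name));
  const intended = new Set(targetNames);
  return before
    .map((p) => p.name)
    .filter((name) => !remaining.has(name) && !intended.has(name));
}

// Made-up snapshots for illustration.
const beforeCleanup = [
  { name: 'flyer-crawler-api' },
  { name: 'flyer-crawler-api-test' },
  { name: 'stock-alert' },
];
const afterCleanup = [{ name: 'flyer-crawler-api-test' }];

const collateral = findCollateralRemovals(beforeCleanup, afterCleanup, ['flyer-crawler-api']);
if (collateral.length > 0) {
  console.error('Unexpected removals: ' + collateral.join(', '));
}
```

Here `stock-alert` is reported because it disappeared without being a cleanup target, while the intended removal of `flyer-crawler-api` is not flagged.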
## Consequences
### Positive
1. **Automatic Prevention**: Layer 3 (process count validation) can prevent catastrophic process deletion automatically, without human intervention.
2. **Forensic Capability**: Layers 1 and 2 provide the data needed to determine root cause after an incident.
3. **Visibility**: Layers 4 and 5 make PM2 operations transparent in workflow logs.
4. **Fail-Safe Design**: Even if individual layers fail, other layers provide backup protection.
5. **Non-Breaking**: Safeguards are additive and do not change the existing filtering logic.
### Negative
1. **Increased Log Volume**: Additional logging increases workflow output size.
2. **Minor Performance Impact**: Extra PM2 commands add a few seconds to deployment time.
3. **Threshold Tuning**: The threshold of 3 may need adjustment if the expected process count changes.
### Neutral
1. **Root Cause Still Unknown**: These safeguards mitigate the risk but do not definitively explain why the original incident occurred.
2. **No Structural Changes**: The underlying architecture (shared PM2 daemon) remains unchanged.
## Alternatives Considered
### PM2 Namespaces
PM2 supports namespaces for grouping processes so that commands can target a single group. This would separate each application's processes more cleanly, but requires:
- Changes to ecosystem config files
- Changes to all PM2 commands in workflows
- Potential breaking changes to monitoring and log aggregation
**Decision**: Deferred for future consideration. Current safeguards provide adequate protection.
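For reference, the ecosystem-config change listed above might look something like the following sketch. It assumes the installed PM2 version accepts a `namespace` field on each app entry; the app names, script path, and namespace labels are hypothetical and do not come from the project's real configuration.

```javascript
// ecosystem.config.js (sketch): every value here is a placeholder.
const ecosystem = {
  apps: [
    {
      name: 'flyer-crawler-api',
      script: './dist/server.js', // hypothetical entry point
      namespace: 'flyer-crawler-prod', // groups production processes
    },
    {
      name: 'flyer-crawler-api-test',
      script: './dist/server.js', // hypothetical entry point
      namespace: 'flyer-crawler-test', // groups test processes
    },
  ],
};

module.exports = ecosystem;
```

Workflow commands would then need to be rewritten to address a namespace rather than individual process names, which is the migration cost noted in the list above.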
### Separate PM2 Daemons
Running a separate PM2 daemon per application would eliminate cross-application risk entirely.
**Decision**: Not implemented due to increased operational complexity and the current safeguards being sufficient.
### Deployment Locks
Implementing mutex-style locks to prevent concurrent deployments could prevent race conditions.
**Decision**: Not implemented as the current safeguards address the identified risk. May be reconsidered if concurrent deployment issues are observed.
## Implementation
### Files Modified
| File | Changes |
| ------------------------------------------ | ---------------------- |
| `.gitea/workflows/deploy-to-prod.yml` | All 5 safeguard layers |
| `.gitea/workflows/deploy-to-test.yml` | All 5 safeguard layers |
| `.gitea/workflows/manual-deploy-major.yml` | All 5 safeguard layers |
### Validation
A standalone test file validates the safeguard logic:
- **File**: `tests/qa/test-pm2-safeguard-logic.js`
- **Coverage**: 11 scenarios covering normal operations and dangerous edge cases
- **Result**: All tests pass
## Related Documentation
- [Incident Report: 2026-02-17](../operations/INCIDENT-2026-02-17-PM2-PROCESS-KILL.md)
- [PM2 Incident Response Runbook](../operations/PM2-INCIDENT-RESPONSE.md)
- [Session Summary](../archive/sessions/PM2_SAFEGUARDS_SESSION_2026-02-17.md)
- [CLAUDE.md - PM2 Process Isolation](../../CLAUDE.md#pm2-process-isolation-productiontest-servers)
- [ADR-014: Containerization and Deployment Strategy](0014-containerization-and-deployment-strategy.md)
## References
- PM2 Documentation: https://pm2.keymetrics.io/docs/usage/application-declaration/
- Defense in Depth: https://en.wikipedia.org/wiki/Defense_in_depth_(computing)


@@ -56,6 +56,7 @@ This directory contains a log of the architectural decisions made for the Flyer
**[ADR-038](./0038-graceful-shutdown-pattern.md)**: Graceful Shutdown Pattern (Accepted)
**[ADR-053](./0053-worker-health-checks.md)**: Worker Health Checks and Stalled Job Monitoring (Accepted)
**[ADR-054](./0054-bugsink-gitea-issue-sync.md)**: Bugsink to Gitea Issue Synchronization (Proposed)
**[ADR-061](./0061-pm2-process-isolation-safeguards.md)**: PM2 Process Isolation Safeguards (Accepted)
## 7. Frontend / User Interface