PM2 Process Isolation
2026-02-17 20:46:28 -08:00
parent 93ad624658
commit c059b30201
11 changed files with 2228 additions and 7 deletions

# PM2 Process Isolation Safeguards Project
**Session Date**: 2026-02-17
**Status**: Completed
**Triggered By**: Critical production incident during v0.15.0 deployment
---
## Executive Summary
On 2026-02-17, during the v0.15.0 production deployment, ALL PM2 processes on the production server were killed, not just the flyer-crawler processes. This caused unplanned downtime for multiple applications, including `stock-alert.projectium.com`.
Despite PM2 process isolation fixes already being in place (commit `b6a62a0`), the incident still occurred. Investigation suggests the Gitea runner may have executed a cached/older version of the workflow files. In response, we implemented a comprehensive defense-in-depth strategy with 5 layers of safeguards across all deployment workflows.
---
## Incident Background
### What Happened
| Aspect | Detail |
| --------------------- | ------------------------------------------------------- |
| **Date/Time** | 2026-02-17 ~07:40 UTC |
| **Trigger** | v0.15.0 production deployment via `deploy-to-prod.yml` |
| **Impact** | ALL PM2 processes killed (all environments) |
| **Collateral Damage** | `stock-alert.projectium.com` and other PM2-managed apps |
| **Severity** | P1 - Critical |
### Key Mystery
The PM2 process isolation fix was already implemented in commit `b6a62a0` (2026-02-13) and was included in v0.15.0. The fix correctly used whitelist-based filtering:
```javascript
const prodProcesses = [
  'flyer-crawler-api',
  'flyer-crawler-worker',
  'flyer-crawler-analytics-worker',
];
list.forEach((p) => {
  if (
    (p.pm2_env.status === 'errored' || p.pm2_env.status === 'stopped') &&
    prodProcesses.includes(p.name)
  ) {
    exec('pm2 delete ' + p.pm2_env.pm_id);
  }
});
```
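As a minimal sketch of why this filter should leave other applications alone, the selection step can be isolated into a pure function and run against sample data. The function name `selectCleanupTargets`, the process name `stock-alert`, and the sample list are hypothetical; only the whitelist names and the status check come from the snippet above.

```javascript
// Hypothetical helper: isolates the selection logic from the snippet above.
// `list` is assumed to have the shape of parsed `pm2 jlist` output.
const prodProcesses = [
  'flyer-crawler-api',
  'flyer-crawler-worker',
  'flyer-crawler-analytics-worker',
];

function selectCleanupTargets(list, whitelist) {
  return list.filter(
    (p) =>
      (p.pm2_env.status === 'errored' || p.pm2_env.status === 'stopped') &&
      whitelist.includes(p.name),
  );
}

// Illustrative data only: one errored production process, one errored
// process belonging to another application, and one healthy process.
const sample = [
  { name: 'flyer-crawler-api', pm2_env: { status: 'errored', pm_id: 0 } },
  { name: 'stock-alert', pm2_env: { status: 'errored', pm_id: 1 } },
  { name: 'flyer-crawler-worker', pm2_env: { status: 'online', pm_id: 2 } },
];

const targets = selectCleanupTargets(sample, prodProcesses);
console.log(targets.map((p) => p.name)); // [ 'flyer-crawler-api' ]
```

With this logic, an errored process outside the whitelist is never selected, which is consistent with the view that the filter itself was not the cause of the incident.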
**Hypothesis**: Gitea runner executed a cached older version of the workflow file that did not contain the fix.
---
## Solution: Defense-in-Depth Safeguards
Rather than relying solely on the filter logic (which may be correct but not executed), we implemented 5 layers of safeguards that provide visibility, validation, and automatic abort capabilities.
### Safeguard Layers
| Layer | Name | Purpose |
| ----- | --------------------------------- | ------------------------------------------------------- |
| 1 | **Workflow Metadata Logging** | Audit trail of which workflow version actually executed |
| 2 | **Pre-Cleanup PM2 State Logging** | Capture full process list before any modifications |
| 3 | **Process Count Validation** | SAFETY ABORT if filter would delete ALL processes |
| 4 | **Explicit Name Verification** | Log exactly which processes will be affected |
| 5 | **Post-Cleanup Verification** | Verify environment isolation after cleanup |
### Layer Details
#### Layer 1: Workflow Metadata Logging
Logs at the start of deployment:
- Workflow file name
- SHA-256 hash of the workflow file
- Git commit being deployed
- Git branch
- Timestamp (UTC)
- Actor (who triggered the deployment)
**Purpose**: If an incident occurs, we can verify whether the executed workflow matches the repository version.
```bash
echo "=== WORKFLOW METADATA ==="
echo "Workflow file: deploy-to-prod.yml"
echo "Workflow file hash: $(sha256sum .gitea/workflows/deploy-to-prod.yml | cut -d' ' -f1)"
echo "Git commit: $(git rev-parse HEAD)"
echo "Timestamp: $(date -u '+%Y-%m-%d %H:%M:%S UTC')"
echo "Actor: ${{ gitea.actor }}"
echo "=== END METADATA ==="
```
#### Layer 2: Pre-Cleanup PM2 State Logging
Captures full PM2 process list in JSON format before any modifications.
**Purpose**: Provides forensic evidence of what processes existed before cleanup began.
```bash
echo "=== PRE-CLEANUP PM2 STATE ==="
pm2 jlist
echo "=== END PRE-CLEANUP STATE ==="
```
#### Layer 3: Process Count Validation (SAFETY ABORT)
The most critical safeguard. Aborts the entire deployment if the filter would delete ALL processes and there are more than 3 processes total.
**Purpose**: Catches filter bugs or unexpected conditions that would result in catastrophic process deletion.
```javascript
// SAFEGUARD 1: Process count validation
const totalProcesses = list.length;
if (targetProcesses.length === totalProcesses && totalProcesses > 3) {
  console.error('SAFETY ABORT: Filter would delete ALL processes!');
  console.error(
    'Total processes: ' + totalProcesses + ', Target processes: ' + targetProcesses.length,
  );
  console.error('This indicates a potential filter bug. Aborting cleanup.');
  process.exit(1);
}
```
**Threshold Rationale**: The threshold of 3 allows normal operation when only the 3 expected processes exist (API, Worker, Analytics Worker) while catching anomalies when the server hosts more applications.
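To make the threshold behavior easier to reason about, the condition can be expressed as a pure function. This is a sketch rather than the workflow code: the names `shouldAbort` and `ABORT_THRESHOLD` are hypothetical, and the constant simply mirrors the literal `3` in the snippet above.

```javascript
// Hypothetical pure-function form of the process count validation.
// ABORT_THRESHOLD mirrors the literal 3 used in the workflow snippet.
const ABORT_THRESHOLD = 3;

function shouldAbort(totalProcesses, targetCount) {
  return targetCount === totalProcesses && totalProcesses > ABORT_THRESHOLD;
}

console.log(shouldAbort(3, 3)); // false: fresh server, at the threshold boundary
console.log(shouldAbort(4, 4)); // true: just above the threshold
console.log(shouldAbort(15, 3)); // false: normal cleanup on a shared server
console.log(shouldAbort(0, 0)); // false: nothing to clean
```

Note that, as written, the condition only fires when every process is targeted. A filter that matched 14 of 15 processes, for example, would not trigger the abort.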
#### Layer 4: Explicit Name Verification
Logs the exact name, status, and PM2 ID of each process that will be deleted.
**Purpose**: Provides clear visibility into what the cleanup operation will actually do.
```javascript
console.log('Found ' + targetProcesses.length + ' PRODUCTION processes to clean:');
targetProcesses.forEach((p) => {
  console.log(
    ' - ' + p.name + ' (status: ' + p.pm2_env.status + ', pm_id: ' + p.pm2_env.pm_id + ')',
  );
});
```
#### Layer 5: Post-Cleanup Verification
After cleanup, logs the state of processes by environment to verify isolation was maintained.
**Purpose**: Immediately identifies if the cleanup affected the wrong environment.
```bash
echo "=== POST-CLEANUP VERIFICATION ==="
pm2 jlist | node -e "
  const list = JSON.parse(require('fs').readFileSync(0, 'utf-8'));
  const prodProcesses = list.filter(p => p.name && p.name.startsWith('flyer-crawler-') && !p.name.endsWith('-test'));
  const testProcesses = list.filter(p => p.name && p.name.endsWith('-test'));
  console.log('Production processes after cleanup: ' + prodProcesses.length);
  console.log('Test processes (should be untouched): ' + testProcesses.length);
"
echo "=== END POST-CLEANUP VERIFICATION ==="
```
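One way to make the pre-cleanup and post-cleanup logs directly comparable is to diff the two snapshots by process name. The sketch below is not part of the workflows: `findUnexpectedRemovals`, the process names, and the sample snapshots are hypothetical, and it assumes both inputs have the shape of parsed `pm2 jlist` output.

```javascript
// Hypothetical helper: reports processes that disappeared between two
// snapshots but were not on the list of permitted cleanup targets.
function findUnexpectedRemovals(before, after, allowedNames) {
  const remaining = new Set(after.map((p) => p.name));
  return before
    .map((p) => p.name)
    .filter((name) => !remaining.has(name) && !allowedNames.includes(name));
}

// Illustrative snapshots: the production API was cleaned up as intended,
// but an unrelated application also vanished.
const before = [
  { name: 'flyer-crawler-api' },
  { name: 'flyer-crawler-api-test' },
  { name: 'stock-alert' },
];
const after = [{ name: 'flyer-crawler-api-test' }];

const unexpected = findUnexpectedRemovals(before, after, ['flyer-crawler-api']);
console.log(unexpected); // [ 'stock-alert' ]
```

A non-empty result would indicate that the cleanup touched something outside its intended scope, which is the condition this layer is meant to surface.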
---
## Implementation Details
### Files Modified
| File | Changes |
| ------------------------------------------ | --------------------------------------------- |
| `.gitea/workflows/deploy-to-prod.yml` | Added all 5 safeguard layers |
| `.gitea/workflows/deploy-to-test.yml` | Added all 5 safeguard layers |
| `.gitea/workflows/manual-deploy-major.yml` | Added all 5 safeguard layers |
| `CLAUDE.md` | Added PM2 Process Isolation Incidents section |
### Files Created
| File | Purpose |
| --------------------------------------------------------- | --------------------------------------- |
| `docs/operations/INCIDENT-2026-02-17-PM2-PROCESS-KILL.md` | Detailed incident report |
| `docs/operations/PM2-INCIDENT-RESPONSE.md` | Comprehensive incident response runbook |
| `tests/qa/test-pm2-safeguard-logic.js` | Validation tests for safeguard logic |
---
## Testing and Validation
### Test Artifact
A standalone JavaScript test file was created to validate the safeguard logic:
**File**: `tests/qa/test-pm2-safeguard-logic.js`
**Test Categories**:
1. **Normal Operations (should NOT abort)**
- 3 errored out of 15 processes
- 1 errored out of 10 processes
- 0 processes to clean
- Fresh server with 3 processes (threshold boundary)
2. **Dangerous Operations (SHOULD abort)**
- All 10 processes targeted
- All 15 processes targeted
- All 4 processes targeted (just above threshold)
3. **Workflow-Specific Filter Tests**
- Production filter only matches production processes
- Test filter only matches `-test` suffix processes
- Filters don't cross-contaminate environments
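The cross-contamination check in the third category can be illustrated as follows. This is a sketch under assumptions, not the contents of `tests/qa/test-pm2-safeguard-logic.js`: the predicates reuse the name-based rules from the post-cleanup verification snippet, while `isProd`, `isTest`, and the sample process names are hypothetical.

```javascript
// Name-based environment predicates, following the rules used in the
// post-cleanup verification step (prefix for production, suffix for test).
const isProd = (p) =>
  Boolean(p.name) && p.name.startsWith('flyer-crawler-') && !p.name.endsWith('-test');
const isTest = (p) => Boolean(p.name) && p.name.endsWith('-test');

// Illustrative process list mixing production, test, and an unrelated app.
const sample = [
  { name: 'flyer-crawler-api' },
  { name: 'flyer-crawler-worker' },
  { name: 'flyer-crawler-api-test' },
  { name: 'stock-alert' },
];

const prodNames = sample.filter(isProd).map((p) => p.name);
const testNames = sample.filter(isTest).map((p) => p.name);
const overlap = prodNames.filter((name) => testNames.includes(name));

console.log(prodNames); // [ 'flyer-crawler-api', 'flyer-crawler-worker' ]
console.log(testNames); // [ 'flyer-crawler-api-test' ]
console.log(overlap); // []
```

An empty `overlap` means no process is claimed by both environments, and the unrelated application is matched by neither filter.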
### Test Results
All 11 scenarios passed:
| Scenario | Total | Target | Expected | Result |
| -------------------------- | ----- | ------ | -------- | ------ |
| Normal prod cleanup | 15 | 3 | No abort | PASS |
| Normal test cleanup | 15 | 3 | No abort | PASS |
| Single process | 10 | 1 | No abort | PASS |
| No cleanup needed | 10 | 0 | No abort | PASS |
| Fresh server (threshold) | 3 | 3 | No abort | PASS |
| Minimal server | 2 | 2 | No abort | PASS |
| Empty PM2 | 0 | 0 | No abort | PASS |
| Filter bug - 10 processes | 10 | 10 | ABORT | PASS |
| Filter bug - 15 processes | 15 | 15 | ABORT | PASS |
| Filter bug - 4 processes | 4 | 4 | ABORT | PASS |
| Filter bug - 100 processes | 100 | 100 | ABORT | PASS |
### YAML Validation
All workflow files passed YAML syntax validation using `python -c "import yaml; yaml.safe_load(open(...))"`.
---
## Documentation Updates
### CLAUDE.md Updates
Added new section at line 293: **PM2 Process Isolation Incidents**
Contains:
- Reference to the 2026-02-17 incident
- Impact summary
- Prevention measures list
- Response instructions
- Links to related documentation
### docs/README.md
Added incident report reference under **Operations > Incident Reports**.
### Cross-References Verified
| Document | Reference | Status |
| --------------- | --------------------------------------- | ------ |
| CLAUDE.md | PM2-INCIDENT-RESPONSE.md | Valid |
| CLAUDE.md | INCIDENT-2026-02-17-PM2-PROCESS-KILL.md | Valid |
| Incident Report | CLAUDE.md PM2 section | Valid |
| Incident Report | PM2-INCIDENT-RESPONSE.md | Valid |
| docs/README.md | INCIDENT-2026-02-17-PM2-PROCESS-KILL.md | Valid |
---
## Lessons Learned
### Technical Lessons
1. **Filter logic alone is not sufficient** - Even correct filters can be bypassed if an older version of the script is executed.
2. **Workflow caching is a real risk** - CI/CD runners may cache workflow files, leading to stale versions being executed.
3. **Defense-in-depth is essential for destructive operations** - Multiple layers of validation catch failures that single-point checks miss.
4. **Visibility enables diagnosis** - Pre/post state logging makes root cause analysis possible.
5. **Automatic abort prevents cascading failures** - The process count validation could have prevented the incident entirely, provided the runner executed a workflow version that contained it.
### Process Lessons
1. **Shared PM2 daemons are risky** - Multiple applications sharing a PM2 daemon create cross-application dependencies.
2. **Documentation should include failure modes** - CLAUDE.md now explicitly documents what can go wrong and how to respond.
3. **Runbooks save time during incidents** - The incident response runbook provides step-by-step guidance when time is critical.
---
## Future Considerations
### Not Implemented (Potential Future Work)
1. **PM2 Namespacing** - Use PM2's native namespace feature to completely isolate environments.
2. **Separate PM2 Daemons** - Run one PM2 daemon per application to eliminate cross-application risk.
3. **Deployment Locks** - Implement mutex-style locks to prevent concurrent deployments.
4. **Workflow Version Verification** - Add a pre-flight check that compares workflow hash against expected value.
5. **Automated Rollback** - Implement automatic process restoration if safeguards detect a problem.
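As an illustration of item 4, a pre-flight check could recompute the workflow file's SHA-256 hash and compare it with an expected value before any destructive step runs. This is a sketch only: `sha256OfFile` and `verifyWorkflowHash` are hypothetical names, the demo uses a temporary file rather than a real workflow, and where the expected hash would come from is left open.

```javascript
const crypto = require('crypto');
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hex SHA-256 of the file's bytes (the same digest `sha256sum` prints).
function sha256OfFile(filePath) {
  return crypto.createHash('sha256').update(fs.readFileSync(filePath)).digest('hex');
}

// Hypothetical pre-flight check: compares the on-disk hash with an
// expected value supplied by the caller.
function verifyWorkflowHash(filePath, expectedHash) {
  const actual = sha256OfFile(filePath);
  return { ok: actual === expectedHash, actual };
}

// Demo on a temporary file standing in for a workflow file.
const tmpDir = fs.mkdtempSync(path.join(os.tmpdir(), 'wf-'));
const tmpFile = path.join(tmpDir, 'deploy.yml');
fs.writeFileSync(tmpFile, 'name: demo\n');
const expected = sha256OfFile(tmpFile);
console.log(verifyWorkflowHash(tmpFile, expected).ok); // true

// Simulate a stale or modified file: the check now fails.
fs.writeFileSync(tmpFile, 'name: stale\n');
console.log(verifyWorkflowHash(tmpFile, expected).ok); // false
```

In a deployment, a failed check would be a reason to stop before cleanup, in the same spirit as the existing safety abort.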
---
## Related Documentation
- **ADR-061**: [PM2 Process Isolation Safeguards](../../adr/0061-pm2-process-isolation-safeguards.md)
- **Incident Report**: [INCIDENT-2026-02-17-PM2-PROCESS-KILL.md](../../operations/INCIDENT-2026-02-17-PM2-PROCESS-KILL.md)
- **Response Runbook**: [PM2-INCIDENT-RESPONSE.md](../../operations/PM2-INCIDENT-RESPONSE.md)
- **CLAUDE.md Section**: [PM2 Process Isolation Incidents](../../../CLAUDE.md#pm2-process-isolation-incidents)
- **Test Artifact**: [test-pm2-safeguard-logic.js](../../../tests/qa/test-pm2-safeguard-logic.js)
- **ADR-014**: [Containerization and Deployment Strategy](../../adr/0014-containerization-and-deployment-strategy.md)
---
## Appendix: Workflow Changes Summary
### deploy-to-prod.yml
```diff
+      - name: Log Workflow Metadata
+        run: |
+          echo "=== WORKFLOW METADATA ==="
+          echo "Workflow file: deploy-to-prod.yml"
+          echo "Workflow file hash: $(sha256sum .gitea/workflows/deploy-to-prod.yml | cut -d' ' -f1)"
+          ...
       - name: Install Backend Dependencies and Restart Production Server
         run: |
+          # === PRE-CLEANUP PM2 STATE LOGGING ===
+          echo "=== PRE-CLEANUP PM2 STATE ==="
+          pm2 jlist
+          echo "=== END PRE-CLEANUP STATE ==="
+
           # --- Cleanup Errored Processes with Defense-in-Depth Safeguards ---
           node -e "
           ...
+          // SAFEGUARD 1: Process count validation
+          if (targetProcesses.length === totalProcesses && totalProcesses > 3) {
+            console.error('SAFETY ABORT: Filter would delete ALL processes!');
+            process.exit(1);
+          }
+
+          // SAFEGUARD 2: Explicit name verification
+          console.log('Found ' + targetProcesses.length + ' PRODUCTION processes to clean:');
+          targetProcesses.forEach(p => {
+            console.log(' - ' + p.name + ' (status: ' + p.pm2_env.status + ')');
+          });
           ...
           "
+
+          # === POST-CLEANUP VERIFICATION ===
+          echo "=== POST-CLEANUP VERIFICATION ==="
+          pm2 jlist | node -e "..."
+          echo "=== END POST-CLEANUP VERIFICATION ==="
```
Similar changes were applied to `deploy-to-test.yml` and `manual-deploy-major.yml`.
---
## Session Participants
| Role | Agent Type | Responsibility |
| ------------ | ------------------------- | ------------------------------------- |
| Orchestrator | Main Claude | Session coordination and delegation |
| Planner | planner subagent | Incident analysis and solution design |
| Documenter | describer-for-ai subagent | Incident report creation |
| Coder #1 | coder subagent | Workflow safeguard implementation |
| Coder #2 | coder subagent | Incident response runbook creation |
| Coder #3 | coder subagent | CLAUDE.md updates |
| Tester | tester subagent | Comprehensive validation |
| Archivist | Lead Technical Archivist | Final documentation |
---
## Revision History
| Date | Author | Change |
| ---------- | ------------------------ | ----------------------- |
| 2026-02-17 | Lead Technical Archivist | Initial session summary |