PM2 Process Isolation Safeguards Project
Session Date: 2026-02-17
Status: Completed
Triggered By: Critical production incident during v0.15.0 deployment
Executive Summary
On 2026-02-17, a critical incident occurred during v0.15.0 production deployment where ALL PM2 processes on the production server were killed, not just the flyer-crawler processes. This caused unplanned downtime for multiple applications including stock-alert.projectium.com.
Despite PM2 process isolation fixes already being in place (commit b6a62a0), the incident still occurred. Investigation suggests the Gitea runner may have executed a cached/older version of the workflow files. In response, we implemented a comprehensive defense-in-depth strategy with 5 layers of safeguards across all deployment workflows.
Incident Background
What Happened
| Aspect | Detail |
|---|---|
| Date/Time | 2026-02-17 ~07:40 UTC |
| Trigger | v0.15.0 production deployment via deploy-to-prod.yml |
| Impact | ALL PM2 processes killed (all environments) |
| Collateral Damage | stock-alert.projectium.com and other PM2-managed apps |
| Severity | P1 - Critical |
Key Mystery
The PM2 process isolation fix was already implemented in commit b6a62a0 (2026-02-13) and was included in v0.15.0. The fix correctly used whitelist-based filtering:
const prodProcesses = [
'flyer-crawler-api',
'flyer-crawler-worker',
'flyer-crawler-analytics-worker',
];
list.forEach((p) => {
if (
(p.pm2_env.status === 'errored' || p.pm2_env.status === 'stopped') &&
prodProcesses.includes(p.name)
) {
exec('pm2 delete ' + p.pm2_env.pm_id);
}
});
Hypothesis: Gitea runner executed a cached older version of the workflow file that did not contain the fix.
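The whitelist filter quoted above can be exercised in isolation to confirm that it never selects processes belonging to other applications. The following is a minimal sketch, not the workflow's actual code: the helper name `selectCleanupTargets`, the sample process list, and the process name `stock-alert` are assumptions made for illustration, and the input is assumed to have the same shape as parsed `pm2 jlist` output.

```javascript
// Hypothetical helper (not from the workflow): returns only processes that are
// both on the whitelist and in a failed state.
function selectCleanupTargets(list, allowedNames) {
  const failedStates = ['errored', 'stopped'];
  return list.filter(
    (p) =>
      p &&
      p.pm2_env &&
      failedStates.includes(p.pm2_env.status) &&
      allowedNames.includes(p.name),
  );
}

const prodProcesses = [
  'flyer-crawler-api',
  'flyer-crawler-worker',
  'flyer-crawler-analytics-worker',
];

// Invented sample data; 'stock-alert' is an assumed process name.
const sampleList = [
  { name: 'flyer-crawler-api', pm2_env: { status: 'errored', pm_id: 0 } },
  { name: 'flyer-crawler-worker', pm2_env: { status: 'online', pm_id: 1 } },
  { name: 'stock-alert', pm2_env: { status: 'stopped', pm_id: 2 } },
];

const targets = selectCleanupTargets(sampleList, prodProcesses);
console.log(targets.map((p) => p.name)); // [ 'flyer-crawler-api' ]
```

In this sketch the stopped `stock-alert` entry is ignored because its name is not whitelisted, and the online worker is ignored because it is not in a failed state.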
Solution: Defense-in-Depth Safeguards
Rather than relying solely on the filter logic (which may be correct but not executed), we implemented 5 layers of safeguards that provide visibility, validation, and automatic abort capabilities.
Safeguard Layers
| Layer | Name | Purpose |
|---|---|---|
| 1 | Workflow Metadata Logging | Audit trail of which workflow version actually executed |
| 2 | Pre-Cleanup PM2 State Logging | Capture full process list before any modifications |
| 3 | Process Count Validation | SAFETY ABORT if filter would delete ALL processes |
| 4 | Explicit Name Verification | Log exactly which processes will be affected |
| 5 | Post-Cleanup Verification | Verify environment isolation after cleanup |
Layer Details
Layer 1: Workflow Metadata Logging
Logs at the start of deployment:
- Workflow file name
- SHA-256 hash of the workflow file
- Git commit being deployed
- Git branch
- Timestamp (UTC)
- Actor (who triggered the deployment)
Purpose: If an incident occurs, we can verify whether the executed workflow matches the repository version.
echo "=== WORKFLOW METADATA ==="
echo "Workflow file: deploy-to-prod.yml"
echo "Workflow file hash: $(sha256sum .gitea/workflows/deploy-to-prod.yml | cut -d' ' -f1)"
echo "Git commit: $(git rev-parse HEAD)"
echo "Timestamp: $(date -u '+%Y-%m-%d %H:%M:%S UTC')"
echo "Actor: ${{ gitea.actor }}"
echo "=== END METADATA ==="
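To illustrate how the logged hash could be used after an incident, here is a small sketch that compares a hash taken from a deployment log with the hash of the workflow file as stored in the repository. The function names `sha256Hex` and `workflowMatches` and the two sample file contents are assumptions for illustration only; they are not part of the implemented workflows.

```javascript
const crypto = require('crypto');

// Returns the hex-encoded SHA-256 digest of a string or Buffer.
function sha256Hex(content) {
  return crypto.createHash('sha256').update(content).digest('hex');
}

// Hypothetical check: does the hash recorded in the deployment log match
// the workflow file content found in the repository?
function workflowMatches(loggedHash, repoFileContent) {
  return loggedHash.trim().toLowerCase() === sha256Hex(repoFileContent);
}

// Invented stand-ins for two versions of a workflow file.
const currentVersion = 'name: deploy\n# with safeguards\n';
const staleVersion = 'name: deploy\n';

// A run that executed the stale file would have logged the stale hash.
const loggedHash = sha256Hex(staleVersion);

console.log(workflowMatches(loggedHash, currentVersion)); // false
console.log(workflowMatches(sha256Hex(currentVersion), currentVersion)); // true
```

For a real comparison, the repository content would be read as raw bytes (for example with `fs.readFileSync(path)` and no encoding argument) so that the digest is computed over the same bytes `sha256sum` sees.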
Layer 2: Pre-Cleanup PM2 State Logging
Captures full PM2 process list in JSON format before any modifications.
Purpose: Provides forensic evidence of what processes existed before cleanup began.
echo "=== PRE-CLEANUP PM2 STATE ==="
pm2 jlist
echo "=== END PRE-CLEANUP STATE ==="
Layer 3: Process Count Validation (SAFETY ABORT)
The most critical safeguard. Aborts the entire deployment if the filter would delete ALL processes and there are more than 3 processes total.
Purpose: Catches filter bugs or unexpected conditions that would result in catastrophic process deletion.
// SAFEGUARD 1: Process count validation
const totalProcesses = list.length;
if (targetProcesses.length === totalProcesses && totalProcesses > 3) {
console.error('SAFETY ABORT: Filter would delete ALL processes!');
console.error(
'Total processes: ' + totalProcesses + ', Target processes: ' + targetProcesses.length,
);
console.error('This indicates a potential filter bug. Aborting cleanup.');
process.exit(1);
}
Threshold Rationale: The threshold of 3 allows normal operation when only the 3 expected processes exist (API, Worker, Analytics Worker) while catching anomalies when the server hosts more applications.
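The abort condition can be expressed as a pure function, which makes the threshold boundary easy to check by hand. This is a sketch under the assumption that the condition is exactly the one shown above; the function name `shouldAbortCleanup` and its `threshold` parameter are hypothetical.

```javascript
// Hypothetical pure-function form of the Layer 3 condition.
// Aborts only when EVERY process is targeted and the server hosts more
// processes than the threshold.
function shouldAbortCleanup(totalProcesses, targetCount, threshold = 3) {
  return targetCount === totalProcesses && totalProcesses > threshold;
}

console.log(shouldAbortCleanup(15, 3)); // false: normal cleanup
console.log(shouldAbortCleanup(3, 3)); // false: fresh server at the threshold
console.log(shouldAbortCleanup(4, 4)); // true: just above the threshold
console.log(shouldAbortCleanup(0, 0)); // false: empty PM2
```

Note that the condition, as written, triggers only when every process is targeted. A filter that matched 9 of 10 processes would not trip it, which is one reason the name logging in Layer 4 remains useful alongside this check.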
Layer 4: Explicit Name Verification
Logs the exact name, status, and PM2 ID of each process that will be deleted.
Purpose: Provides clear visibility into what the cleanup operation will actually do.
console.log('Found ' + targetProcesses.length + ' PRODUCTION processes to clean:');
targetProcesses.forEach((p) => {
console.log(
' - ' + p.name + ' (status: ' + p.pm2_env.status + ', pm_id: ' + p.pm2_env.pm_id + ')',
);
});
Layer 5: Post-Cleanup Verification
After cleanup, logs the state of processes by environment to verify isolation was maintained.
Purpose: Immediately identifies if the cleanup affected the wrong environment.
echo "=== POST-CLEANUP VERIFICATION ==="
pm2 jlist | node -e "
const list = JSON.parse(require('fs').readFileSync(0, 'utf-8'));
const prodProcesses = list.filter(p => p.name && p.name.startsWith('flyer-crawler-') && !p.name.endsWith('-test'));
const testProcesses = list.filter(p => p.name && p.name.endsWith('-test'));
console.log('Production processes after cleanup: ' + prodProcesses.length);
console.log('Test processes (should be untouched): ' + testProcesses.length);
"
echo "=== END POST-CLEANUP VERIFICATION ==="
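The pre-cleanup and post-cleanup snapshots can also be compared directly to flag processes that disappeared without being on the cleanup whitelist. The sketch below is illustrative and is not part of the implemented workflows: the helper `findUnexpectedRemovals`, the sample snapshots, and the names `flyer-crawler-api-test` and `stock-alert` are assumptions, and process names are assumed to be unique within a snapshot.

```javascript
// Hypothetical helper: names that were present before cleanup, are missing
// afterwards, and were NOT on the cleanup whitelist.
function findUnexpectedRemovals(before, after, allowedNames) {
  const remaining = new Set(after.map((p) => p.name));
  return before
    .map((p) => p.name)
    .filter((name) => !remaining.has(name) && !allowedNames.includes(name));
}

const allowed = [
  'flyer-crawler-api',
  'flyer-crawler-worker',
  'flyer-crawler-analytics-worker',
];

// Invented snapshots in the shape of parsed `pm2 jlist` output (name only).
const before = [
  { name: 'flyer-crawler-api' },
  { name: 'flyer-crawler-api-test' },
  { name: 'stock-alert' },
];
const after = [{ name: 'flyer-crawler-api-test' }];

console.log(findUnexpectedRemovals(before, after, allowed)); // [ 'stock-alert' ]
```

Here the removal of `flyer-crawler-api` is expected because it is whitelisted, while the missing `stock-alert` entry is reported as collateral damage.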
Implementation Details
Files Modified
| File | Changes |
|---|---|
| .gitea/workflows/deploy-to-prod.yml | Added all 5 safeguard layers |
| .gitea/workflows/deploy-to-test.yml | Added all 5 safeguard layers |
| .gitea/workflows/manual-deploy-major.yml | Added all 5 safeguard layers |
| CLAUDE.md | Added PM2 Process Isolation Incidents section |
Files Created
| File | Purpose |
|---|---|
| docs/operations/INCIDENT-2026-02-17-PM2-PROCESS-KILL.md | Detailed incident report |
| docs/operations/PM2-INCIDENT-RESPONSE.md | Comprehensive incident response runbook |
| tests/qa/test-pm2-safeguard-logic.js | Validation tests for safeguard logic |
Testing and Validation
Test Artifact
A standalone JavaScript test file was created to validate the safeguard logic:
File: tests/qa/test-pm2-safeguard-logic.js
Test Categories:
- Normal Operations (should NOT abort)
  - 3 errored out of 15 processes
  - 1 errored out of 10 processes
  - 0 processes to clean
  - Fresh server with 3 processes (threshold boundary)
- Dangerous Operations (SHOULD abort)
  - All 10 processes targeted
  - All 15 processes targeted
  - All 4 processes targeted (just above threshold)
- Workflow-Specific Filter Tests
  - Production filter only matches production processes
  - Test filter only matches -test suffix processes
  - Filters don't cross-contaminate environments
Test Results
All 11 scenarios passed:
| Scenario | Total | Target | Expected | Result |
|---|---|---|---|---|
| Normal prod cleanup | 15 | 3 | No abort | PASS |
| Normal test cleanup | 15 | 3 | No abort | PASS |
| Single process | 10 | 1 | No abort | PASS |
| No cleanup needed | 10 | 0 | No abort | PASS |
| Fresh server (threshold) | 3 | 3 | No abort | PASS |
| Minimal server | 2 | 2 | No abort | PASS |
| Empty PM2 | 0 | 0 | No abort | PASS |
| Filter bug - 10 processes | 10 | 10 | ABORT | PASS |
| Filter bug - 15 processes | 15 | 15 | ABORT | PASS |
| Filter bug - 4 processes | 4 | 4 | ABORT | PASS |
| Filter bug - 100 processes | 100 | 100 | ABORT | PASS |
YAML Validation
All workflow files passed YAML syntax validation using python -c "import yaml; yaml.safe_load(open(...))"
Documentation Updates
CLAUDE.md Updates
Added new section at line 293: PM2 Process Isolation Incidents
Contains:
- Reference to the 2026-02-17 incident
- Impact summary
- Prevention measures list
- Response instructions
- Links to related documentation
docs/README.md
Added incident report reference under Operations > Incident Reports.
Cross-References Verified
| Document | Reference | Status |
|---|---|---|
| CLAUDE.md | PM2-INCIDENT-RESPONSE.md | Valid |
| CLAUDE.md | INCIDENT-2026-02-17-PM2-PROCESS-KILL.md | Valid |
| Incident Report | CLAUDE.md PM2 section | Valid |
| Incident Report | PM2-INCIDENT-RESPONSE.md | Valid |
| docs/README.md | INCIDENT-2026-02-17-PM2-PROCESS-KILL.md | Valid |
Lessons Learned
Technical Lessons
- Filter logic alone is not sufficient: Even correct filters can be bypassed if an older version of the script is executed.
- Workflow caching is a real risk: CI/CD runners may cache workflow files, leading to stale versions being executed.
- Defense-in-depth is essential for destructive operations: Multiple layers of validation catch failures that single-point checks miss.
- Visibility enables diagnosis: Pre/post state logging makes root cause analysis possible.
- Automatic abort prevents cascading failures: The process count validation could have prevented the incident entirely.
Process Lessons
- Shared PM2 daemons are risky: Multiple applications sharing a PM2 daemon create cross-application dependencies.
- Documentation should include failure modes: CLAUDE.md now explicitly documents what can go wrong and how to respond.
- Runbooks save time during incidents: The incident response runbook provides step-by-step guidance when time is critical.
Future Considerations
Not Implemented (Potential Future Work)
- PM2 Namespacing: Use PM2's native namespace feature to completely isolate environments.
- Separate PM2 Daemons: Run one PM2 daemon per application to eliminate cross-application risk.
- Deployment Locks: Implement mutex-style locks to prevent concurrent deployments.
- Workflow Version Verification: Add a pre-flight check that compares workflow hash against expected value.
- Automated Rollback: Implement automatic process restoration if safeguards detect a problem.
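As one possible shape for the deployment-lock idea listed above, the sketch below uses an exclusively created lock file as a mutex. This is only an illustration of the technique and has not been implemented in the workflows: the function names `acquireLock` and `releaseLock` and the lock file location are assumptions, and the sketch does not handle stale locks left behind by a crashed deployment.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Tries to take the lock. Returns true on success, false if it is already held.
function acquireLock(lockPath) {
  try {
    // The 'wx' flag creates the file and fails with EEXIST if it already exists.
    const fd = fs.openSync(lockPath, 'wx');
    fs.writeSync(fd, String(process.pid)); // record the holder for diagnostics
    fs.closeSync(fd);
    return true;
  } catch (err) {
    if (err.code === 'EEXIST') return false;
    throw err;
  }
}

// Releases the lock; does nothing if the lock file is already gone.
function releaseLock(lockPath) {
  fs.rmSync(lockPath, { force: true });
}

// Hypothetical lock location, made unique per process for this demo.
const lockPath = path.join(os.tmpdir(), 'flyer-crawler-deploy-' + process.pid + '.lock');

releaseLock(lockPath); // start from a clean state
console.log(acquireLock(lockPath)); // true: first deployment takes the lock
console.log(acquireLock(lockPath)); // false: a concurrent deployment is refused
releaseLock(lockPath);
console.log(acquireLock(lockPath)); // true: lock is available again
releaseLock(lockPath);
```

A real deployment would use a fixed, shared lock path rather than a per-process one, and would release the lock in a cleanup step that runs even when the deployment fails.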
Related Documentation
- ADR-061: PM2 Process Isolation Safeguards
- Incident Report: INCIDENT-2026-02-17-PM2-PROCESS-KILL.md
- Response Runbook: PM2-INCIDENT-RESPONSE.md
- CLAUDE.md Section: PM2 Process Isolation Incidents
- Test Artifact: test-pm2-safeguard-logic.js
- ADR-014: Containerization and Deployment Strategy
Appendix: Workflow Changes Summary
deploy-to-prod.yml
+ - name: Log Workflow Metadata
+ run: |
+ echo "=== WORKFLOW METADATA ==="
+ echo "Workflow file: deploy-to-prod.yml"
+ echo "Workflow file hash: $(sha256sum .gitea/workflows/deploy-to-prod.yml | cut -d' ' -f1)"
+ ...
- name: Install Backend Dependencies and Restart Production Server
run: |
+ # === PRE-CLEANUP PM2 STATE LOGGING ===
+ echo "=== PRE-CLEANUP PM2 STATE ==="
+ pm2 jlist
+ echo "=== END PRE-CLEANUP STATE ==="
+
# --- Cleanup Errored Processes with Defense-in-Depth Safeguards ---
node -e "
...
+ // SAFEGUARD 1: Process count validation
+ if (targetProcesses.length === totalProcesses && totalProcesses > 3) {
+ console.error('SAFETY ABORT: Filter would delete ALL processes!');
+ process.exit(1);
+ }
+
+ // SAFEGUARD 2: Explicit name verification
+ console.log('Found ' + targetProcesses.length + ' PRODUCTION processes to clean:');
+ targetProcesses.forEach(p => {
+ console.log(' - ' + p.name + ' (status: ' + p.pm2_env.status + ')');
+ });
...
"
+
+ # === POST-CLEANUP VERIFICATION ===
+ echo "=== POST-CLEANUP VERIFICATION ==="
+ pm2 jlist | node -e "..."
+ echo "=== END POST-CLEANUP VERIFICATION ==="
Similar changes were applied to deploy-to-test.yml and manual-deploy-major.yml.
Session Participants
| Role | Agent Type | Responsibility |
|---|---|---|
| Orchestrator | Main Claude | Session coordination and delegation |
| Planner | planner subagent | Incident analysis and solution design |
| Documenter | describer-for-ai subagent | Incident report creation |
| Coder #1 | coder subagent | Workflow safeguard implementation |
| Coder #2 | coder subagent | Incident response runbook creation |
| Coder #3 | coder subagent | CLAUDE.md updates |
| Tester | tester subagent | Comprehensive validation |
| Archivist | Lead Technical Archivist | Final documentation |
Revision History
| Date | Author | Change |
|---|---|---|
| 2026-02-17 | Lead Technical Archivist | Initial session summary |