Incident Report: PM2 Process Kill During v0.15.0 Deployment
- Date: 2026-02-17
- Severity: Critical
- Status: Mitigated - Safeguards Implemented
- Affected Systems: All PM2-managed applications on projectium.com server
Resolution Summary
Safeguards implemented on 2026-02-17 to prevent recurrence:
- Workflow metadata logging (audit trail)
- Pre-cleanup PM2 state logging (forensics)
- Process count validation with SAFETY ABORT (automatic prevention; see the sketch at the end of this section)
- Explicit name verification (visibility)
- Post-cleanup verification (environment isolation check)
Documentation created:
- PM2 Incident Response Runbook
- PM2 Safeguards Session Summary
- CLAUDE.md updated with PM2 Process Isolation Incidents section
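For illustration, here is a minimal sketch of how the process count validation, pre/post-cleanup logging, and safety abort could fit together in an inline Node.js workflow step. The whitelist reuses the prod process names shown later in this report; the thresholds, messages, and step placement are assumptions, not the exact implementation.

```javascript
// Sketch only: combines the pre-cleanup logging, explicit name verification,
// process-count SAFETY ABORT, and post-cleanup verification safeguards.
const { execSync } = require('child_process');

const prodProcesses = [
  'flyer-crawler-api',
  'flyer-crawler-worker',
  'flyer-crawler-analytics-worker',
];

const list = () => JSON.parse(execSync('pm2 jlist').toString());

// Pre-cleanup PM2 state logging (forensics).
const before = list();
console.log('PM2 before cleanup:', before.map((p) => `${p.name}=${p.pm2_env.status}`).join(' '));

// Explicit name verification: only whitelisted processes may be touched.
const targets = before.filter((p) => prodProcesses.includes(p.name));

// SAFETY ABORT: never act on more processes than the whitelist allows
// (catches duplicate names, a wrong environment, or a broken filter).
if (targets.length > prodProcesses.length) {
  console.error(`SAFETY ABORT: matched ${targets.length} processes, expected at most ${prodProcesses.length}`);
  process.exit(1);
}

targets.forEach((p) => execSync(`pm2 delete ${p.pm2_env.pm_id}`));

// Post-cleanup verification (environment isolation check): every process that
// was NOT on the whitelist must still exist afterwards.
const after = list();
const missing = before
  .filter((p) => !prodProcesses.includes(p.name))
  .filter((p) => !after.some((q) => q.name === p.name));
if (missing.length > 0) {
  console.error('Non-whitelisted processes disappeared:', missing.map((p) => p.name).join(', '));
  process.exit(1);
}
```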
Summary
During v0.15.0 production deployment, ALL PM2 processes on the server were terminated, not just flyer-crawler processes. This caused unplanned downtime for other applications including stock-alert.
Timeline
| Time (Approx) | Event |
|---|---|
| 2026-02-17 ~07:40 UTC | v0.15.0 production deployment triggered via deploy-to-prod.yml |
| Unknown | All PM2 processes killed (flyer-crawler AND other apps) |
| Unknown | Incident discovered - stock-alert down |
| 2026-02-17 | Investigation initiated |
| 2026-02-17 | Defense-in-depth safeguards implemented in all workflows |
| 2026-02-17 | Incident response runbook created |
| 2026-02-17 | Status changed to Mitigated |
Impact
- Affected Applications: All PM2-managed processes on projectium.com
  - flyer-crawler-api, flyer-crawler-worker, flyer-crawler-analytics-worker (expected)
  - stock-alert (NOT expected - collateral damage)
  - Potentially other unidentified applications
- Downtime Duration: TBD
- User Impact: Service unavailability for all affected applications
Investigation Findings
Deployment Workflow Analysis
All deployment workflows were reviewed for PM2 process isolation:
| Workflow | PM2 Isolation | Implementation |
|---|---|---|
| `deploy-to-prod.yml` | Whitelist | `prodProcesses = ['flyer-crawler-api', 'flyer-crawler-worker', 'flyer-crawler-analytics-worker']` |
| `deploy-to-test.yml` | Pattern | `p.name.endsWith('-test')` |
| `manual-deploy-major.yml` | Whitelist | Same as `deploy-to-prod.yml` |
| `manual-db-restore.yml` | Explicit names | `pm2 stop flyer-crawler-api flyer-crawler-worker flyer-crawler-analytics-worker` |
Fix Commit Already In Place
The PM2 process isolation fix was implemented in commit b6a62a0 (2026-02-13):
```
commit b6a62a036f39ac895271402a61e5cc4227369de7
Author: Torben Sorensen <torben.sorensen@gmail.com>
Date:   Fri Feb 13 10:19:28 2026 -0800

    be specific about pm2 processes
```
Files modified:
- `.gitea/workflows/deploy-to-prod.yml`
- `.gitea/workflows/deploy-to-test.yml`
- `.gitea/workflows/manual-db-restore.yml`
- `.gitea/workflows/manual-deploy-major.yml`
- `CLAUDE.md`
v0.15.0 Release Contains Fix
Confirmed: v0.15.0 (commit 93ad624, 2026-02-18) includes the fix commit:
```
93ad624 ci: Bump version to 0.15.0 for production release [skip ci]
...
b6a62a0 be specific about pm2 processes   <-- Fix commit included
```
Current Workflow PM2 Commands
Production Deploy (`deploy-to-prod.yml` line 170):

```javascript
// Delete only errored/stopped processes whose names are on the prod whitelist.
const prodProcesses = [
  'flyer-crawler-api',
  'flyer-crawler-worker',
  'flyer-crawler-analytics-worker',
];
list.forEach((p) => {
  if (
    (p.pm2_env.status === 'errored' || p.pm2_env.status === 'stopped') &&
    prodProcesses.includes(p.name)
  ) {
    exec('pm2 delete ' + p.pm2_env.pm_id);
  }
});
```
Test Deploy (`deploy-to-test.yml` line 100):

```javascript
// Delete only processes whose names end with the '-test' suffix.
list.forEach((p) => {
  if (p.name && p.name.endsWith('-test')) {
    exec('pm2 delete ' + p.pm2_env.pm_id);
  }
});
```
Both implementations have proper name filtering and should NOT affect non-flyer-crawler processes.
Discrepancy Analysis
Key Mystery
If the fixes are in place, why did ALL processes get killed?
Possible Explanations
1. Workflow Version Mismatch (HIGH PROBABILITY)
Hypothesis: Gitea runner cached an older version of the workflow file.
- Gitea Actions may cache workflow definitions
- The runner might have executed an older version without the fix
- Need to verify: What version of `deploy-to-prod.yml` actually executed?
Investigation Required:
- Check Gitea workflow execution logs for actual script content
- Verify runner workflow caching behavior
- Compare executed workflow vs repository version
2. Concurrent Workflow Execution (MEDIUM PROBABILITY)
Hypothesis: Another workflow ran simultaneously with destructive PM2 commands.
Workflows with potential issues:
- `manual-db-reset-prod.yml` - Does NOT restart PM2 (schema reset only)
- `manual-redis-flush-prod.yml` - Does NOT touch PM2
- Test deployment concurrent with prod deployment
Investigation Required:
- Check Gitea Actions history for concurrent workflow runs
- Review timestamps of all workflow executions on 2026-02-17
3. Manual SSH Command (MEDIUM PROBABILITY)
Hypothesis: Someone SSH'd to the server and ran `pm2 stop all` or `pm2 delete all` manually.
Investigation Required:
- Check server shell history (if available)
- Review any maintenance windows or manual interventions
- Ask team members about manual actions
4. PM2 Internal Issue (LOW PROBABILITY)
Hypothesis: PM2 daemon crash or corruption caused all processes to stop.
Investigation Required:
- Check PM2 daemon logs on server
- Look for OOM killer events in system logs
- Check disk space issues during deployment
5. Script Execution Error (LOW PROBABILITY)
Hypothesis: JavaScript parsing error caused the filtering logic to be bypassed.
Investigation Required:
- Review workflow execution logs for JavaScript errors
- Test the inline Node.js scripts locally (see the dry-run sketch below)
- Check for shell escaping issues
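One low-risk way to test the inline script is a dry-run harness that replays a saved `pm2 jlist` snapshot through the same filter used by the prod workflow. The harness below is hypothetical (capture the snapshot with `pm2 jlist > pm2-state.json` on the server; the file name is an assumption); it prints what the script would delete without running any pm2 commands:

```javascript
// Hypothetical dry-run harness: reads a saved `pm2 jlist` snapshot and applies
// the same whitelist/status filter as deploy-to-prod.yml, printing decisions.
const fs = require('fs');

const prodProcesses = [
  'flyer-crawler-api',
  'flyer-crawler-worker',
  'flyer-crawler-analytics-worker',
];

const list = JSON.parse(fs.readFileSync(process.argv[2] || 'pm2-state.json', 'utf8'));

list.forEach((p) => {
  const wouldDelete =
    (p.pm2_env.status === 'errored' || p.pm2_env.status === 'stopped') &&
    prodProcesses.includes(p.name);
  console.log(`${wouldDelete ? 'DELETE' : 'keep'}\t${p.name} (${p.pm2_env.status})`);
});
```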
Documentation/Code Gaps Identified
CLAUDE.md Documentation
The PM2 isolation rules are documented in CLAUDE.md, but:
- Documentation uses `pm2 restart all` in the Quick Reference table (for dev container - acceptable)
- Multiple docs still reference `pm2 restart all` without environment context
- No incident response runbook for PM2 issues
Workflow Gaps
- No Workflow Audit Trail: No logging of which exact workflow version executed
- No Pre-deployment Verification: Workflows don't log PM2 state before modifications
- No Cross-Application Impact Assessment: No mechanism to detect/warn about other apps
Next Steps for Root Cause Analysis
Immediate (Priority 1)
- Retrieve Gitea Actions execution logs for v0.15.0 deployment
- Extract actual executed workflow content from logs
- Check for concurrent workflow executions on 2026-02-17
- Review server PM2 daemon logs around incident time
Short-term (Priority 2)
- Implement pre-deployment PM2 state logging in workflows (see the sketch after this list)
- Add workflow version hash logging for audit trail
- Create incident response runbook for PM2/deployment issues
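A minimal sketch of the first two items above, assuming the workflows keep using inline Node.js steps; the workflow file path and log format here are illustrative:

```javascript
// Audit-trail sketch: log a hash of the checked-out workflow definition and the
// full PM2 state before any cleanup or restart commands run.
const crypto = require('crypto');
const fs = require('fs');
const { execSync } = require('child_process');

// Workflow version hash (of the checked-out copy; comparing it against the
// version the runner actually executed remains a manual step).
const workflowFile = '.gitea/workflows/deploy-to-prod.yml';
const hash = crypto.createHash('sha256').update(fs.readFileSync(workflowFile)).digest('hex');
console.log(`workflow=${workflowFile} sha256=${hash}`);

// Pre-deployment PM2 state: everything the daemon currently manages.
const state = JSON.parse(execSync('pm2 jlist').toString());
state.forEach((p) => console.log(`pm2 pre-deploy: ${p.name} pid=${p.pid} status=${p.pm2_env.status}`));
```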
Long-term (Priority 3)
- Evaluate PM2 namespacing for complete process isolation (see the ecosystem config sketch below)
- Consider separate PM2 daemon per application
- Implement deployment monitoring/alerting
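As a rough illustration of the namespacing option, a hypothetical `ecosystem.config.js` using PM2's namespace attribute (process names, script paths, and the exact namespace-scoped commands are assumptions to verify against the PM2 documentation for the installed version):

```javascript
// Hypothetical ecosystem.config.js grouping the flyer-crawler processes under
// one PM2 namespace so lifecycle commands can be scoped to this application.
module.exports = {
  apps: [
    {
      name: 'flyer-crawler-api',
      namespace: 'flyer-crawler', // namespace support requires a recent PM2 (4.x+)
      script: 'dist/server.js',   // illustrative path
    },
    {
      name: 'flyer-crawler-worker',
      namespace: 'flyer-crawler',
      script: 'dist/worker.js',   // illustrative path
    },
  ],
};

// With namespaces in place, commands such as `pm2 restart flyer-crawler` or
// `pm2 delete flyer-crawler` should act only on this namespace, leaving
// processes like stock-alert untouched.
```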
Related Documentation
- CLAUDE.md - PM2 Process Isolation (Critical Rules section)
- ADR-014: Containerization and Deployment Strategy
- Deployment Guide
- Workflow files in `.gitea/workflows/`
Appendix: Commit Timeline
```
93ad624 ci: Bump version to 0.15.0 for production release [skip ci]   <-- v0.15.0 release
7dd4f21 ci: Bump version to 0.14.4 [skip ci]
174b637 even more typescript fixes
4f80baf ci: Bump version to 0.14.3 [skip ci]
8450b5e Generate TSOA Spec and Routes
e4d830a ci: Bump version to 0.14.2 [skip ci]
b6a62a0 be specific about pm2 processes   <-- PM2 fix commit
2d2cd52 Massive Dependency Modernization Project
```
Revision History
| Date | Author | Change |
|---|---|---|
| 2026-02-17 | Investigation Team | Initial incident report |