# Incident Report: PM2 Process Kill During v0.15.0 Deployment

**Date**: 2026-02-17

**Severity**: Critical

**Status**: Mitigated - Safeguards Implemented

**Affected Systems**: All PM2-managed applications on projectium.com server

---
## Resolution Summary
**Safeguards implemented on 2026-02-17** to prevent recurrence:
1. Workflow metadata logging (audit trail)
2. Pre-cleanup PM2 state logging (forensics)
3. Process count validation with SAFETY ABORT (automatic prevention)
4. Explicit name verification (visibility)
5. Post-cleanup verification (environment isolation check)
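
Safeguards 3 and 4 can be illustrated with a small guard that runs before any `pm2 delete`. This is a minimal sketch, not the code in the workflows: the function name `selectTargetsOrAbort`, the `isTarget` predicate, and the `maxExpected` threshold are all hypothetical, and `list` is assumed to be the parsed output of `pm2 jlist`.

```javascript
// Hypothetical pre-cleanup guard (illustrative only, not the workflow code).
// list:        parsed `pm2 jlist` output (array of process descriptors)
// isTarget:    predicate deciding whether a process belongs to this deployment
// maxExpected: the largest number of processes this application should ever own
function selectTargetsOrAbort(list, isTarget, maxExpected) {
  const targets = list.filter((p) => isTarget(p));

  // Print the exact names that would be touched, for visibility in the job log.
  console.log('PM2 cleanup targets:', targets.map((p) => p.name).join(', ') || '(none)');

  // Refuse to continue if the filter matched more than this app can own.
  if (targets.length > maxExpected) {
    throw new Error(
      `SAFETY ABORT: ${targets.length} processes matched, expected at most ${maxExpected}`
    );
  }
  return targets;
}
```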
**Documentation created**:
- [PM2 Incident Response Runbook](PM2-INCIDENT-RESPONSE.md)
- [PM2 Safeguards Session Summary](../archive/sessions/PM2_SAFEGUARDS_SESSION_2026-02-17.md)
- CLAUDE.md updated with [PM2 Process Isolation Incidents section](../../CLAUDE.md#pm2-process-isolation-incidents)
---
## Summary
During v0.15.0 production deployment, ALL PM2 processes on the server were terminated, not just flyer-crawler processes. This caused unplanned downtime for other applications including stock-alert.
## Timeline
| Time (Approx) | Event |
| --------------------- | ---------------------------------------------------------------- |
| 2026-02-17 ~07:40 UTC | v0.15.0 production deployment triggered via `deploy-to-prod.yml` |
| Unknown | All PM2 processes killed (flyer-crawler AND other apps) |
| Unknown | Incident discovered - stock-alert down |
| 2026-02-17 | Investigation initiated |
| 2026-02-17 | Defense-in-depth safeguards implemented in all workflows |
| 2026-02-17 | Incident response runbook created |
| 2026-02-17 | Status changed to Mitigated |
## Impact
- **Affected Applications**: All PM2-managed processes on projectium.com
  - flyer-crawler-api, flyer-crawler-worker, flyer-crawler-analytics-worker (expected)
  - stock-alert (NOT expected - collateral damage)
  - Potentially other unidentified applications
- **Downtime Duration**: TBD
- **User Impact**: Service unavailability for all affected applications
---
## Investigation Findings
### Deployment Workflow Analysis
All deployment workflows were reviewed for PM2 process isolation:
| Workflow | PM2 Isolation | Implementation |
| ------------------------- | -------------- | ------------------------------------------------------------------------------------------------- |
| `deploy-to-prod.yml` | Whitelist | `prodProcesses = ['flyer-crawler-api', 'flyer-crawler-worker', 'flyer-crawler-analytics-worker']` |
| `deploy-to-test.yml` | Pattern | `p.name.endsWith('-test')` |
| `manual-deploy-major.yml` | Whitelist | Same as deploy-to-prod |
| `manual-db-restore.yml` | Explicit names | `pm2 stop flyer-crawler-api flyer-crawler-worker flyer-crawler-analytics-worker` |
### Fix Commit Already In Place
The PM2 process isolation fix was implemented in commit `b6a62a0` (2026-02-13):
```
commit b6a62a036f39ac895271402a61e5cc4227369de7
Author: Torben Sorensen <torben.sorensen@gmail.com>
Date:   Fri Feb 13 10:19:28 2026 -0800

    be specific about pm2 processes

Files modified:
  .gitea/workflows/deploy-to-prod.yml
  .gitea/workflows/deploy-to-test.yml
  .gitea/workflows/manual-db-restore.yml
  .gitea/workflows/manual-deploy-major.yml
  CLAUDE.md
```
### v0.15.0 Release Contains Fix
Confirmed: v0.15.0 (commit `93ad624`, 2026-02-18) includes the fix commit:
```
93ad624 ci: Bump version to 0.15.0 for production release [skip ci]
...
b6a62a0 be specific about pm2 processes <-- Fix commit included
```
### Current Workflow PM2 Commands
**Production Deploy (`deploy-to-prod.yml` line 170)**:
```javascript
const prodProcesses = [
  'flyer-crawler-api',
  'flyer-crawler-worker',
  'flyer-crawler-analytics-worker',
];
list.forEach((p) => {
  if (
    (p.pm2_env.status === 'errored' || p.pm2_env.status === 'stopped') &&
    prodProcesses.includes(p.name)
  ) {
    exec('pm2 delete ' + p.pm2_env.pm_id);
  }
});
```
**Test Deploy (`deploy-to-test.yml` line 100)**:
```javascript
list.forEach((p) => {
  if (p.name && p.name.endsWith('-test')) {
    exec('pm2 delete ' + p.pm2_env.pm_id);
  }
});
```
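Both implementations filter by process name. The production whitelist matches only the three flyer-crawler process names, so it should NOT affect non-flyer-crawler processes. The test filter matches any process whose name ends in `-test`, so it is safe only as long as no other application on the server uses that suffix.
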
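
The suffix-only test filter shown above would also match another application's process if its name happened to end in `-test`. A tighter predicate might look like the sketch below. It assumes test processes carry a `flyer-crawler-` prefix (for example `flyer-crawler-api-test`); that naming is an assumption, since the actual test process names are not listed in this report, and `isFlyerCrawlerTestProcess` is a hypothetical helper.

```javascript
// Assumed naming convention: "flyer-crawler-<role>-test". Adjust if the real
// test process names differ; they are not shown in this report.
const APP_PREFIX = 'flyer-crawler-';
const TEST_SUFFIX = '-test';

// Returns true only for processes that match BOTH the application prefix and
// the test suffix, so another app's "-test" process is never selected.
function isFlyerCrawlerTestProcess(p) {
  return (
    typeof p.name === 'string' &&
    p.name.startsWith(APP_PREFIX) &&
    p.name.endsWith(TEST_SUFFIX)
  );
}
```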
---
## Discrepancy Analysis
### Key Mystery
**If the fixes are in place, why did ALL processes get killed?**
### Possible Explanations
#### 1. Workflow Version Mismatch (HIGH PROBABILITY)
**Hypothesis**: Gitea runner cached an older version of the workflow file.
- Gitea Actions may cache workflow definitions
- The runner might have executed an older version without the fix
- Need to verify: What version of `deploy-to-prod.yml` actually executed?
**Investigation Required**:
- Check Gitea workflow execution logs for actual script content
- Verify runner workflow caching behavior
- Compare executed workflow vs repository version
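
One way to make the "executed vs repository version" comparison possible in future runs is to log a content hash of the workflow file at the start of each job. The sketch below uses Node's built-in `crypto` module. The helper names `workflowFingerprint` and `auditLine` and the log format are hypothetical, and how the running job obtains the workflow file content is left open.

```javascript
const crypto = require('crypto');

// SHA-256 hex digest of the workflow file content.
function workflowFingerprint(content) {
  return crypto.createHash('sha256').update(content, 'utf8').digest('hex');
}

// Single log line that can later be compared against the hash of the file
// at the same path in the repository. The "[audit]" format is illustrative.
function auditLine(workflowPath, content) {
  return `[audit] ${workflowPath} sha256=${workflowFingerprint(content)}`;
}

// Possible use inside a job step (path taken from this report):
//   const fs = require('fs');
//   const file = '.gitea/workflows/deploy-to-prod.yml';
//   console.log(auditLine(file, fs.readFileSync(file, 'utf8')));
```

If the hash in the job log differs from the hash of the committed file, the runner executed something other than the repository version.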
#### 2. Concurrent Workflow Execution (MEDIUM PROBABILITY)
**Hypothesis**: Another workflow ran simultaneously with destructive PM2 commands.
Workflows reviewed as possible sources:
- `manual-db-reset-prod.yml` - Does NOT restart PM2 (schema reset only)
- `manual-redis-flush-prod.yml` - Does NOT touch PM2
- `deploy-to-test.yml` - A test deployment may have run concurrently with the production deployment
**Investigation Required**:
- Check Gitea Actions history for concurrent workflow runs
- Review timestamps of all workflow executions on 2026-02-17
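
Once the run history has been exported, overlapping executions can be found mechanically. The sketch below assumes each run has been reduced by hand to a record of the form `{ name, start, end }` with times in milliseconds. That record shape and the function `findOverlappingRuns` are hypothetical and do not reflect any Gitea API.

```javascript
// runs: array of { name, start, end } with start/end as millisecond timestamps.
// Returns pairs of run names whose time windows overlap.
function findOverlappingRuns(runs) {
  // Sort a copy by start time so each run only needs to look forward.
  const sorted = [...runs].sort((a, b) => a.start - b.start);
  const overlaps = [];
  for (let i = 0; i < sorted.length; i++) {
    for (let j = i + 1; j < sorted.length; j++) {
      // Every later run starts at or after this one, so once a run starts
      // after sorted[i] has ended, no further overlaps with it are possible.
      if (sorted[j].start >= sorted[i].end) break;
      overlaps.push([sorted[i].name, sorted[j].name]);
    }
  }
  return overlaps;
}
```

A run that starts at the exact moment another ends is treated as non-overlapping here. Use `>` instead of `>=` to flag back-to-back runs as well.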
#### 3. Manual SSH Command (MEDIUM PROBABILITY)
**Hypothesis**: Someone SSH'd to the server and ran `pm2 stop all` or `pm2 delete all` manually.
**Investigation Required**:
- Check server shell history (if available)
- Review any maintenance windows or manual interventions
- Ask team members about manual actions
#### 4. PM2 Internal Issue (LOW PROBABILITY)
**Hypothesis**: PM2 daemon crash or corruption caused all processes to stop.
**Investigation Required**:
- Check PM2 daemon logs on server
- Look for OOM killer events in system logs
- Check disk space issues during deployment
#### 5. Script Execution Error (LOW PROBABILITY)
**Hypothesis**: JavaScript parsing error caused the filtering logic to be bypassed.
**Investigation Required**:
- Review workflow execution logs for JavaScript errors
- Test the inline Node.js scripts locally
- Check for shell escaping issues
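
One way to rule out shell-escaping problems is to avoid building a command string at all. The sketch below validates the process id and passes arguments as an array to `execFileSync`, which does not spawn a shell by default. The helpers `pm2DeleteArgs` and `deleteProcess` are hypothetical. The `p.pm2_env.pm_id` field follows the workflow snippets quoted earlier in this report.

```javascript
const { execFileSync } = require('child_process');

// Build the argument array for `pm2 delete <id>`, refusing anything that is
// not a non-negative integer (for example undefined, or the string "all").
function pm2DeleteArgs(p) {
  const id = p.pm2_env.pm_id;
  if (!Number.isInteger(id) || id < 0) {
    throw new Error('Refusing to delete: invalid pm_id for process ' + p.name);
  }
  return ['delete', String(id)];
}

// Run pm2 without a shell. `run` is injectable so the logic can be exercised
// locally without PM2 installed.
function deleteProcess(p, run = execFileSync) {
  return run('pm2', pm2DeleteArgs(p), { stdio: 'inherit' });
}
```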
---
## Documentation/Code Gaps Identified
### CLAUDE.md Documentation
The PM2 isolation rules are documented in `CLAUDE.md`, but:
- Documentation uses `pm2 restart all` in the Quick Reference table (for dev container - acceptable)
- Multiple docs still reference `pm2 restart all` without environment context
- No incident response runbook for PM2 issues
### Workflow Gaps
1. **No Workflow Audit Trail**: No logging of which exact workflow version executed
2. **No Pre-deployment Verification**: Workflows don't log PM2 state before modifications
3. **No Cross-Application Impact Assessment**: No mechanism to detect/warn about other apps
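
Gaps 2 and 3 could be closed by snapshotting PM2 state before and after the cleanup step and comparing the two. The sketch below assumes both snapshots are parsed `pm2 jlist` output. `findCollateralDamage` and `ownNames` are hypothetical names, and only the `online` status is treated as healthy.

```javascript
// before / after: parsed `pm2 jlist` output captured around the cleanup step.
// ownNames:       process names that this deployment is allowed to touch.
// Returns the names of OTHER applications' processes that were online before
// the cleanup and are no longer online afterwards.
function findCollateralDamage(before, after, ownNames) {
  const onlineAfter = new Set(
    after.filter((p) => p.pm2_env.status === 'online').map((p) => p.name)
  );
  return before
    .filter((p) => !ownNames.includes(p.name)) // ignore this app's own processes
    .filter((p) => p.pm2_env.status === 'online') // only those that were healthy
    .map((p) => p.name)
    .filter((name) => !onlineAfter.has(name)); // missing or not online anymore
}
```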
---
## Next Steps for Root Cause Analysis
### Immediate (Priority 1)
1. [ ] Retrieve Gitea Actions execution logs for v0.15.0 deployment
2. [ ] Extract actual executed workflow content from logs
3. [ ] Check for concurrent workflow executions on 2026-02-17
4. [ ] Review server PM2 daemon logs around incident time
### Short-term (Priority 2)
5. [x] Implement pre-deployment PM2 state logging in workflows
6. [ ] Add workflow version hash logging for audit trail
7. [x] Create incident response runbook for PM2/deployment issues
### Long-term (Priority 3)
8. [ ] Evaluate PM2 namespacing for complete process isolation
9. [ ] Consider separate PM2 daemon per application
10. [ ] Implement deployment monitoring/alerting
---
## Related Documentation
- [CLAUDE.md - PM2 Process Isolation](../../CLAUDE.md) (Critical Rules section)
- [ADR-014: Containerization and Deployment Strategy](../adr/0014-containerization-and-deployment-strategy.md)
- [Deployment Guide](./DEPLOYMENT.md)
- Workflow files in `.gitea/workflows/`
---
## Appendix: Commit Timeline
```
93ad624 ci: Bump version to 0.15.0 for production release [skip ci] <-- v0.15.0 release
7dd4f21 ci: Bump version to 0.14.4 [skip ci]
174b637 even more typescript fixes
4f80baf ci: Bump version to 0.14.3 [skip ci]
8450b5e Generate TSOA Spec and Routes
e4d830a ci: Bump version to 0.14.2 [skip ci]
b6a62a0 be specific about pm2 processes <-- PM2 fix commit
2d2cd52 Massive Dependency Modernization Project
```
---
## Revision History
| Date | Author | Change |
| ---------- | ------------------ | ----------------------- |
| 2026-02-17 | Investigation Team | Initial incident report |