# Incident Report: PM2 Process Kill During v0.15.0 Deployment

**Date**: 2026-02-17
**Severity**: Critical
**Status**: Mitigated - Safeguards Implemented
**Affected Systems**: All PM2-managed applications on projectium.com server

---

## Resolution Summary

**Safeguards implemented on 2026-02-17** to prevent recurrence:

1. Workflow metadata logging (audit trail)
2. Pre-cleanup PM2 state logging (forensics)
3. Process count validation with SAFETY ABORT (automatic prevention)
4. Explicit name verification (visibility)
5. Post-cleanup verification (environment isolation check)

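
Safeguard 3 is listed without detail. As a rough illustration only, a count check of this kind could look like the sketch below. The function name `planCleanup`, the `maxExpected` threshold, and the return shape are assumptions made for this example, not the code that runs in the workflows.

```javascript
// Hypothetical helper: decide whether a PM2 cleanup is safe to run.
// `list` mimics parsed PM2 process entries; only `name` is used here.
// `isTarget` is the name filter under test; `maxExpected` is the largest
// number of processes this deployment should ever touch.
function planCleanup(list, isTarget, maxExpected) {
  const targets = list.filter(isTarget);

  if (targets.length > maxExpected) {
    // Fail closed: report the problem and return no targets at all.
    const scope = targets.length === list.length ? 'ALL' : 'too many';
    return {
      ok: false,
      targets: [],
      reason:
        `SAFETY ABORT: filter matched ${scope} processes ` +
        `(${targets.length} of ${list.length}, expected at most ${maxExpected})`,
    };
  }

  return {
    ok: true,
    targets: targets.map((p) => p.name),
    reason: 'within expected bounds',
  };
}
```
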
**Documentation created**:

- [PM2 Incident Response Runbook](PM2-INCIDENT-RESPONSE.md)
- [PM2 Safeguards Session Summary](../archive/sessions/PM2_SAFEGUARDS_SESSION_2026-02-17.md)
- CLAUDE.md updated with [PM2 Process Isolation Incidents section](../../CLAUDE.md#pm2-process-isolation-incidents)

---

## Summary

During the v0.15.0 production deployment, ALL PM2 processes on the server were terminated, not just the flyer-crawler processes. This caused unplanned downtime for other applications, including stock-alert.

## Timeline

| Time (Approx)         | Event                                                            |
| --------------------- | ---------------------------------------------------------------- |
| 2026-02-17 ~07:40 UTC | v0.15.0 production deployment triggered via `deploy-to-prod.yml` |
| Unknown               | All PM2 processes killed (flyer-crawler AND other apps)          |
| Unknown               | Incident discovered - stock-alert down                           |
| 2026-02-17            | Investigation initiated                                          |
| 2026-02-17            | Defense-in-depth safeguards implemented in all workflows         |
| 2026-02-17            | Incident response runbook created                                |
| 2026-02-17            | Status changed to Mitigated                                      |

## Impact

- **Affected Applications**: All PM2-managed processes on projectium.com
  - flyer-crawler-api, flyer-crawler-worker, flyer-crawler-analytics-worker (expected)
  - stock-alert (NOT expected - collateral damage)
  - Potentially other unidentified applications
- **Downtime Duration**: TBD
- **User Impact**: Service unavailability for all affected applications

---

## Investigation Findings

### Deployment Workflow Analysis

All deployment workflows were reviewed for PM2 process isolation:

| Workflow                  | PM2 Isolation  | Implementation                                                                                    |
| ------------------------- | -------------- | ------------------------------------------------------------------------------------------------- |
| `deploy-to-prod.yml`      | Whitelist      | `prodProcesses = ['flyer-crawler-api', 'flyer-crawler-worker', 'flyer-crawler-analytics-worker']` |
| `deploy-to-test.yml`      | Pattern        | `p.name.endsWith('-test')`                                                                        |
| `manual-deploy-major.yml` | Whitelist      | Same as deploy-to-prod                                                                            |
| `manual-db-restore.yml`   | Explicit names | `pm2 stop flyer-crawler-api flyer-crawler-worker flyer-crawler-analytics-worker`                  |

### Fix Commit Already In Place

The PM2 process isolation fix was implemented in commit `b6a62a0` (2026-02-13):

```
commit b6a62a036f39ac895271402a61e5cc4227369de7
Author: Torben Sorensen <torben.sorensen@gmail.com>
Date:   Fri Feb 13 10:19:28 2026 -0800

    be specific about pm2 processes

Files modified:
  .gitea/workflows/deploy-to-prod.yml
  .gitea/workflows/deploy-to-test.yml
  .gitea/workflows/manual-db-restore.yml
  .gitea/workflows/manual-deploy-major.yml
  CLAUDE.md
```

### v0.15.0 Release Contains Fix

Confirmed: v0.15.0 (commit `93ad624`, 2026-02-18) includes the fix commit:

```
93ad624 ci: Bump version to 0.15.0 for production release [skip ci]
...
b6a62a0 be specific about pm2 processes  <-- Fix commit included
```

### Current Workflow PM2 Commands

**Production Deploy (`deploy-to-prod.yml` line 170)**:

```javascript
const prodProcesses = [
  'flyer-crawler-api',
  'flyer-crawler-worker',
  'flyer-crawler-analytics-worker',
];
list.forEach((p) => {
  if (
    (p.pm2_env.status === 'errored' || p.pm2_env.status === 'stopped') &&
    prodProcesses.includes(p.name)
  ) {
    exec('pm2 delete ' + p.pm2_env.pm_id);
  }
});
```

**Test Deploy (`deploy-to-test.yml` line 100)**:

```javascript
list.forEach((p) => {
  if (p.name && p.name.endsWith('-test')) {
    exec('pm2 delete ' + p.pm2_env.pm_id);
  }
});
```

Both implementations have proper name filtering and should NOT affect non-flyer-crawler processes.

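
To check the filtering claim without a PM2 daemon, the two filters can be restated as pure functions and run against a sample list. This is only a sketch: `selectProdCleanup`, `selectTestCleanup`, and the sample data (including the names `flyer-crawler-api-test` and `other-app-test`) are hypothetical. It also surfaces one caveat: the `-test` suffix filter matches any process whose name ends in `-test`, whichever application owns it.

```javascript
const prodProcesses = [
  'flyer-crawler-api',
  'flyer-crawler-worker',
  'flyer-crawler-analytics-worker',
];

// Same condition as the production snippet, but returns PM2 ids instead of
// calling `pm2 delete`.
function selectProdCleanup(list) {
  return list
    .filter(
      (p) =>
        (p.pm2_env.status === 'errored' || p.pm2_env.status === 'stopped') &&
        prodProcesses.includes(p.name),
    )
    .map((p) => p.pm2_env.pm_id);
}

// Same condition as the test snippet.
function selectTestCleanup(list) {
  return list
    .filter((p) => p.name && p.name.endsWith('-test'))
    .map((p) => p.pm2_env.pm_id);
}

// Invented sample data for illustration; not taken from the server.
const sample = [
  { name: 'flyer-crawler-api', pm2_env: { status: 'errored', pm_id: 0 } },
  { name: 'flyer-crawler-worker', pm2_env: { status: 'online', pm_id: 1 } },
  { name: 'stock-alert', pm2_env: { status: 'stopped', pm_id: 2 } },
  { name: 'flyer-crawler-api-test', pm2_env: { status: 'online', pm_id: 3 } },
  { name: 'other-app-test', pm2_env: { status: 'online', pm_id: 4 } },
];

console.log(selectProdCleanup(sample)); // [ 0 ]      stock-alert is left alone
console.log(selectTestCleanup(sample)); // [ 3, 4 ]   other-app-test also matches
```
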
---

## Discrepancy Analysis

### Key Mystery

**If the fixes are in place, why did ALL processes get killed?**

### Possible Explanations

#### 1. Workflow Version Mismatch (HIGH PROBABILITY)

**Hypothesis**: Gitea runner cached an older version of the workflow file.

- Gitea Actions may cache workflow definitions
- The runner might have executed an older version without the fix
- Need to verify: What version of `deploy-to-prod.yml` actually executed?

**Investigation Required**:

- Check Gitea workflow execution logs for actual script content
- Verify runner workflow caching behavior
- Compare executed workflow vs repository version

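
One way to answer "what version actually executed" on future runs is to log a content hash of the workflow file at the start of each job. A minimal sketch, assuming Node.js is available on the runner; the helper name `workflowFingerprint` is hypothetical:

```javascript
const crypto = require('node:crypto');

// Hypothetical helper: SHA-256 fingerprint of a workflow file's content.
function workflowFingerprint(content) {
  return crypto.createHash('sha256').update(content, 'utf8').digest('hex');
}

// Example call inside a job (path taken from the fix commit's file list):
// const fs = require('node:fs');
// const text = fs.readFileSync('.gitea/workflows/deploy-to-prod.yml', 'utf8');
// console.log('workflow sha256:', workflowFingerprint(text));
```

Comparing the logged value with the hash of the same file at the deployed commit would show whether the runner executed the repository version.
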
#### 2. Concurrent Workflow Execution (MEDIUM PROBABILITY)

**Hypothesis**: Another workflow containing destructive PM2 commands ran at the same time as the production deployment.

Workflows considered:

- `manual-db-reset-prod.yml` - Does NOT restart PM2 (schema reset only)
- `manual-redis-flush-prod.yml` - Does NOT touch PM2
- A test deployment running concurrently with the prod deployment

**Investigation Required**:

- Check Gitea Actions history for concurrent workflow runs
- Review timestamps of all workflow executions on 2026-02-17

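
Once run timestamps have been collected from the Gitea Actions history, overlaps can be found mechanically. The sketch below assumes a hypothetical record shape (`name`, `startedAt`, `finishedAt` as ISO-8601 strings) that would have to be filled in by hand or by a script; it does not call any Gitea API.

```javascript
// Hypothetical helper: return every pair of runs whose time ranges overlap.
function findOverlappingRuns(runs) {
  const sorted = [...runs].sort(
    (a, b) => Date.parse(a.startedAt) - Date.parse(b.startedAt),
  );
  const overlaps = [];
  for (let i = 0; i < sorted.length; i++) {
    for (let j = i + 1; j < sorted.length; j++) {
      // Runs are sorted by start time, so once run j starts at or after the
      // end of run i, no later run can overlap run i either.
      if (Date.parse(sorted[j].startedAt) >= Date.parse(sorted[i].finishedAt)) break;
      overlaps.push([sorted[i].name, sorted[j].name]);
    }
  }
  return overlaps;
}

// Invented timestamps for illustration only; not taken from the incident.
const exampleRuns = [
  { name: 'run-a', startedAt: '2000-01-01T07:40:00Z', finishedAt: '2000-01-01T07:55:00Z' },
  { name: 'run-b', startedAt: '2000-01-01T07:50:00Z', finishedAt: '2000-01-01T08:20:00Z' },
  { name: 'run-c', startedAt: '2000-01-01T09:00:00Z', finishedAt: '2000-01-01T09:01:00Z' },
];
console.log(findOverlappingRuns(exampleRuns)); // [ [ 'run-a', 'run-b' ] ]
```

An empty result for 2026-02-17 would make this hypothesis much less likely; any pair involving `deploy-to-prod.yml` would be worth reading line by line.
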
#### 3. Manual SSH Command (MEDIUM PROBABILITY)

**Hypothesis**: Someone SSH'd to the server and ran `pm2 stop all` or `pm2 delete all` manually.

**Investigation Required**:

- Check server shell history (if available)
- Review any maintenance windows or manual interventions
- Ask team members about manual actions

#### 4. PM2 Internal Issue (LOW PROBABILITY)

**Hypothesis**: PM2 daemon crash or corruption caused all processes to stop.

**Investigation Required**:

- Check PM2 daemon logs on server
- Look for OOM killer events in system logs
- Check disk space issues during deployment

#### 5. Script Execution Error (LOW PROBABILITY)

**Hypothesis**: JavaScript parsing error caused the filtering logic to be bypassed.

**Investigation Required**:

- Review workflow execution logs for JavaScript errors
- Test the inline Node.js scripts locally
- Check for shell escaping issues

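
When testing the inline scripts locally, one defensive pattern to compare them against is failing closed: if the process list cannot be parsed, nothing is selected for deletion. The sketch below is an illustration under assumptions, not the workflows' code. `parseProcessList` is a hypothetical name, and the input is assumed to be the JSON text printed by `pm2 jlist`.

```javascript
// Hypothetical helper: parse PM2's JSON process list and fail closed.
// On any parse or shape problem it returns an empty list plus an error,
// so a caller that iterates `list` deletes nothing.
function parseProcessList(rawJson) {
  let parsed;
  try {
    parsed = JSON.parse(rawJson);
  } catch (err) {
    return { ok: false, list: [], error: 'invalid JSON: ' + err.message };
  }
  if (!Array.isArray(parsed)) {
    return { ok: false, list: [], error: 'expected a JSON array' };
  }
  // Keep only entries that carry the fields the name filters rely on.
  const list = parsed.filter(
    (p) =>
      p !== null &&
      typeof p === 'object' &&
      typeof p.name === 'string' &&
      p.pm2_env !== null &&
      typeof p.pm2_env === 'object',
  );
  return { ok: true, list, error: null };
}
```
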
---

## Documentation/Code Gaps Identified

### CLAUDE.md Documentation

The PM2 isolation rules are documented in `CLAUDE.md`, but:

- Documentation uses `pm2 restart all` in the Quick Reference table (for dev container - acceptable)
- Multiple docs still reference `pm2 restart all` without environment context
- No incident response runbook for PM2 issues

### Workflow Gaps

1. **No Workflow Audit Trail**: No logging of which exact workflow version executed
2. **No Pre-deployment Verification**: Workflows don't log PM2 state before modifications
3. **No Cross-Application Impact Assessment**: No mechanism to detect/warn about other apps

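
Gaps 1 and 2 both come down to missing log output. As a sketch only, a snapshot of PM2 state could be written as one JSON line before and after cleanup. The helper names and record fields are assumptions; `list` is expected to have the shape the workflow snippets in this report already rely on (`name`, `pm2_env.status`, `pm2_env.pm_id`).

```javascript
// Hypothetical helper: reduce each PM2 entry to the fields worth auditing.
function summarizePm2State(list) {
  return list.map((p) => ({
    id: p.pm2_env.pm_id,
    name: p.name,
    status: p.pm2_env.status,
  }));
}

// Hypothetical helper: one JSON line per snapshot, easy to grep in job logs.
// `label` could be 'pre-cleanup' or 'post-cleanup'.
function formatAuditLine(label, list, now = new Date()) {
  return JSON.stringify({
    label,
    at: now.toISOString(),
    count: list.length,
    processes: summarizePm2State(list),
  });
}
```
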
---

## Next Steps for Root Cause Analysis

### Immediate (Priority 1)

1. [ ] Retrieve Gitea Actions execution logs for v0.15.0 deployment
2. [ ] Extract actual executed workflow content from logs
3. [ ] Check for concurrent workflow executions on 2026-02-17
4. [ ] Review server PM2 daemon logs around incident time

### Short-term (Priority 2)

5. [ ] Implement pre-deployment PM2 state logging in workflows
6. [ ] Add workflow version hash logging for audit trail
7. [ ] Create incident response runbook for PM2/deployment issues

### Long-term (Priority 3)

8. [ ] Evaluate PM2 namespacing for complete process isolation
9. [ ] Consider separate PM2 daemon per application
10. [ ] Implement deployment monitoring/alerting

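
For item 8, PM2 lets processes be grouped under a namespace. The sketch below assumes, and this should be verified against the installed PM2 version, that each entry in the parsed `pm2 jlist` output carries its namespace at `pm2_env.namespace`. The helper name `selectByNamespace` and the `flyer-crawler` namespace value are hypothetical.

```javascript
// Hypothetical helper: pick process names by PM2 namespace instead of by
// name list or suffix. Entries without a namespace field never match.
function selectByNamespace(list, namespace) {
  return list
    .filter((p) => p.pm2_env && p.pm2_env.namespace === namespace)
    .map((p) => p.name);
}

// Invented sample data for illustration; not taken from the server.
const namespaced = [
  { name: 'flyer-crawler-api', pm2_env: { namespace: 'flyer-crawler' } },
  { name: 'flyer-crawler-worker', pm2_env: { namespace: 'flyer-crawler' } },
  { name: 'stock-alert', pm2_env: { namespace: 'default' } },
];
console.log(selectByNamespace(namespaced, 'flyer-crawler'));
// [ 'flyer-crawler-api', 'flyer-crawler-worker' ]
```
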
---

## Related Documentation

- [CLAUDE.md - PM2 Process Isolation](../../../CLAUDE.md) (Critical Rules section)
- [ADR-014: Containerization and Deployment Strategy](../adr/0014-containerization-and-deployment-strategy.md)
- [Deployment Guide](./DEPLOYMENT.md)
- Workflow files in `.gitea/workflows/`

---

## Appendix: Commit Timeline

```
93ad624 ci: Bump version to 0.15.0 for production release [skip ci]  <-- v0.15.0 release
7dd4f21 ci: Bump version to 0.14.4 [skip ci]
174b637 even more typescript fixes
4f80baf ci: Bump version to 0.14.3 [skip ci]
8450b5e Generate TSOA Spec and Routes
e4d830a ci: Bump version to 0.14.2 [skip ci]
b6a62a0 be specific about pm2 processes  <-- PM2 fix commit
2d2cd52 Massive Dependency Modernization Project
```

---

## Revision History

| Date       | Author             | Change                  |
| ---------- | ------------------ | ----------------------- |
| 2026-02-17 | Investigation Team | Initial incident report |