PM2 Process Isolation
All checks were successful
Deploy to Test Environment / deploy-to-test (push) Successful in 30m15s
377
docs/archive/sessions/PM2_SAFEGUARDS_SESSION_2026-02-17.md
Normal file
@@ -0,0 +1,377 @@
# PM2 Process Isolation Safeguards Project

**Session Date**: 2026-02-17
**Status**: Completed
**Triggered By**: Critical production incident during v0.15.0 deployment

---

## Executive Summary

On 2026-02-17, the v0.15.0 production deployment killed ALL PM2 processes on the production server, not just the flyer-crawler processes. This caused unplanned downtime for multiple applications, including `stock-alert.projectium.com`.

The incident occurred even though PM2 process isolation fixes were already in place (commit `b6a62a0`). Investigation suggests the Gitea runner may have executed a cached, older version of the workflow files. In response, we implemented a defense-in-depth strategy with 5 layers of safeguards across all deployment workflows.

---

## Incident Background

### What Happened

| Aspect                | Detail                                                  |
| --------------------- | ------------------------------------------------------- |
| **Date/Time**         | 2026-02-17 ~07:40 UTC                                   |
| **Trigger**           | v0.15.0 production deployment via `deploy-to-prod.yml`  |
| **Impact**            | ALL PM2 processes killed (all environments)             |
| **Collateral Damage** | `stock-alert.projectium.com` and other PM2-managed apps |
| **Severity**          | P1 - Critical                                           |

### Key Mystery

The PM2 process isolation fix was already implemented in commit `b6a62a0` (2026-02-13) and was included in v0.15.0. The fix correctly used whitelist-based filtering:

```javascript
const prodProcesses = [
  'flyer-crawler-api',
  'flyer-crawler-worker',
  'flyer-crawler-analytics-worker',
];
list.forEach((p) => {
  if (
    (p.pm2_env.status === 'errored' || p.pm2_env.status === 'stopped') &&
    prodProcesses.includes(p.name)
  ) {
    exec('pm2 delete ' + p.pm2_env.pm_id);
  }
});
```

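One way to make the filter above easier to reason about is to separate the selection step from the `pm2 delete` call. The following is a minimal sketch under stated assumptions, not the code from commit `b6a62a0`: the function name `selectCleanupTargets`, the sample data, and the process name `stock-alert` are hypothetical, and `list` is assumed to have the shape of parsed `pm2 jlist` output.

```javascript
'use strict';

// Allowlist taken from the snippet above.
const PROD_PROCESSES = [
  'flyer-crawler-api',
  'flyer-crawler-worker',
  'flyer-crawler-analytics-worker',
];

// Hypothetical helper: returns the processes the cleanup would delete,
// without deleting anything. A process qualifies only if it is errored
// or stopped AND its name is on the allowlist.
function selectCleanupTargets(list, allowedNames) {
  return list.filter(
    (p) =>
      (p.pm2_env.status === 'errored' || p.pm2_env.status === 'stopped') &&
      allowedNames.includes(p.name),
  );
}

// Hypothetical sample data in the assumed `pm2 jlist` shape.
const sample = [
  { name: 'flyer-crawler-api', pm2_env: { status: 'errored', pm_id: 0 } },
  { name: 'flyer-crawler-worker', pm2_env: { status: 'online', pm_id: 1 } },
  { name: 'stock-alert', pm2_env: { status: 'stopped', pm_id: 2 } },
];

const targetIds = selectCleanupTargets(sample, PROD_PROCESSES).map(
  (p) => p.pm2_env.pm_id,
);
console.log(targetIds); // only the errored flyer-crawler-api entry is selected
```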
**Hypothesis**: The Gitea runner executed a cached, older version of the workflow file that did not contain the fix.

---

## Solution: Defense-in-Depth Safeguards

Rather than relying solely on the filter logic (which may be correct but not executed), we implemented 5 layers of safeguards that provide visibility, validation, and automatic abort capabilities.

### Safeguard Layers

| Layer | Name                              | Purpose                                                 |
| ----- | --------------------------------- | ------------------------------------------------------- |
| 1     | **Workflow Metadata Logging**     | Audit trail of which workflow version actually executed |
| 2     | **Pre-Cleanup PM2 State Logging** | Capture full process list before any modifications      |
| 3     | **Process Count Validation**      | SAFETY ABORT if filter would delete ALL processes       |
| 4     | **Explicit Name Verification**    | Log exactly which processes will be affected            |
| 5     | **Post-Cleanup Verification**     | Verify environment isolation after cleanup              |

### Layer Details

#### Layer 1: Workflow Metadata Logging

Logs at the start of deployment:

- Workflow file name
- SHA-256 hash of the workflow file
- Git commit being deployed
- Git branch
- Timestamp (UTC)
- Actor (who triggered the deployment)

**Purpose**: If an incident occurs, we can verify whether the executed workflow matches the repository version.

```bash
echo "=== WORKFLOW METADATA ==="
echo "Workflow file: deploy-to-prod.yml"
echo "Workflow file hash: $(sha256sum .gitea/workflows/deploy-to-prod.yml | cut -d' ' -f1)"
echo "Git commit: $(git rev-parse HEAD)"
echo "Timestamp: $(date -u '+%Y-%m-%d %H:%M:%S UTC')"
echo "Actor: ${{ gitea.actor }}"
echo "=== END METADATA ==="
```

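After an incident, the logged hash can be compared with the hash of the workflow file as committed. The sketch below shows one possible way to do that comparison in Node.js. It is illustrative only: the helper names `sha256Hex` and `workflowMatchesLog` are hypothetical and are not part of the deployed workflows.

```javascript
'use strict';
const crypto = require('node:crypto');
const fs = require('node:fs');

// SHA-256 of a string or Buffer as lowercase hex, the same digest
// format that `sha256sum` prints in its first column.
function sha256Hex(content) {
  return crypto.createHash('sha256').update(content).digest('hex');
}

// Hypothetical helper: true when the hash copied from the deployment log
// matches the hash of the workflow file at `workflowPath`.
function workflowMatchesLog(workflowPath, loggedHash) {
  const actual = sha256Hex(fs.readFileSync(workflowPath));
  return actual === loggedHash.trim().toLowerCase();
}
```

A mismatch would support the stale-workflow hypothesis, while a match would point the investigation elsewhere.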
#### Layer 2: Pre-Cleanup PM2 State Logging

Captures full PM2 process list in JSON format before any modifications.

**Purpose**: Provides forensic evidence of what processes existed before cleanup began.

```bash
echo "=== PRE-CLEANUP PM2 STATE ==="
pm2 jlist
echo "=== END PRE-CLEANUP STATE ==="
```

#### Layer 3: Process Count Validation (SAFETY ABORT)

This is the most critical safeguard. It aborts the entire deployment if the filter would delete ALL processes and there are more than 3 processes in total.

**Purpose**: Catches filter bugs or unexpected conditions that would result in catastrophic process deletion.

```javascript
// SAFEGUARD 1: Process count validation
const totalProcesses = list.length;
if (targetProcesses.length === totalProcesses && totalProcesses > 3) {
  console.error('SAFETY ABORT: Filter would delete ALL processes!');
  console.error(
    'Total processes: ' + totalProcesses + ', Target processes: ' + targetProcesses.length,
  );
  console.error('This indicates a potential filter bug. Aborting cleanup.');
  process.exit(1);
}
```

The `SAFEGUARD 1` label is the numbering used inside the cleanup script, where this check precedes `SAFEGUARD 2` (the name verification described under Layer 4).

**Threshold Rationale**: The threshold of 3 allows normal operation when only the 3 expected processes exist (API, Worker, Analytics Worker) while catching anomalies when the server hosts more applications.

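The abort condition can be isolated as a pure function to make the threshold behaviour explicit. This is a minimal sketch, assuming the same rule as the snippet above. The name `shouldAbortCleanup` and the `threshold` parameter are hypothetical and do not appear in the workflow files.

```javascript
'use strict';

// Hypothetical pure-function form of the Layer 3 check: abort only when
// every process is targeted AND the server hosts more than `threshold`
// processes.
function shouldAbortCleanup(totalProcesses, targetCount, threshold = 3) {
  return targetCount === totalProcesses && totalProcesses > threshold;
}

console.log(shouldAbortCleanup(3, 3)); // false: exactly the 3 expected processes
console.log(shouldAbortCleanup(4, 4)); // true: all targeted, above the threshold
console.log(shouldAbortCleanup(15, 3)); // false: only a subset is targeted
```

Note that under this rule the check fires only when every process is targeted. A filter that wrongly selects all processes but one would pass it, which is one reason the other layers log the exact targets.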
#### Layer 4: Explicit Name Verification

Logs the exact name, status, and PM2 ID of each process that will be deleted.

**Purpose**: Provides clear visibility into what the cleanup operation will actually do.

```javascript
console.log('Found ' + targetProcesses.length + ' PRODUCTION processes to clean:');
targetProcesses.forEach((p) => {
  console.log(
    ' - ' + p.name + ' (status: ' + p.pm2_env.status + ', pm_id: ' + p.pm2_env.pm_id + ')',
  );
});
```

#### Layer 5: Post-Cleanup Verification

After cleanup, logs the state of processes by environment to verify isolation was maintained.

**Purpose**: Immediately identifies if the cleanup affected the wrong environment.

```bash
echo "=== POST-CLEANUP VERIFICATION ==="
pm2 jlist | node -e "
  const list = JSON.parse(require('fs').readFileSync(0, 'utf-8'));
  const prodProcesses = list.filter(p => p.name && p.name.startsWith('flyer-crawler-') && !p.name.endsWith('-test'));
  const testProcesses = list.filter(p => p.name && p.name.endsWith('-test'));
  console.log('Production processes after cleanup: ' + prodProcesses.length);
  console.log('Test processes (should be untouched): ' + testProcesses.length);
"
echo "=== END POST-CLEANUP VERIFICATION ==="
```

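The two name predicates used in the verification step can also be exercised outside the workflow. The sketch below reuses those predicates and adds an `other` bucket for processes that belong to neither environment. The function name `groupByEnvironment`, the `other` bucket, and the sample process names `flyer-crawler-api-test` and `stock-alert` are assumptions made for illustration.

```javascript
'use strict';

// Predicates copied from the post-cleanup verification step.
const isTest = (p) => Boolean(p.name) && p.name.endsWith('-test');
const isProd = (p) =>
  Boolean(p.name) && p.name.startsWith('flyer-crawler-') && !p.name.endsWith('-test');

// Hypothetical helper: buckets process names by environment so that a
// cross-environment match is easy to spot in a test or a log line.
function groupByEnvironment(list) {
  const groups = { production: [], test: [], other: [] };
  for (const p of list) {
    if (isProd(p)) groups.production.push(p.name);
    else if (isTest(p)) groups.test.push(p.name);
    else groups.other.push(p.name);
  }
  return groups;
}

// Hypothetical sample names, not taken from the production server.
const groups = groupByEnvironment([
  { name: 'flyer-crawler-api' },
  { name: 'flyer-crawler-api-test' },
  { name: 'stock-alert' },
]);
console.log(groups);
```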
---

## Implementation Details

### Files Modified

| File                                       | Changes                                       |
| ------------------------------------------ | --------------------------------------------- |
| `.gitea/workflows/deploy-to-prod.yml`      | Added all 5 safeguard layers                  |
| `.gitea/workflows/deploy-to-test.yml`      | Added all 5 safeguard layers                  |
| `.gitea/workflows/manual-deploy-major.yml` | Added all 5 safeguard layers                  |
| `CLAUDE.md`                                | Added PM2 Process Isolation Incidents section |

### Files Created

| File                                                      | Purpose                                 |
| --------------------------------------------------------- | --------------------------------------- |
| `docs/operations/INCIDENT-2026-02-17-PM2-PROCESS-KILL.md` | Detailed incident report                |
| `docs/operations/PM2-INCIDENT-RESPONSE.md`                | Comprehensive incident response runbook |
| `tests/qa/test-pm2-safeguard-logic.js`                    | Validation tests for safeguard logic    |

---

## Testing and Validation

### Test Artifact

A standalone JavaScript test file was created to validate the safeguard logic:

**File**: `tests/qa/test-pm2-safeguard-logic.js`

**Test Categories**:

1. **Normal Operations (should NOT abort)**
   - 3 errored out of 15 processes
   - 1 errored out of 10 processes
   - 0 processes to clean
   - Fresh server with 3 processes (threshold boundary)

2. **Dangerous Operations (SHOULD abort)**
   - All 10 processes targeted
   - All 15 processes targeted
   - All 4 processes targeted (just above threshold)

3. **Workflow-Specific Filter Tests**
   - Production filter only matches production processes
   - Test filter only matches `-test` suffix processes
   - Filters don't cross-contaminate environments

### Test Results

All 11 scenarios passed:

| Scenario                   | Total | Target | Expected | Result |
| -------------------------- | ----- | ------ | -------- | ------ |
| Normal prod cleanup        | 15    | 3      | No abort | PASS   |
| Normal test cleanup        | 15    | 3      | No abort | PASS   |
| Single process             | 10    | 1      | No abort | PASS   |
| No cleanup needed          | 10    | 0      | No abort | PASS   |
| Fresh server (threshold)   | 3     | 3      | No abort | PASS   |
| Minimal server             | 2     | 2      | No abort | PASS   |
| Empty PM2                  | 0     | 0      | No abort | PASS   |
| Filter bug - 10 processes  | 10    | 10     | ABORT    | PASS   |
| Filter bug - 15 processes  | 15    | 15     | ABORT    | PASS   |
| Filter bug - 4 processes   | 4     | 4      | ABORT    | PASS   |
| Filter bug - 100 processes | 100   | 100    | ABORT    | PASS   |

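The scenarios in the table can be replayed with a small table-driven check. The sketch below is not the content of `tests/qa/test-pm2-safeguard-logic.js`, which is not reproduced in this summary. It only re-applies the Layer 3 condition to the table's `Total` and `Target` columns, and the names `shouldAbort`, `scenarios`, and `failures` are hypothetical.

```javascript
'use strict';

// Same condition as the Layer 3 safeguard, with the threshold fixed at 3.
const shouldAbort = (total, target) => target === total && total > 3;

// Rows copied from the results table: [scenario, total, target, expected abort].
const scenarios = [
  ['Normal prod cleanup', 15, 3, false],
  ['Normal test cleanup', 15, 3, false],
  ['Single process', 10, 1, false],
  ['No cleanup needed', 10, 0, false],
  ['Fresh server (threshold)', 3, 3, false],
  ['Minimal server', 2, 2, false],
  ['Empty PM2', 0, 0, false],
  ['Filter bug - 10 processes', 10, 10, true],
  ['Filter bug - 15 processes', 15, 15, true],
  ['Filter bug - 4 processes', 4, 4, true],
  ['Filter bug - 100 processes', 100, 100, true],
];

// Collect the names of any scenarios whose outcome differs from the table.
const failures = scenarios
  .filter(([, total, target, expected]) => shouldAbort(total, target) !== expected)
  .map(([name]) => name);

console.log(
  failures.length === 0
    ? 'All ' + scenarios.length + ' scenarios passed'
    : 'Failed: ' + failures.join(', '),
);
```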
### YAML Validation

All workflow files passed YAML syntax validation using `python -c "import yaml; yaml.safe_load(open(...))"`.

---

## Documentation Updates

### CLAUDE.md Updates

Added new section at line 293: **PM2 Process Isolation Incidents**

Contains:

- Reference to the 2026-02-17 incident
- Impact summary
- Prevention measures list
- Response instructions
- Links to related documentation

### docs/README.md

Added incident report reference under **Operations > Incident Reports**.

### Cross-References Verified

| Document        | Reference                               | Status |
| --------------- | --------------------------------------- | ------ |
| CLAUDE.md       | PM2-INCIDENT-RESPONSE.md                | Valid  |
| CLAUDE.md       | INCIDENT-2026-02-17-PM2-PROCESS-KILL.md | Valid  |
| Incident Report | CLAUDE.md PM2 section                   | Valid  |
| Incident Report | PM2-INCIDENT-RESPONSE.md                | Valid  |
| docs/README.md  | INCIDENT-2026-02-17-PM2-PROCESS-KILL.md | Valid  |

---

## Lessons Learned

### Technical Lessons

1. **Filter logic alone is not sufficient** - Even correct filters can be bypassed if an older version of the script is executed.

2. **Workflow caching is a real risk** - CI/CD runners may cache workflow files, leading to stale versions being executed.

3. **Defense-in-depth is essential for destructive operations** - Multiple layers of validation catch failures that single-point checks miss.

4. **Visibility enables diagnosis** - Pre/post state logging makes root cause analysis possible.

5. **Automatic abort prevents cascading failures** - The process count validation could have prevented the incident, provided the workflow version containing it was the one actually executed.

### Process Lessons

1. **Shared PM2 daemons are risky** - Multiple applications sharing a PM2 daemon create cross-application dependencies.

2. **Documentation should include failure modes** - CLAUDE.md now explicitly documents what can go wrong and how to respond.

3. **Runbooks save time during incidents** - The incident response runbook provides step-by-step guidance when time is critical.

---

## Future Considerations

### Not Implemented (Potential Future Work)

1. **PM2 Namespacing** - Use PM2's native namespace feature to completely isolate environments.

2. **Separate PM2 Daemons** - Run one PM2 daemon per application to eliminate cross-application risk.

3. **Deployment Locks** - Implement mutex-style locks to prevent concurrent deployments.

4. **Workflow Version Verification** - Add a pre-flight check that compares workflow hash against expected value.

5. **Automated Rollback** - Implement automatic process restoration if safeguards detect a problem.

---

## Related Documentation

- **ADR-061**: [PM2 Process Isolation Safeguards](../../adr/0061-pm2-process-isolation-safeguards.md)
- **Incident Report**: [INCIDENT-2026-02-17-PM2-PROCESS-KILL.md](../../operations/INCIDENT-2026-02-17-PM2-PROCESS-KILL.md)
- **Response Runbook**: [PM2-INCIDENT-RESPONSE.md](../../operations/PM2-INCIDENT-RESPONSE.md)
- **CLAUDE.md Section**: [PM2 Process Isolation Incidents](../../../CLAUDE.md#pm2-process-isolation-incidents)
- **Test Artifact**: [test-pm2-safeguard-logic.js](../../../tests/qa/test-pm2-safeguard-logic.js)
- **ADR-014**: [Containerization and Deployment Strategy](../../adr/0014-containerization-and-deployment-strategy.md)

---

## Appendix: Workflow Changes Summary

### deploy-to-prod.yml

```diff
+ - name: Log Workflow Metadata
+   run: |
+     echo "=== WORKFLOW METADATA ==="
+     echo "Workflow file: deploy-to-prod.yml"
+     echo "Workflow file hash: $(sha256sum .gitea/workflows/deploy-to-prod.yml | cut -d' ' -f1)"
+     ...

  - name: Install Backend Dependencies and Restart Production Server
    run: |
+     # === PRE-CLEANUP PM2 STATE LOGGING ===
+     echo "=== PRE-CLEANUP PM2 STATE ==="
+     pm2 jlist
+     echo "=== END PRE-CLEANUP STATE ==="
+
      # --- Cleanup Errored Processes with Defense-in-Depth Safeguards ---
      node -e "
        ...
+       // SAFEGUARD 1: Process count validation
+       if (targetProcesses.length === totalProcesses && totalProcesses > 3) {
+         console.error('SAFETY ABORT: Filter would delete ALL processes!');
+         process.exit(1);
+       }
+
+       // SAFEGUARD 2: Explicit name verification
+       console.log('Found ' + targetProcesses.length + ' PRODUCTION processes to clean:');
+       targetProcesses.forEach(p => {
+         console.log(' - ' + p.name + ' (status: ' + p.pm2_env.status + ')');
+       });
        ...
      "
+
+     # === POST-CLEANUP VERIFICATION ===
+     echo "=== POST-CLEANUP VERIFICATION ==="
+     pm2 jlist | node -e "..."
+     echo "=== END POST-CLEANUP VERIFICATION ==="
```

Similar changes were applied to `deploy-to-test.yml` and `manual-deploy-major.yml`.

---

## Session Participants

| Role         | Agent Type                | Responsibility                        |
| ------------ | ------------------------- | ------------------------------------- |
| Orchestrator | Main Claude               | Session coordination and delegation   |
| Planner      | planner subagent          | Incident analysis and solution design |
| Documenter   | describer-for-ai subagent | Incident report creation              |
| Coder #1     | coder subagent            | Workflow safeguard implementation     |
| Coder #2     | coder subagent            | Incident response runbook creation    |
| Coder #3     | coder subagent            | CLAUDE.md updates                     |
| Tester       | tester subagent           | Comprehensive validation              |
| Archivist    | Lead Technical Archivist  | Final documentation                   |

---

## Revision History

| Date       | Author                   | Change                  |
| ---------- | ------------------------ | ----------------------- |
| 2026-02-17 | Lead Technical Archivist | Initial session summary |