Torben Sorensen c059b30201


PM2 Process Isolation

2026-02-17 20:49:01 -08:00


PM2 Process Isolation Safeguards Project

Session Date: 2026-02-17
Status: Completed
Triggered By: Critical production incident during v0.15.0 deployment

Executive Summary

On 2026-02-17, the v0.15.0 production deployment killed ALL PM2 processes on the production server, not just the flyer-crawler processes. This caused unplanned downtime for multiple applications, including stock-alert.projectium.com.

Despite PM2 process isolation fixes already being in place (commit b6a62a0), the incident still occurred. Investigation suggests the Gitea runner may have executed a cached/older version of the workflow files. In response, we implemented a comprehensive defense-in-depth strategy with 5 layers of safeguards across all deployment workflows.

Incident Background

What Happened

Aspect	Detail
Date/Time	2026-02-17 ~07:40 UTC
Trigger	v0.15.0 production deployment via `deploy-to-prod.yml`
Impact	ALL PM2 processes killed (all environments)
Collateral Damage	`stock-alert.projectium.com` and other PM2-managed apps
Severity	P1 - Critical

Key Mystery

The PM2 process isolation fix was already implemented in commit b6a62a0 (2026-02-13) and was included in v0.15.0. The fix correctly used whitelist-based filtering:

const prodProcesses = [
  'flyer-crawler-api',
  'flyer-crawler-worker',
  'flyer-crawler-analytics-worker',
];
list.forEach((p) => {
  if (
    (p.pm2_env.status === 'errored' || p.pm2_env.status === 'stopped') &&
    prodProcesses.includes(p.name)
  ) {
    exec('pm2 delete ' + p.pm2_env.pm_id);
  }
});

Hypothesis: Gitea runner executed a cached older version of the workflow file that did not contain the fix.
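The whitelist filter above can be read as a pure selection step that produces the list of cleanup targets. The following is a minimal sketch of that reading, not the workflow's actual code: the helper name `selectCleanupTargets` and the sample data (including the process name `stock-alert`) are assumptions made for illustration.

```javascript
// Hypothetical helper mirroring the whitelist excerpt; not taken from the workflow.
const PROD_PROCESSES = [
  'flyer-crawler-api',
  'flyer-crawler-worker',
  'flyer-crawler-analytics-worker',
];

// Keep only whitelisted processes whose status is errored or stopped.
function selectCleanupTargets(list, allowedNames) {
  return list.filter((p) => {
    const status = p.pm2_env && p.pm2_env.status;
    return (status === 'errored' || status === 'stopped') && allowedNames.includes(p.name);
  });
}

// Made-up sample shaped like `pm2 jlist` output.
const sample = [
  { name: 'flyer-crawler-api', pm2_env: { status: 'errored', pm_id: 0 } },
  { name: 'flyer-crawler-worker', pm2_env: { status: 'online', pm_id: 1 } },
  { name: 'stock-alert', pm2_env: { status: 'stopped', pm_id: 2 } },
];

const targets = selectCleanupTargets(sample, PROD_PROCESSES);
console.log(targets.map((p) => p.name)); // [ 'flyer-crawler-api' ]
```

In this sample, the stopped process from another application is left alone because its name is not on the whitelist, which is the behavior the fix was meant to guarantee.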

Solution: Defense-in-Depth Safeguards

Rather than relying solely on the filter logic (which may be correct but not executed), we implemented 5 layers of safeguards that provide visibility, validation, and automatic abort capabilities.

Safeguard Layers

Layer	Name	Purpose
1	Workflow Metadata Logging	Audit trail of which workflow version actually executed
2	Pre-Cleanup PM2 State Logging	Capture full process list before any modifications
3	Process Count Validation	SAFETY ABORT if filter would delete ALL processes
4	Explicit Name Verification	Log exactly which processes will be affected
5	Post-Cleanup Verification	Verify environment isolation after cleanup

Layer Details

Layer 1: Workflow Metadata Logging

Logs at the start of deployment:

Workflow file name
SHA-256 hash of the workflow file
Git commit being deployed
Git branch
Timestamp (UTC)
Actor (who triggered the deployment)

Purpose: If an incident occurs, we can verify whether the executed workflow matches the repository version.

echo "=== WORKFLOW METADATA ==="
echo "Workflow file: deploy-to-prod.yml"
echo "Workflow file hash: $(sha256sum .gitea/workflows/deploy-to-prod.yml | cut -d' ' -f1)"
echo "Git commit: $(git rev-parse HEAD)"
echo "Timestamp: $(date -u '+%Y-%m-%d %H:%M:%S UTC')"
echo "Actor: ${{ gitea.actor }}"
echo "=== END METADATA ==="
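After an incident, the logged hash can be compared with the hash of the workflow file at the deployed commit. The sketch below shows one way to do that comparison in Node.js. The function names are hypothetical, and the in-memory strings stand in for real file contents, which would typically be read with `fs.readFileSync('.gitea/workflows/deploy-to-prod.yml')` from a checkout of that commit.

```javascript
const crypto = require('node:crypto');

// Hex-encoded SHA-256, the same format that `sha256sum` prints in its first column.
function sha256Hex(content) {
  return crypto.createHash('sha256').update(content).digest('hex');
}

// True when the hash copied from the deployment log matches the repository content.
function workflowMatches(loggedHash, repoContent) {
  return sha256Hex(repoContent) === loggedHash.trim().toLowerCase();
}

// Illustrative stand-ins for two versions of a workflow file.
const repoVersion = 'name: deploy\n';
const staleVersion = 'name: deploy (old)\n';

console.log(workflowMatches(sha256Hex(repoVersion), repoVersion)); // true
console.log(workflowMatches(sha256Hex(staleVersion), repoVersion)); // false
```

A mismatch would support the stale-workflow hypothesis; a match would point the investigation elsewhere.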

Layer 2: Pre-Cleanup PM2 State Logging

Captures the full PM2 process list in JSON format before any modifications.

Purpose: Provides forensic evidence of what processes existed before cleanup began.

echo "=== PRE-CLEANUP PM2 STATE ==="
pm2 jlist
echo "=== END PRE-CLEANUP STATE ==="
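Raw `pm2 jlist` output is verbose, so a small script can help when reading it back during an investigation. This sketch groups process names by status; the function name `summarizeByStatus` and the sample entries are assumptions, and only the `name` and `pm2_env.status` fields already used elsewhere in this document are read.

```javascript
// Group process names by their PM2 status from a captured `pm2 jlist` JSON string.
function summarizeByStatus(jlistOutput) {
  const list = JSON.parse(jlistOutput);
  const summary = {};
  for (const p of list) {
    const status = (p.pm2_env && p.pm2_env.status) || 'unknown';
    if (!summary[status]) summary[status] = [];
    summary[status].push(p.name);
  }
  return summary;
}

// Made-up capture; a real one would be copied from the deployment log.
const capturedLog = JSON.stringify([
  { name: 'flyer-crawler-api', pm2_env: { status: 'online' } },
  { name: 'flyer-crawler-worker', pm2_env: { status: 'errored' } },
  { name: 'stock-alert', pm2_env: { status: 'online' } },
]);

const summary = summarizeByStatus(capturedLog);
console.log(summary);
```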

Layer 3: Process Count Validation (SAFETY ABORT)

The most critical safeguard. Aborts the entire deployment if the filter would delete ALL processes and there are more than 3 processes total.

Purpose: Catches filter bugs or unexpected conditions that would result in catastrophic process deletion.

// SAFEGUARD 1: Process count validation
const totalProcesses = list.length;
if (targetProcesses.length === totalProcesses && totalProcesses > 3) {
  console.error('SAFETY ABORT: Filter would delete ALL processes!');
  console.error(
    'Total processes: ' + totalProcesses + ', Target processes: ' + targetProcesses.length,
  );
  console.error('This indicates a potential filter bug. Aborting cleanup.');
  process.exit(1);
}

Threshold Rationale: The threshold of 3 allows normal operation when only the 3 expected processes exist (API, Worker, Analytics Worker) while catching anomalies when the server hosts more applications.
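The abort condition can be isolated as a small predicate to make the threshold boundary easy to reason about. This is a sketch under the assumption that the check depends only on the two counts; the function name `shouldAbortCleanup` and the `threshold` parameter are illustrative and do not appear in the workflow.

```javascript
// True when the filter targets every process and the server hosts more than
// `threshold` processes, which is the condition the safeguard treats as a bug.
function shouldAbortCleanup(totalProcesses, targetCount, threshold = 3) {
  return targetCount === totalProcesses && totalProcesses > threshold;
}

console.log(shouldAbortCleanup(3, 3)); // false: only the 3 expected processes exist
console.log(shouldAbortCleanup(4, 4)); // true: just above the threshold
console.log(shouldAbortCleanup(15, 3)); // false: normal partial cleanup
```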

Layer 4: Explicit Name Verification

Logs the exact name, status, and PM2 ID of each process that will be deleted.

Purpose: Provides clear visibility into what the cleanup operation will actually do.

console.log('Found ' + targetProcesses.length + ' PRODUCTION processes to clean:');
targetProcesses.forEach((p) => {
  console.log(
    '  - ' + p.name + ' (status: ' + p.pm2_env.status + ', pm_id: ' + p.pm2_env.pm_id + ')',
  );
});

Layer 5: Post-Cleanup Verification

After cleanup, logs the state of processes by environment to verify isolation was maintained.

Purpose: Immediately identifies if the cleanup affected the wrong environment.

echo "=== POST-CLEANUP VERIFICATION ==="
pm2 jlist | node -e "
  const list = JSON.parse(require('fs').readFileSync(0, 'utf-8'));
  const prodProcesses = list.filter(p => p.name && p.name.startsWith('flyer-crawler-') && !p.name.endsWith('-test'));
  const testProcesses = list.filter(p => p.name && p.name.endsWith('-test'));
  console.log('Production processes after cleanup: ' + prodProcesses.length);
  console.log('Test processes (should be untouched): ' + testProcesses.length);
"
echo "=== END POST-CLEANUP VERIFICATION ==="
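The two name filters in the verification step can be factored into predicates and checked against sample data. The sketch below uses the same prefix and suffix rules as the snippet above; the helper `partitionByEnvironment` and the sample process names are assumptions for illustration.

```javascript
// Same rules as the verification snippet: production processes carry the
// 'flyer-crawler-' prefix without a '-test' suffix; test processes end in '-test'.
const isProd = (p) =>
  Boolean(p.name) && p.name.startsWith('flyer-crawler-') && !p.name.endsWith('-test');
const isTest = (p) => Boolean(p.name) && p.name.endsWith('-test');

// Split a process list into production, test, and everything else.
function partitionByEnvironment(list) {
  return {
    production: list.filter(isProd).map((p) => p.name),
    test: list.filter(isTest).map((p) => p.name),
    other: list.filter((p) => !isProd(p) && !isTest(p)).map((p) => p.name),
  };
}

// Made-up process names for illustration.
const processes = [
  { name: 'flyer-crawler-api' },
  { name: 'flyer-crawler-api-test' },
  { name: 'stock-alert' },
];

const parts = partitionByEnvironment(processes);
console.log(parts.production); // [ 'flyer-crawler-api' ]
console.log(parts.test); // [ 'flyer-crawler-api-test' ]
console.log(parts.other); // [ 'stock-alert' ]
```

Note that the test rule matches any process ending in `-test`, so on a shared server the test count could include processes from other applications that follow the same naming pattern.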

Implementation Details

Files Modified

File	Changes
`.gitea/workflows/deploy-to-prod.yml`	Added all 5 safeguard layers
`.gitea/workflows/deploy-to-test.yml`	Added all 5 safeguard layers
`.gitea/workflows/manual-deploy-major.yml`	Added all 5 safeguard layers
`CLAUDE.md`	Added PM2 Process Isolation Incidents section

Files Created

File	Purpose
`docs/operations/INCIDENT-2026-02-17-PM2-PROCESS-KILL.md`	Detailed incident report
`docs/operations/PM2-INCIDENT-RESPONSE.md`	Comprehensive incident response runbook
`tests/qa/test-pm2-safeguard-logic.js`	Validation tests for safeguard logic

Testing and Validation

Test Artifact

A standalone JavaScript test file was created to validate the safeguard logic:

File: tests/qa/test-pm2-safeguard-logic.js

Test Categories:

Normal Operations (should NOT abort):
- 3 errored out of 15 processes
- 1 errored out of 10 processes
- 0 processes to clean
- Fresh server with 3 processes (threshold boundary)

Dangerous Operations (SHOULD abort):
- All 10 processes targeted
- All 15 processes targeted
- All 4 processes targeted (just above threshold)

Workflow-Specific Filter Tests:
- Production filter only matches production processes
- Test filter only matches -test suffix processes
- Filters don't cross-contaminate environments

Test Results

All 11 scenarios passed:

Scenario	Total	Target	Expected	Result
Normal prod cleanup	15	3	No abort	PASS
Normal test cleanup	15	3	No abort	PASS
Single process	10	1	No abort	PASS
No cleanup needed	10	0	No abort	PASS
Fresh server (threshold)	3	3	No abort	PASS
Minimal server	2	2	No abort	PASS
Empty PM2	0	0	No abort	PASS
Filter bug - 10 processes	10	10	ABORT	PASS
Filter bug - 15 processes	15	15	ABORT	PASS
Filter bug - 4 processes	4	4	ABORT	PASS
Filter bug - 100 processes	100	100	ABORT	PASS

YAML Validation

All workflow files passed YAML syntax validation using python -c "import yaml; yaml.safe_load(open(...))".

Documentation Updates

CLAUDE.md Updates

Added new section at line 293: PM2 Process Isolation Incidents

Contains:

Reference to the 2026-02-17 incident
Impact summary
Prevention measures list
Response instructions
Links to related documentation

docs/README.md

Added incident report reference under Operations > Incident Reports.

Cross-References Verified

Document	Reference	Status
CLAUDE.md	PM2-INCIDENT-RESPONSE.md	Valid
CLAUDE.md	INCIDENT-2026-02-17-PM2-PROCESS-KILL.md	Valid
Incident Report	CLAUDE.md PM2 section	Valid
Incident Report	PM2-INCIDENT-RESPONSE.md	Valid
docs/README.md	INCIDENT-2026-02-17-PM2-PROCESS-KILL.md	Valid

Lessons Learned

Technical Lessons

Filter logic alone is not sufficient - Even correct filters can be bypassed if an older version of the script is executed.
Workflow caching is a plausible risk - CI/CD runners may cache workflow files and execute a stale version. For this incident, that remains a hypothesis rather than a confirmed cause.
Defense-in-depth is essential for destructive operations - Multiple layers of validation catch failures that single-point checks miss.
Visibility enables diagnosis - Pre/post state logging makes root cause analysis possible.
Automatic abort limits the blast radius - The process count validation would have stopped a filter that matched every process, provided the safeguard was part of the workflow version that actually ran.

Process Lessons

Shared PM2 daemons are risky - Multiple applications sharing a PM2 daemon create cross-application dependencies.
Documentation should include failure modes - CLAUDE.md now explicitly documents what can go wrong and how to respond.
Runbooks save time during incidents - The incident response runbook provides step-by-step guidance when time is critical.

Future Considerations

Not Implemented (Potential Future Work)

PM2 Namespacing - Use PM2's native namespace feature to completely isolate environments.
Separate PM2 Daemons - Run one PM2 daemon per application to eliminate cross-application risk.
Deployment Locks - Implement mutex-style locks to prevent concurrent deployments.
Workflow Version Verification - Add a pre-flight check that compares workflow hash against expected value.
Automated Rollback - Implement automatic process restoration if safeguards detect a problem.
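As an illustration of the deployment lock idea, a lock file created with an exclusive flag gives mutex-style behavior on a single host. This is a sketch only: the helper names and the lock path are hypothetical, nothing like it exists in the workflows today, and a real implementation would also need a way to clear a lock left behind by a crashed deployment.

```javascript
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// Try to create the lock file exclusively; 'wx' fails with EEXIST if it already exists.
function acquireLock(lockPath) {
  try {
    const fd = fs.openSync(lockPath, 'wx');
    fs.writeFileSync(fd, String(process.pid)); // record the holder for debugging
    fs.closeSync(fd);
    return true;
  } catch (err) {
    if (err.code === 'EEXIST') return false; // another deployment holds the lock
    throw err;
  }
}

// Remove the lock file; `force` ignores a missing file.
function releaseLock(lockPath) {
  fs.rmSync(lockPath, { force: true });
}

// Demo against a throwaway path in the temp directory (hypothetical name).
const lockPath = path.join(os.tmpdir(), 'deploy-demo-' + process.pid + '.lock');
releaseLock(lockPath); // clear any leftover from a previous run
console.log(acquireLock(lockPath)); // true: first deployment gets the lock
console.log(acquireLock(lockPath)); // false: a concurrent deployment is refused
releaseLock(lockPath);
```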

Related Documentation

ADR-061: PM2 Process Isolation Safeguards
Incident Report: INCIDENT-2026-02-17-PM2-PROCESS-KILL.md
Response Runbook: PM2-INCIDENT-RESPONSE.md
CLAUDE.md Section: PM2 Process Isolation Incidents
Test Artifact: test-pm2-safeguard-logic.js
ADR-014: Containerization and Deployment Strategy

Appendix: Workflow Changes Summary

deploy-to-prod.yml

+ - name: Log Workflow Metadata
+   run: |
+       echo "=== WORKFLOW METADATA ==="
+       echo "Workflow file: deploy-to-prod.yml"
+       echo "Workflow file hash: $(sha256sum .gitea/workflows/deploy-to-prod.yml | cut -d' ' -f1)"
+       ...

  - name: Install Backend Dependencies and Restart Production Server
    run: |
+       # === PRE-CLEANUP PM2 STATE LOGGING ===
+       echo "=== PRE-CLEANUP PM2 STATE ==="
+       pm2 jlist
+       echo "=== END PRE-CLEANUP STATE ==="
+
        # --- Cleanup Errored Processes with Defense-in-Depth Safeguards ---
        node -e "
          ...
+         // SAFEGUARD 1: Process count validation
+         if (targetProcesses.length === totalProcesses && totalProcesses > 3) {
+           console.error('SAFETY ABORT: Filter would delete ALL processes!');
+           process.exit(1);
+         }
+
+         // SAFEGUARD 2: Explicit name verification
+         console.log('Found ' + targetProcesses.length + ' PRODUCTION processes to clean:');
+         targetProcesses.forEach(p => {
+           console.log('  - ' + p.name + ' (status: ' + p.pm2_env.status + ')');
+         });
          ...
        "
+
+       # === POST-CLEANUP VERIFICATION ===
+       echo "=== POST-CLEANUP VERIFICATION ==="
+       pm2 jlist | node -e "..."
+       echo "=== END POST-CLEANUP VERIFICATION ==="

Similar changes were applied to deploy-to-test.yml and manual-deploy-major.yml.

Session Participants

Role	Agent Type	Responsibility
Orchestrator	Main Claude	Session coordination and delegation
Planner	planner subagent	Incident analysis and solution design
Documenter	describer-for-ai subagent	Incident report creation
Coder #1	coder subagent	Workflow safeguard implementation
Coder #2	coder subagent	Incident response runbook creation
Coder #3	coder subagent	CLAUDE.md updates
Tester	tester subagent	Comprehensive validation
Archivist	Lead Technical Archivist	Final documentation

Revision History

Date	Author	Change
2026-02-17	Lead Technical Archivist	Initial session summary
