PM2 Process Isolation Safeguards Project

Session Date: 2026-02-17
Status: Completed
Triggered By: Critical production incident during v0.15.0 deployment


Executive Summary

On 2026-02-17, a critical incident occurred during the v0.15.0 production deployment: ALL PM2 processes on the production server were killed, not just the flyer-crawler processes. This caused unplanned downtime for multiple applications, including stock-alert.projectium.com.

Despite PM2 process isolation fixes already being in place (commit b6a62a0), the incident still occurred. Investigation suggests the Gitea runner may have executed a cached/older version of the workflow files. In response, we implemented a comprehensive defense-in-depth strategy with 5 layers of safeguards across all deployment workflows.


Incident Background

What Happened

| Aspect | Detail |
| --- | --- |
| Date/Time | 2026-02-17 ~07:40 UTC |
| Trigger | v0.15.0 production deployment via deploy-to-prod.yml |
| Impact | ALL PM2 processes killed (all environments) |
| Collateral Damage | stock-alert.projectium.com and other PM2-managed apps |
| Severity | P1 - Critical |

Key Mystery

The PM2 process isolation fix was already implemented in commit b6a62a0 (2026-02-13) and was included in v0.15.0. The fix correctly used whitelist-based filtering:

const prodProcesses = [
  'flyer-crawler-api',
  'flyer-crawler-worker',
  'flyer-crawler-analytics-worker',
];
list.forEach((p) => {
  if (
    (p.pm2_env.status === 'errored' || p.pm2_env.status === 'stopped') &&
    prodProcesses.includes(p.name)
  ) {
    exec('pm2 delete ' + p.pm2_env.pm_id);
  }
});

Hypothesis: Gitea runner executed a cached older version of the workflow file that did not contain the fix.
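The whitelist filter quoted above can be separated from the `pm2 delete` call so that it is easy to exercise in isolation. The following is a minimal sketch, not the code from commit b6a62a0: the helper name `selectCleanupTargets` is hypothetical, and the sample entries (including the `stock-alert-api` and `flyer-crawler-api-test` names) are made up for illustration.

```javascript
// Names copied from the whitelist shown in this document.
const PROD_PROCESSES = [
  'flyer-crawler-api',
  'flyer-crawler-worker',
  'flyer-crawler-analytics-worker',
];

// Hypothetical helper: returns only whitelisted processes that are errored or stopped.
function selectCleanupTargets(list, allowedNames) {
  return list.filter(
    (p) =>
      allowedNames.includes(p.name) &&
      (p.pm2_env.status === 'errored' || p.pm2_env.status === 'stopped'),
  );
}

// Made-up sample using only the fields this document reads from `pm2 jlist`.
const sampleList = [
  { name: 'flyer-crawler-api', pm2_env: { status: 'errored', pm_id: 0 } },
  { name: 'flyer-crawler-worker', pm2_env: { status: 'online', pm_id: 1 } },
  { name: 'flyer-crawler-api-test', pm2_env: { status: 'errored', pm_id: 2 } },
  { name: 'stock-alert-api', pm2_env: { status: 'stopped', pm_id: 3 } },
];

const targets = selectCleanupTargets(sampleList, PROD_PROCESSES);
console.log(targets.map((p) => p.name)); // only 'flyer-crawler-api' is selected
```

Keeping the selection step as a pure function means the filter can be checked without a running PM2 daemon, which is the property the safeguard tests later in this document rely on.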


Solution: Defense-in-Depth Safeguards

Rather than relying solely on the filter logic (which may be correct but not executed), we implemented 5 layers of safeguards that provide visibility, validation, and automatic abort capabilities.

Safeguard Layers

| Layer | Name | Purpose |
| --- | --- | --- |
| 1 | Workflow Metadata Logging | Audit trail of which workflow version actually executed |
| 2 | Pre-Cleanup PM2 State Logging | Capture full process list before any modifications |
| 3 | Process Count Validation | SAFETY ABORT if filter would delete ALL processes |
| 4 | Explicit Name Verification | Log exactly which processes will be affected |
| 5 | Post-Cleanup Verification | Verify environment isolation after cleanup |

Layer Details

Layer 1: Workflow Metadata Logging

Logs at the start of deployment:

  • Workflow file name
  • SHA-256 hash of the workflow file
  • Git commit being deployed
  • Git branch
  • Timestamp (UTC)
  • Actor (who triggered the deployment)

Purpose: If an incident occurs, we can verify whether the executed workflow matches the repository version.

echo "=== WORKFLOW METADATA ==="
echo "Workflow file: deploy-to-prod.yml"
echo "Workflow file hash: $(sha256sum .gitea/workflows/deploy-to-prod.yml | cut -d' ' -f1)"
echo "Git commit: $(git rev-parse HEAD)"
echo "Timestamp: $(date -u '+%Y-%m-%d %H:%M:%S UTC')"
echo "Actor: ${{ gitea.actor }}"
echo "=== END METADATA ==="
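The logged hash is only useful if it can be compared against the repository version later. The same digest can be computed in Node.js, as in the sketch below; `sha256OfFile` and `workflowMatches` are hypothetical helper names, and the demo hashes a temporary file rather than a real workflow file.

```javascript
const crypto = require('crypto');
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper: hex SHA-256 of a file's bytes, the same value that
// `sha256sum <file> | cut -d' ' -f1` prints.
function sha256OfFile(filePath) {
  return crypto.createHash('sha256').update(fs.readFileSync(filePath)).digest('hex');
}

// Hypothetical check: does the file on disk match a hash recorded elsewhere?
function workflowMatches(filePath, expectedHash) {
  return sha256OfFile(filePath) === expectedHash.toLowerCase();
}

// Demo on a temporary file so the sketch runs anywhere.
const demoDir = fs.mkdtempSync(path.join(os.tmpdir(), 'workflow-hash-'));
const demoFile = path.join(demoDir, 'demo.yml');
fs.writeFileSync(demoFile, 'abc');
const demoHash = sha256OfFile(demoFile);
const demoMatches = workflowMatches(demoFile, demoHash.toUpperCase());
fs.rmSync(demoDir, { recursive: true, force: true });

console.log(demoHash);
console.log(demoMatches);
```

Comparing the value logged by the runner with the hash of the file at the deployed commit would show directly whether a stale workflow was executed.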

Layer 2: Pre-Cleanup PM2 State Logging

Captures full PM2 process list in JSON format before any modifications.

Purpose: Provides forensic evidence of what processes existed before cleanup began.

echo "=== PRE-CLEANUP PM2 STATE ==="
pm2 jlist
echo "=== END PRE-CLEANUP STATE ==="

Layer 3: Process Count Validation (SAFETY ABORT)

The most critical safeguard. Aborts the entire deployment if the filter would delete ALL processes and there are more than 3 processes total.

Purpose: Catches filter bugs or unexpected conditions that would result in catastrophic process deletion.

// SAFEGUARD 1: Process count validation
const totalProcesses = list.length;
if (targetProcesses.length === totalProcesses && totalProcesses > 3) {
  console.error('SAFETY ABORT: Filter would delete ALL processes!');
  console.error(
    'Total processes: ' + totalProcesses + ', Target processes: ' + targetProcesses.length,
  );
  console.error('This indicates a potential filter bug. Aborting cleanup.');
  process.exit(1);
}

Threshold Rationale: The threshold of 3 allows normal operation when only the 3 expected processes exist (API, Worker, Analytics Worker) while catching anomalies when the server hosts more applications.
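The abort condition can be read as a small predicate. The sketch below restates the condition from the snippet above; the function name `shouldAbortCleanup` and the `threshold` parameter are illustrative and do not appear in the workflow files.

```javascript
// Hypothetical helper wrapping the condition used in the safeguard:
// abort only when every process is targeted AND the server hosts more than `threshold` processes.
function shouldAbortCleanup(totalProcesses, targetCount, threshold = 3) {
  return targetCount === totalProcesses && totalProcesses > threshold;
}

console.log(shouldAbortCleanup(3, 3)); // false: exactly at the threshold
console.log(shouldAbortCleanup(4, 4)); // true: just above the threshold
console.log(shouldAbortCleanup(15, 3)); // false: normal cleanup
console.log(shouldAbortCleanup(15, 14)); // false: not ALL processes are targeted
```

The last case shows a limit of this check as written: it fires only when the target count equals the total, so a filter that wrongly selects most but not all processes would pass. Layers 4 and 5 are what make that case visible.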

Layer 4: Explicit Name Verification

Logs the exact name, status, and PM2 ID of each process that will be deleted.

Purpose: Provides clear visibility into what the cleanup operation will actually do.

console.log('Found ' + targetProcesses.length + ' PRODUCTION processes to clean:');
targetProcesses.forEach((p) => {
  console.log(
    '  - ' + p.name + ' (status: ' + p.pm2_env.status + ', pm_id: ' + p.pm2_env.pm_id + ')',
  );
});

Layer 5: Post-Cleanup Verification

After cleanup, logs the state of processes by environment to verify isolation was maintained.

Purpose: Immediately identifies if the cleanup affected the wrong environment.

echo "=== POST-CLEANUP VERIFICATION ==="
pm2 jlist | node -e "
  const list = JSON.parse(require('fs').readFileSync(0, 'utf-8'));
  const prodProcesses = list.filter(p => p.name && p.name.startsWith('flyer-crawler-') && !p.name.endsWith('-test'));
  const testProcesses = list.filter(p => p.name && p.name.endsWith('-test'));
  console.log('Production processes after cleanup: ' + prodProcesses.length);
  console.log('Test processes (should be untouched): ' + testProcesses.length);
"
echo "=== END POST-CLEANUP VERIFICATION ==="
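The verification above counts any process whose name ends in `-test` as a test process, whichever application it belongs to. A slightly stricter variant, sketched below, first scopes to this application's prefix and reports everything else separately. This is an illustration rather than the workflow's code: `partitionByEnvironment` is a hypothetical name, it assumes test processes also start with `flyer-crawler-`, and the `stock-alert-*` names are invented.

```javascript
// Hypothetical helper: split a `pm2 jlist` array into this app's prod/test
// processes and everything else sharing the daemon.
function partitionByEnvironment(list, appPrefix = 'flyer-crawler-') {
  const own = list.filter((p) => typeof p.name === 'string' && p.name.startsWith(appPrefix));
  return {
    prod: own.filter((p) => !p.name.endsWith('-test')),
    test: own.filter((p) => p.name.endsWith('-test')),
    other: list.filter((p) => !own.includes(p)),
  };
}

// Made-up sample; only the `name` field is needed here.
const sample = [
  { name: 'flyer-crawler-api' },
  { name: 'flyer-crawler-worker' },
  { name: 'flyer-crawler-api-test' },
  { name: 'stock-alert-api' },
  { name: 'stock-alert-api-test' },
];

const groups = partitionByEnvironment(sample);
console.log('prod: ' + groups.prod.length);
console.log('test: ' + groups.test.length);
console.log('other apps: ' + groups.other.length);
```

Logging the `other` count before and after cleanup would make collateral damage to unrelated applications, such as the one seen in this incident, show up in the deployment log.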

Implementation Details

Files Modified

| File | Changes |
| --- | --- |
| .gitea/workflows/deploy-to-prod.yml | Added all 5 safeguard layers |
| .gitea/workflows/deploy-to-test.yml | Added all 5 safeguard layers |
| .gitea/workflows/manual-deploy-major.yml | Added all 5 safeguard layers |
| CLAUDE.md | Added PM2 Process Isolation Incidents section |

Files Created

| File | Purpose |
| --- | --- |
| docs/operations/INCIDENT-2026-02-17-PM2-PROCESS-KILL.md | Detailed incident report |
| docs/operations/PM2-INCIDENT-RESPONSE.md | Comprehensive incident response runbook |
| tests/qa/test-pm2-safeguard-logic.js | Validation tests for safeguard logic |

Testing and Validation

Test Artifact

A standalone JavaScript test file was created to validate the safeguard logic:

File: tests/qa/test-pm2-safeguard-logic.js

Test Categories:

  1. Normal Operations (should NOT abort)

    • 3 errored out of 15 processes
    • 1 errored out of 10 processes
    • 0 processes to clean
    • Fresh server with 3 processes (threshold boundary)

  2. Dangerous Operations (SHOULD abort)

    • All 10 processes targeted
    • All 15 processes targeted
    • All 4 processes targeted (just above threshold)

  3. Workflow-Specific Filter Tests

    • Production filter only matches production processes
    • Test filter only matches -test suffix processes
    • Filters don't cross-contaminate environments

Test Results

All 11 scenarios passed:

| Scenario | Total | Target | Expected | Result |
| --- | --- | --- | --- | --- |
| Normal prod cleanup | 15 | 3 | No abort | PASS |
| Normal test cleanup | 15 | 3 | No abort | PASS |
| Single process | 10 | 1 | No abort | PASS |
| No cleanup needed | 10 | 0 | No abort | PASS |
| Fresh server (threshold) | 3 | 3 | No abort | PASS |
| Minimal server | 2 | 2 | No abort | PASS |
| Empty PM2 | 0 | 0 | No abort | PASS |
| Filter bug - 10 processes | 10 | 10 | ABORT | PASS |
| Filter bug - 15 processes | 15 | 15 | ABORT | PASS |
| Filter bug - 4 processes | 4 | 4 | ABORT | PASS |
| Filter bug - 100 processes | 100 | 100 | ABORT | PASS |
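The scenarios above lend themselves to a table-driven check. The sketch below is not the contents of tests/qa/test-pm2-safeguard-logic.js; it is an illustrative re-creation that feeds the 11 rows of the results table into the abort condition from Layer 3, with `wouldAbort` as a hypothetical name.

```javascript
// Abort condition from Layer 3, restated so this block is self-contained.
const wouldAbort = (total, target) => target === total && total > 3;

// [scenario, total processes, targeted processes, expected abort], taken from the table above.
const scenarios = [
  ['Normal prod cleanup', 15, 3, false],
  ['Normal test cleanup', 15, 3, false],
  ['Single process', 10, 1, false],
  ['No cleanup needed', 10, 0, false],
  ['Fresh server (threshold)', 3, 3, false],
  ['Minimal server', 2, 2, false],
  ['Empty PM2', 0, 0, false],
  ['Filter bug - 10 processes', 10, 10, true],
  ['Filter bug - 15 processes', 15, 15, true],
  ['Filter bug - 4 processes', 4, 4, true],
  ['Filter bug - 100 processes', 100, 100, true],
];

let passed = 0;
for (const [name, total, target, expected] of scenarios) {
  const ok = wouldAbort(total, target) === expected;
  if (ok) passed += 1;
  console.log((ok ? 'PASS' : 'FAIL') + ' ' + name);
}
console.log(passed + '/' + scenarios.length + ' scenarios passed');
```

Adding a row is a one-line change, which makes it cheap to pin down new boundary cases if the threshold is ever adjusted.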

YAML Validation

All workflow files passed YAML syntax validation using python -c "import yaml; yaml.safe_load(open(...))". This check confirms only that the files parse as YAML; it does not exercise the workflow logic.


Documentation Updates

CLAUDE.md Updates

Added new section at line 293: PM2 Process Isolation Incidents

Contains:

  • Reference to the 2026-02-17 incident
  • Impact summary
  • Prevention measures list
  • Response instructions
  • Links to related documentation

docs/README.md

Added incident report reference under Operations > Incident Reports.

Cross-References Verified

| Document | Reference | Status |
| --- | --- | --- |
| CLAUDE.md | PM2-INCIDENT-RESPONSE.md | Valid |
| CLAUDE.md | INCIDENT-2026-02-17-PM2-PROCESS-KILL.md | Valid |
| Incident Report | CLAUDE.md PM2 section | Valid |
| Incident Report | PM2-INCIDENT-RESPONSE.md | Valid |
| docs/README.md | INCIDENT-2026-02-17-PM2-PROCESS-KILL.md | Valid |

Lessons Learned

Technical Lessons

  1. Filter logic alone is not sufficient - Even correct filters can be bypassed if an older version of the script is executed.

  2. Workflow caching is a real risk - CI/CD runners may cache workflow files, leading to stale versions being executed.

  3. Defense-in-depth is essential for destructive operations - Multiple layers of validation catch failures that single-point checks miss.

  4. Visibility enables diagnosis - Pre/post state logging makes root cause analysis possible.

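  5. Automatic abort prevents cascading failures - The process count validation could have prevented the incident, provided the workflow version containing it is the one the runner actually executes. A stale cached workflow would bypass this check as well, which is why the Layer 1 hash logging matters.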

Process Lessons

  1. Shared PM2 daemons are risky - Multiple applications sharing a PM2 daemon create cross-application dependencies.

  2. Documentation should include failure modes - CLAUDE.md now explicitly documents what can go wrong and how to respond.

  3. Runbooks save time during incidents - The incident response runbook provides step-by-step guidance when time is critical.


Future Considerations

Not Implemented (Potential Future Work)

  1. PM2 Namespacing - Use PM2's native namespace feature to completely isolate environments.

  2. Separate PM2 Daemons - Run one PM2 daemon per application to eliminate cross-application risk.

  3. Deployment Locks - Implement mutex-style locks to prevent concurrent deployments.

  4. Workflow Version Verification - Add a pre-flight check that compares workflow hash against expected value.

  5. Automated Rollback - Implement automatic process restoration if safeguards detect a problem.
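As an illustration of item 3, a deployment lock can be approximated with an exclusively created lock file. This is a sketch under stated assumptions, not something implemented in this session: `acquireLock` and `releaseLock` are hypothetical names, the lock lives in a temporary directory for the demo, and stale-lock recovery (for example after a crashed deployment) is deliberately left out.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper: take the lock by creating the file exclusively.
// The 'wx' flag makes openSync fail with EEXIST if the file already exists.
function acquireLock(lockPath) {
  try {
    const fd = fs.openSync(lockPath, 'wx');
    fs.writeSync(fd, String(process.pid)); // record the holder for debugging
    fs.closeSync(fd);
    return true;
  } catch (err) {
    if (err.code === 'EEXIST') return false; // another deployment holds the lock
    throw err;
  }
}

// Hypothetical helper: release the lock by removing the file.
function releaseLock(lockPath) {
  fs.rmSync(lockPath, { force: true });
}

// Demo in a fresh temporary directory.
const lockDir = fs.mkdtempSync(path.join(os.tmpdir(), 'deploy-lock-'));
const lockPath = path.join(lockDir, 'deploy.lock');
const first = acquireLock(lockPath); // lock is free
const second = acquireLock(lockPath); // lock is already held
releaseLock(lockPath);
const third = acquireLock(lockPath); // free again after release
releaseLock(lockPath);
fs.rmSync(lockDir, { recursive: true, force: true });

console.log(first, second, third);
```

A file lock like this only coordinates deployments that run on the same host and agree on the lock path; deployments triggered from different runners would need a shared location or a different mechanism.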



Appendix: Workflow Changes Summary

deploy-to-prod.yml

+ - name: Log Workflow Metadata
+   run: |
+       echo "=== WORKFLOW METADATA ==="
+       echo "Workflow file: deploy-to-prod.yml"
+       echo "Workflow file hash: $(sha256sum .gitea/workflows/deploy-to-prod.yml | cut -d' ' -f1)"
+       ...

  - name: Install Backend Dependencies and Restart Production Server
    run: |
+       # === PRE-CLEANUP PM2 STATE LOGGING ===
+       echo "=== PRE-CLEANUP PM2 STATE ==="
+       pm2 jlist
+       echo "=== END PRE-CLEANUP STATE ==="
+
        # --- Cleanup Errored Processes with Defense-in-Depth Safeguards ---
        node -e "
          ...
+         // SAFEGUARD 1: Process count validation
+         if (targetProcesses.length === totalProcesses && totalProcesses > 3) {
+           console.error('SAFETY ABORT: Filter would delete ALL processes!');
+           process.exit(1);
+         }
+
+         // SAFEGUARD 2: Explicit name verification
+         console.log('Found ' + targetProcesses.length + ' PRODUCTION processes to clean:');
+         targetProcesses.forEach(p => {
+           console.log('  - ' + p.name + ' (status: ' + p.pm2_env.status + ')');
+         });
          ...
        "
+
+       # === POST-CLEANUP VERIFICATION ===
+       echo "=== POST-CLEANUP VERIFICATION ==="
+       pm2 jlist | node -e "..."
+       echo "=== END POST-CLEANUP VERIFICATION ==="

Similar changes were applied to deploy-to-test.yml and manual-deploy-major.yml.


Session Participants

| Role | Agent Type | Responsibility |
| --- | --- | --- |
| Orchestrator | Main Claude | Session coordination and delegation |
| Planner | planner subagent | Incident analysis and solution design |
| Documenter | describer-for-ai subagent | Incident report creation |
| Coder #1 | coder subagent | Workflow safeguard implementation |
| Coder #2 | coder subagent | Incident response runbook creation |
| Coder #3 | coder subagent | CLAUDE.md updates |
| Tester | tester subagent | Comprehensive validation |
| Archivist | Lead Technical Archivist | Final documentation |

Revision History

| Date | Author | Change |
| --- | --- | --- |
| 2026-02-17 | Lead Technical Archivist | Initial session summary |