flyer-crawler.projectium.com/docs/operations/INCIDENT-2026-02-17-PM2-PROCESS-KILL.md
Torben Sorensen c059b30201
PM2 Process Isolation
2026-02-17 20:49:01 -08:00


Incident Report: PM2 Process Kill During v0.15.0 Deployment

Date: 2026-02-17
Severity: Critical
Status: Mitigated - Safeguards Implemented
Affected Systems: All PM2-managed applications on projectium.com server


Resolution Summary

Safeguards implemented on 2026-02-17 to prevent recurrence:

  1. Workflow metadata logging (audit trail)
  2. Pre-cleanup PM2 state logging (forensics)
  3. Process count validation with SAFETY ABORT (automatic prevention)
  4. Explicit name verification (visibility)
  5. Post-cleanup verification (environment isolation check)
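
Safeguards 3 and 4 can be pictured roughly as follows. This is a minimal sketch, not the code that was added to the workflows: the function name assertSafeCleanup, its parameters, and the error wording are hypothetical, and the entries are assumed to have the same shape as the list items used in the workflow scripts (name, pm2_env.pm_id).

```javascript
// Hypothetical guard: run after targets are selected, before any `pm2 delete`.
// `targets` entries are assumed to look like { name, pm2_env: { pm_id } }.
function assertSafeCleanup(targets, allowedNames, maxExpected = allowedNames.length) {
  // Explicit name verification: every target must be on the whitelist.
  const unexpected = targets
    .map((p) => p.name)
    .filter((name) => !allowedNames.includes(name));
  if (unexpected.length > 0) {
    throw new Error('SAFETY ABORT: unexpected targets: ' + unexpected.join(', '));
  }

  // Process count validation: refuse to delete more processes than expected.
  if (targets.length > maxExpected) {
    throw new Error(
      'SAFETY ABORT: ' + targets.length + ' targets exceeds expected maximum of ' + maxExpected
    );
  }

  // Safe to proceed: return the PM2 ids the caller may delete.
  return targets.map((p) => p.pm2_env.pm_id);
}
```

The guard is deliberately separate from the selection logic, so a bug in the filter still has to pass an independent check before anything is deleted.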

Documentation created:


Summary

During the v0.15.0 production deployment, ALL PM2 processes on the server were terminated, not just the flyer-crawler processes. This caused unplanned downtime for other applications, including stock-alert.

Timeline

| Time (Approx)         | Event                                                          |
| --------------------- | -------------------------------------------------------------- |
| 2026-02-17 ~07:40 UTC | v0.15.0 production deployment triggered via deploy-to-prod.yml |
| Unknown               | All PM2 processes killed (flyer-crawler AND other apps)        |
| Unknown               | Incident discovered - stock-alert down                         |
| 2026-02-17            | Investigation initiated                                        |
| 2026-02-17            | Defense-in-depth safeguards implemented in all workflows       |
| 2026-02-17            | Incident response runbook created                              |
| 2026-02-17            | Status changed to Mitigated                                    |

Impact

  • Affected Applications: All PM2-managed processes on projectium.com
    • flyer-crawler-api, flyer-crawler-worker, flyer-crawler-analytics-worker (expected)
    • stock-alert (NOT expected - collateral damage)
    • Potentially other unidentified applications
  • Downtime Duration: TBD
  • User Impact: Service unavailability for all affected applications

Investigation Findings

Deployment Workflow Analysis

All deployment workflows were reviewed for PM2 process isolation:

| Workflow                | PM2 Isolation  | Implementation                                                                              |
| ----------------------- | -------------- | ------------------------------------------------------------------------------------------- |
| deploy-to-prod.yml      | Whitelist      | prodProcesses = ['flyer-crawler-api', 'flyer-crawler-worker', 'flyer-crawler-analytics-worker'] |
| deploy-to-test.yml      | Pattern        | p.name.endsWith('-test')                                                                    |
| manual-deploy-major.yml | Whitelist      | Same as deploy-to-prod                                                                      |
| manual-db-restore.yml   | Explicit names | pm2 stop flyer-crawler-api flyer-crawler-worker flyer-crawler-analytics-worker              |

Fix Commit Already In Place

The PM2 process isolation fix was implemented in commit b6a62a0 (2026-02-13):

commit b6a62a036f39ac895271402a61e5cc4227369de7
Author: Torben Sorensen <torben.sorensen@gmail.com>
Date:   Fri Feb 13 10:19:28 2026 -0800

    be specific about pm2 processes

Files modified:
 .gitea/workflows/deploy-to-prod.yml
 .gitea/workflows/deploy-to-test.yml
 .gitea/workflows/manual-db-restore.yml
 .gitea/workflows/manual-deploy-major.yml
 CLAUDE.md

v0.15.0 Release Contains Fix

Confirmed: v0.15.0 (commit 93ad624, 2026-02-18) includes the fix commit:

93ad624 ci: Bump version to 0.15.0 for production release [skip ci]
...
b6a62a0 be specific about pm2 processes  <-- Fix commit included

Current Workflow PM2 Commands

Production Deploy (deploy-to-prod.yml line 170):

const prodProcesses = [
  'flyer-crawler-api',
  'flyer-crawler-worker',
  'flyer-crawler-analytics-worker',
];
list.forEach((p) => {
  if (
    (p.pm2_env.status === 'errored' || p.pm2_env.status === 'stopped') &&
    prodProcesses.includes(p.name)
  ) {
    exec('pm2 delete ' + p.pm2_env.pm_id);
  }
});

Test Deploy (deploy-to-test.yml line 100):

list.forEach((p) => {
  if (p.name && p.name.endsWith('-test')) {
    exec('pm2 delete ' + p.pm2_env.pm_id);
  }
});

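Both implementations filter by process name and should NOT affect non-flyer-crawler processes. The production whitelist can only match the three listed names. The test pattern matches any process whose name ends in -test, so it is safe only as long as no other application on the server uses that suffix.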
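
To check the filtering against a concrete case, the two conditions can be restated as pure functions and run on a sample process list. This is a sketch only: the function names and the sample data are hypothetical (including the process name flyer-crawler-api-test), and the objects contain only the fields the workflow conditions read.

```javascript
const PROD_PROCESSES = [
  'flyer-crawler-api',
  'flyer-crawler-worker',
  'flyer-crawler-analytics-worker',
];

// Same condition as the production excerpt: whitelisted name AND errored/stopped.
function selectProdTargets(list) {
  return list.filter(
    (p) =>
      (p.pm2_env.status === 'errored' || p.pm2_env.status === 'stopped') &&
      PROD_PROCESSES.includes(p.name)
  );
}

// Same condition as the test excerpt: any name ending in '-test'.
function selectTestTargets(list) {
  return list.filter((p) => p.name && p.name.endsWith('-test'));
}

// Hypothetical snapshot of a shared server (not real data from the incident).
const sample = [
  { name: 'flyer-crawler-api', pm2_env: { pm_id: 0, status: 'stopped' } },
  { name: 'flyer-crawler-worker', pm2_env: { pm_id: 1, status: 'online' } },
  { name: 'flyer-crawler-api-test', pm2_env: { pm_id: 2, status: 'online' } },
  { name: 'stock-alert', pm2_env: { pm_id: 3, status: 'online' } },
];

// Production filter selects only the stopped flyer-crawler-api process.
console.log(selectProdTargets(sample).map((p) => p.name));
// Test filter selects only flyer-crawler-api-test; stock-alert is never selected.
console.log(selectTestTargets(sample).map((p) => p.name));
```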


Discrepancy Analysis

Key Mystery

If the fixes are in place, why did ALL processes get killed?

Possible Explanations

1. Workflow Version Mismatch (HIGH PROBABILITY)

Hypothesis: Gitea runner cached an older version of the workflow file.

  • Gitea Actions may cache workflow definitions
  • The runner might have executed an older version without the fix
  • Need to verify: What version of deploy-to-prod.yml actually executed?

Investigation Required:

  • Check Gitea workflow execution logs for actual script content
  • Verify runner workflow caching behavior
  • Compare executed workflow vs repository version
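
Once the executed script text has been recovered from the logs, the comparison against the repository version could be done along these lines. This is a rough sketch: firstWorkflowDifference is a hypothetical helper, and it assumes both versions are available as plain strings.

```javascript
// Returns null when both texts match after normalising line endings and
// trailing whitespace; otherwise returns the first differing line (1-based).
function firstWorkflowDifference(executedText, repositoryText) {
  const normalise = (text) =>
    text
      .replace(/\r\n/g, '\n')
      .split('\n')
      .map((line) => line.trimEnd());

  const executed = normalise(executedText);
  const repository = normalise(repositoryText);
  const length = Math.max(executed.length, repository.length);

  for (let i = 0; i < length; i += 1) {
    if (executed[i] !== repository[i]) {
      return {
        line: i + 1,
        executed: executed[i] ?? null, // null when one side is shorter
        repository: repository[i] ?? null,
      };
    }
  }
  return null;
}
```

A null result would argue against the version-mismatch hypothesis; any other result points at the exact line to inspect.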

2. Concurrent Workflow Execution (MEDIUM PROBABILITY)

Hypothesis: Another workflow ran simultaneously with destructive PM2 commands.

Workflows considered:

  • manual-db-reset-prod.yml - Does NOT restart PM2 (schema reset only)
  • manual-redis-flush-prod.yml - Does NOT touch PM2
  • A test deployment running concurrently with the prod deployment

Investigation Required:

  • Check Gitea Actions history for concurrent workflow runs
  • Review timestamps of all workflow executions on 2026-02-17
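
With the run history exported, overlapping executions can be found mechanically. The sketch below assumes each run is available as an object with workflow, startedAt, and finishedAt fields (ISO 8601 strings); these field names and the function name are hypothetical and do not reflect any particular Gitea export format.

```javascript
// Returns pairs of workflow names whose execution windows overlap.
function findOverlappingRuns(runs) {
  const sorted = [...runs].sort(
    (a, b) => Date.parse(a.startedAt) - Date.parse(b.startedAt)
  );
  const overlaps = [];

  for (let i = 0; i < sorted.length; i += 1) {
    const end = Date.parse(sorted[i].finishedAt);
    for (let j = i + 1; j < sorted.length; j += 1) {
      // Runs are sorted by start time, so once one starts at or after `end`,
      // no later run can overlap with run i either.
      if (Date.parse(sorted[j].startedAt) >= end) break;
      overlaps.push([sorted[i].workflow, sorted[j].workflow]);
    }
  }
  return overlaps;
}
```

Any pair that includes deploy-to-prod.yml would be the first place to look when testing this hypothesis.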

3. Manual SSH Command (MEDIUM PROBABILITY)

Hypothesis: Someone SSH'd to the server and ran pm2 stop all or pm2 delete all manually.

Investigation Required:

  • Check server shell history (if available)
  • Review any maintenance windows or manual interventions
  • Ask team members about manual actions

4. PM2 Internal Issue (LOW PROBABILITY)

Hypothesis: PM2 daemon crash or corruption caused all processes to stop.

Investigation Required:

  • Check PM2 daemon logs on server
  • Look for OOM killer events in system logs
  • Check disk space issues during deployment

5. Script Execution Error (LOW PROBABILITY)

Hypothesis: JavaScript parsing error caused the filtering logic to be bypassed.

Investigation Required:

  • Review workflow execution logs for JavaScript errors
  • Test the inline Node.js scripts locally
  • Check for shell escaping issues

Documentation/Code Gaps Identified

CLAUDE.md Documentation

The PM2 isolation rules are documented in CLAUDE.md, but:

  • Documentation uses pm2 restart all in the Quick Reference table (for dev container - acceptable)
  • Multiple docs still reference pm2 restart all without environment context
  • No incident response runbook for PM2 issues

Workflow Gaps

  1. No Workflow Audit Trail: No logging of which exact workflow version executed
  2. No Pre-deployment Verification: Workflows don't log PM2 state before modifications
  3. No Cross-Application Impact Assessment: No mechanism to detect/warn about other apps
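
Gap 3 could be narrowed by comparing PM2 snapshots taken before and after a deployment. The sketch below is one possible shape for such a check, assuming both snapshots are arrays with the same structure as the list used in the workflow scripts; findCollateralDamage is a hypothetical name.

```javascript
// Returns the names of non-target processes that were online before the
// deployment and are no longer online afterwards.
function findCollateralDamage(before, after, targetNames) {
  const onlineAfter = new Set(
    after.filter((p) => p.pm2_env.status === 'online').map((p) => p.name)
  );

  return before
    .filter((p) => p.pm2_env.status === 'online') // was healthy beforehand
    .filter((p) => !targetNames.includes(p.name)) // belongs to another app
    .filter((p) => !onlineAfter.has(p.name)) // missing or not online now
    .map((p) => p.name);
}
```

A non-empty result would be grounds for failing the workflow run or raising an alert.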

Next Steps for Root Cause Analysis

Immediate (Priority 1)

  1. Retrieve Gitea Actions execution logs for v0.15.0 deployment
  2. Extract actual executed workflow content from logs
  3. Check for concurrent workflow executions on 2026-02-17
  4. Review server PM2 daemon logs around incident time

Short-term (Priority 2)

  1. Implement pre-deployment PM2 state logging in workflows
  2. Add workflow version hash logging for audit trail
  3. Create incident response runbook for PM2/deployment issues
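
For item 1, the state log could be as simple as one line per process, written before any PM2 command runs. A minimal sketch, assuming the same list shape as the workflow scripts; the function name and the log format are hypothetical.

```javascript
// Formats a PM2 process list as plain log lines for the workflow output.
function formatPm2State(list, label) {
  const header = '[' + label + '] ' + list.length + ' PM2 process(es)';
  const rows = list.map(
    (p) => '  id=' + p.pm2_env.pm_id + ' name=' + p.name + ' status=' + p.pm2_env.status
  );
  return [header, ...rows].join('\n');
}
```

Logging the same format again after cleanup would make before/after differences easy to spot in the run output.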

Long-term (Priority 3)

  1. Evaluate PM2 namespacing for complete process isolation
  2. Consider separate PM2 daemon per application
  3. Implement deployment monitoring/alerting
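
As a starting point for evaluating item 1, cleanup could select processes by namespace instead of by name. The sketch below assumes each list entry exposes its namespace at pm2_env.namespace, which should be verified against the installed PM2 version; the namespace value flyer-crawler and the function name are hypothetical.

```javascript
// Selects only the processes that belong to the given PM2 namespace.
// Assumption: the namespace is reported at p.pm2_env.namespace.
function selectByNamespace(list, namespace) {
  return list.filter((p) => p.pm2_env && p.pm2_env.namespace === namespace);
}

// Hypothetical snapshot: other applications live in a different namespace.
const snapshot = [
  { name: 'flyer-crawler-api', pm2_env: { pm_id: 0, namespace: 'flyer-crawler' } },
  { name: 'flyer-crawler-worker', pm2_env: { pm_id: 1, namespace: 'flyer-crawler' } },
  { name: 'stock-alert', pm2_env: { pm_id: 2, namespace: 'default' } },
];

// Only the two flyer-crawler processes are selected.
console.log(selectByNamespace(snapshot, 'flyer-crawler').map((p) => p.name));
```

Compared with name whitelists, this would keep the isolation rule in one place (the process definition) rather than repeating it in every workflow.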


Appendix: Commit Timeline

93ad624 ci: Bump version to 0.15.0 for production release [skip ci]  <-- v0.15.0 release
7dd4f21 ci: Bump version to 0.14.4 [skip ci]
174b637 even more typescript fixes
4f80baf ci: Bump version to 0.14.3 [skip ci]
8450b5e Generate TSOA Spec and Routes
e4d830a ci: Bump version to 0.14.2 [skip ci]
b6a62a0 be specific about pm2 processes                               <-- PM2 fix commit
2d2cd52 Massive Dependency Modernization Project

Revision History

| Date       | Author             | Change                  |
| ---------- | ------------------ | ----------------------- |
| 2026-02-17 | Investigation Team | Initial incident report |