PM2 Incident Response Runbook

Purpose: Step-by-step procedures for responding to PM2 process isolation incidents on the projectium.com server.

Audience: On-call responders, system administrators, developers with server access.

Last updated: 2026-02-17

Related documentation:


Table of Contents

  1. Quick Reference
  2. Detection
  3. Initial Assessment
  4. Immediate Response
  5. Process Restoration
  6. Root Cause Investigation
  7. Communication Templates
  8. Prevention Measures
  9. Contact Information
  10. Post-Incident Review

Quick Reference

PM2 Process Inventory

| Application | Environment | Process Names | Config File | Directory |
| ----------- | ----------- | ------------- | ----------- | --------- |
| Flyer Crawler | Production | flyer-crawler-api, flyer-crawler-worker, flyer-crawler-analytics-worker | ecosystem.config.cjs | /var/www/flyer-crawler.projectium.com |
| Flyer Crawler | Test | flyer-crawler-api-test, flyer-crawler-worker-test, flyer-crawler-analytics-worker-test | ecosystem-test.config.cjs | /var/www/flyer-crawler-test.projectium.com |
| Stock Alert | Production | stock-alert-* (varies) | (varies) | /var/www/stock-alert.projectium.com |

Critical Commands

# Check PM2 status
pm2 list

# Check specific process
pm2 show flyer-crawler-api

# View recent logs
pm2 logs --lines 50

# Restart specific processes (SAFE)
pm2 restart flyer-crawler-api flyer-crawler-worker flyer-crawler-analytics-worker

# DO NOT USE (affects ALL apps)
# pm2 restart all    <-- DANGEROUS
# pm2 stop all       <-- DANGEROUS
# pm2 delete all     <-- DANGEROUS

Severity Classification

| Severity | Criteria | Response Time | Example |
| -------- | -------- | ------------- | ------- |
| P1 - Critical | Multiple applications down, production impact | Immediate (< 5 min) | All PM2 processes killed |
| P2 - High | Single application down, production impact | < 15 min | Flyer Crawler prod down, Stock Alert unaffected |
| P3 - Medium | Test environment only, no production impact | < 1 hour | Test processes killed, production unaffected |

Detection

How to Identify a PM2 Incident

Automated Indicators:

  • Health check failures on /api/health/ready
  • Monitoring alerts (UptimeRobot, etc.)
  • Bugsink showing connection errors
  • NGINX returning 502 Bad Gateway

User-Reported Symptoms:

  • "The site is down"
  • "I can't log in"
  • "Pages are loading slowly then timing out"
  • "I see a 502 error"

Manual Discovery:

# SSH to server
ssh gitea-runner@projectium.com

# Check if PM2 is running
pm2 list

# Expected output shows processes
# If empty or all errored = incident
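
Detection can also be automated with a lightweight watchdog run from cron. The following is a minimal sketch, assuming the production process names from the Quick Reference inventory; the echo lines are placeholders for a real alerting integration.

#!/usr/bin/env bash
# pm2-watchdog.sh - cron-friendly detection sketch
EXPECTED="flyer-crawler-api flyer-crawler-worker flyer-crawler-analytics-worker"

if ! json=$(pm2 jlist 2>/dev/null); then
  echo "ALERT: PM2 daemon is not responding" >&2
  exit 1
fi

for name in $EXPECTED; do
  status=$(echo "$json" | jq -r --arg n "$name" '.[] | select(.name == $n) | .pm2_env.status')
  if [ "$status" != "online" ]; then
    echo "ALERT: $name is '${status:-missing}'" >&2
  fi
done

# Cross-check the public health endpoint as well
if ! curl -sf https://flyer-crawler.projectium.com/api/health/ready > /dev/null; then
  echo "ALERT: production /api/health/ready is failing" >&2
fi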

Incident Signature: Process Isolation Violation

When a PM2 incident is caused by process isolation failure, you will see:

# Expected state (normal):
+-----------------------------------+----+-----+---------+-------+
| App name                          | id |mode | status  | cpu   |
+-----------------------------------+----+-----+---------+-------+
| flyer-crawler-api                 | 0  |clust| online  | 0%    |
| flyer-crawler-worker              | 1  |fork | online  | 0%    |
| flyer-crawler-analytics-worker    | 2  |fork | online  | 0%    |
| flyer-crawler-api-test            | 3  |fork | online  | 0%    |
| flyer-crawler-worker-test         | 4  |fork | online  | 0%    |
| flyer-crawler-analytics-worker-test| 5 |fork | online  | 0%    |
| stock-alert-api                   | 6  |fork | online  | 0%    |
+-----------------------------------+----+-----+---------+-------+

# Incident state (isolation violation):
# All processes missing or errored - not just one app
+-----------------------------------+----+-----+---------+-------+
| App name                          | id |mode | status  | cpu   |
+-----------------------------------+----+-----+---------+-------+
# (empty or all processes errored/stopped)
+-----------------------------------+----+-----+---------+-------+

Initial Assessment

Step 1: Gather Information (2 minutes)

Run these commands and capture output:

# 1. Check PM2 status
pm2 list

# 2. Check PM2 daemon status
pm2 ping

# 3. Check recent PM2 logs
pm2 logs --lines 20 --nostream

# 4. Check system status
systemctl status pm2-gitea-runner --no-pager

# 5. Check disk space
df -h /

# 6. Check memory
free -h

# 7. Check recent deployments (in app directory)
cd /var/www/flyer-crawler.projectium.com
git log --oneline -5

Step 2: Determine Scope

| Question | Command | Result |
| -------- | ------- | ------ |
| How many apps affected? | pm2 list | Count missing/errored processes |
| Is production down? | curl https://flyer-crawler.projectium.com/api/health/ping | Yes/No |
| Is test down? | curl https://flyer-crawler-test.projectium.com/api/health/ping | Yes/No |
| Are other apps affected? | pm2 list \| grep stock-alert | Yes/No |

Step 3: Classify Severity

Decision Tree:

Production app(s) down?
    |
    +-- YES: Multiple apps affected?
    |       |
    |       +-- YES --> P1 CRITICAL (all apps down)
    |       |
    |       +-- NO --> P2 HIGH (single app down)
    |
    +-- NO: Test environment only?
            |
            +-- YES --> P3 MEDIUM
            |
            +-- NO --> Investigate further
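
The decision tree can be scripted for faster triage. A minimal sketch, using the health endpoints from Step 2 and inferring Stock Alert status from pm2 jlist; the exact checks are assumptions to adapt:

#!/usr/bin/env bash
# triage.sh - suggest a severity per the decision tree (sketch)
prod_up=false; test_up=false; others_up=false

curl -sf https://flyer-crawler.projectium.com/api/health/ping > /dev/null && prod_up=true
curl -sf https://flyer-crawler-test.projectium.com/api/health/ping > /dev/null && test_up=true
pm2 jlist | jq -e '[.[] | select(.name | startswith("stock-alert"))
  | select(.pm2_env.status == "online")] | length > 0' > /dev/null && others_up=true

if ! $prod_up && ! $others_up; then
  echo "P1 CRITICAL: multiple production apps down"
elif ! $prod_up || ! $others_up; then
  echo "P2 HIGH: single production app down"
elif ! $test_up; then
  echo "P3 MEDIUM: test environment only"
else
  echo "No outage detected by these checks - investigate further"
fi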

Step 4: Document Initial State

Capture this information before making any changes:

# Save PM2 state to file
pm2 jlist > /tmp/pm2-incident-$(date +%Y%m%d-%H%M%S).json

# Save system state
{
  echo "=== PM2 List ==="
  pm2 list
  echo ""
  echo "=== Disk Space ==="
  df -h
  echo ""
  echo "=== Memory ==="
  free -h
  echo ""
  echo "=== Recent Git Commits ==="
  cd /var/www/flyer-crawler.projectium.com && git log --oneline -5
} > /tmp/incident-state-$(date +%Y%m%d-%H%M%S).txt

Immediate Response

Priority 1: Stop Ongoing Deployments

If a deployment is currently running:

  1. Check Gitea Actions for running workflows
  2. Cancel any in-progress deployment workflows
  3. Do NOT start new deployments until the incident is resolved

Priority 2: Assess Which Processes Are Down

# Get list of processes and their status
pm2 list

# Check which processes exist but are errored/stopped
pm2 jlist | jq '.[] | {name, status: .pm2_env.status}'
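
To surface only the processes that need attention, a jq filter along these lines may help (field names per standard pm2 jlist output):

# List only processes that are not online
pm2 jlist | jq -r '.[] | select(.pm2_env.status != "online") | "\(.name): \(.pm2_env.status)"'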

Priority 3: Establish Order of Restoration

Restore in this order (production first, critical path first):

| Priority | Process | Rationale |
| -------- | ------- | --------- |
| 1 | flyer-crawler-api | Production API - highest user impact |
| 2 | flyer-crawler-worker | Production background jobs |
| 3 | flyer-crawler-analytics-worker | Production analytics |
| 4 | stock-alert-* | Other production apps |
| 5 | flyer-crawler-api-test | Test environment |
| 6 | flyer-crawler-worker-test | Test background jobs |
| 7 | flyer-crawler-analytics-worker-test | Test analytics |

Process Restoration

Scenario A: Flyer Crawler Production Processes Missing

# Navigate to production directory
cd /var/www/flyer-crawler.projectium.com

# Start production processes
pm2 start ecosystem.config.cjs

# Verify processes started
pm2 list

# Check health endpoint
curl -s http://localhost:3001/api/health/ready | jq .

Scenario B: Flyer Crawler Test Processes Missing

# Navigate to test directory
cd /var/www/flyer-crawler-test.projectium.com

# Start test processes
pm2 start ecosystem-test.config.cjs

# Verify processes started
pm2 list

# Check health endpoint
curl -s http://localhost:3002/api/health/ready | jq .

Scenario C: Stock Alert Processes Missing

# Navigate to stock-alert directory
cd /var/www/stock-alert.projectium.com

# Start processes (adjust config file name as needed)
pm2 start ecosystem.config.cjs

# Verify processes started
pm2 list

Scenario D: All Processes Missing

Execute restoration in priority order:

# 1. Flyer Crawler Production (highest priority)
cd /var/www/flyer-crawler.projectium.com
pm2 start ecosystem.config.cjs

# Verify production is healthy before continuing
curl -s http://localhost:3001/api/health/ready | jq '.data.status'
# Should return "healthy"

# 2. Stock Alert Production
cd /var/www/stock-alert.projectium.com
pm2 start ecosystem.config.cjs

# 3. Flyer Crawler Test (lower priority)
cd /var/www/flyer-crawler-test.projectium.com
pm2 start ecosystem-test.config.cjs

# 4. Save PM2 process list
pm2 save

# 5. Final verification
pm2 list

Health Check Verification

After restoration, verify each application:

Flyer Crawler Production:

# API health
curl -s https://flyer-crawler.projectium.com/api/health/ready | jq '.data.status'
# Expected: "healthy"

# Check all services
curl -s https://flyer-crawler.projectium.com/api/health/ready | jq '.data.services'

Flyer Crawler Test:

curl -s https://flyer-crawler-test.projectium.com/api/health/ready | jq '.data.status'

Stock Alert:

# Adjust URL as appropriate for stock-alert
curl -s https://stock-alert.projectium.com/api/health/ready | jq '.data.status'

Verification Checklist

After restoration, confirm:

  • pm2 list shows all expected processes as online
  • Production health check returns healthy
  • Test health check returns healthy (if applicable)
  • No processes showing a high restart count (see the combined check below)
  • No processes showing errored or stopped status
  • PM2 process list saved: pm2 save
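
The status and restart-count checks can be combined in a single pass. A sketch; the threshold of 5 restarts is an arbitrary illustration, not an established limit:

# Flag processes with unexpected status or a high restart count
pm2 jlist | jq -r '.[] | select(.pm2_env.status != "online" or .pm2_env.restart_time > 5)
  | "\(.name): status=\(.pm2_env.status) restarts=\(.pm2_env.restart_time)"'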

Root Cause Investigation

Step 1: Check Workflow Execution Logs

# Find recent Gitea Actions runs
# (Access via Gitea web UI: Repository > Actions > Recent Runs)

# Look for these workflows:
# - deploy-to-prod.yml
# - deploy-to-test.yml
# - manual-deploy-major.yml
# - manual-db-restore.yml

Step 2: Check PM2 Daemon Logs

# PM2 daemon logs
cat ~/.pm2/pm2.log | tail -100

# PM2 process-specific logs
ls -la ~/.pm2/logs/

# Recent API logs
tail -100 ~/.pm2/logs/flyer-crawler-api-out.log
tail -100 ~/.pm2/logs/flyer-crawler-api-error.log

Step 3: Check System Logs

# System journal for PM2 service
journalctl -u pm2-gitea-runner -n 100 --no-pager

# Kernel messages (OOM killer, etc.)
journalctl -k -n 50 --no-pager | grep -i "killed\|oom\|memory"

# Authentication logs (unauthorized access)
tail -50 /var/log/auth.log

Step 4: Git History Analysis

# Recent commits to deployment workflows
cd /var/www/flyer-crawler.projectium.com
git log --oneline -20 -- .gitea/workflows/

# Check what changed in PM2 configs
git log --oneline -10 -- ecosystem.config.cjs ecosystem-test.config.cjs

# Diff against last known good state
git diff <last-good-commit> -- .gitea/workflows/ ecosystem*.cjs

Step 5: Timing Correlation

Create a timeline:

| Time (UTC) | Event | Source |
|------------|-------|--------|
| XX:XX | Last successful health check | Monitoring |
| XX:XX | Deployment workflow started | Gitea Actions |
| XX:XX | First failed health check | Monitoring |
| XX:XX | Incident detected | User report / Alert |
| XX:XX | Investigation started | On-call |

Common Root Causes

| Root Cause | Evidence | Prevention |
| ---------- | -------- | ---------- |
| pm2 stop all in workflow | Workflow logs show "all" command | Use explicit process names |
| pm2 delete all in workflow | Empty PM2 list after deploy | Use whitelist-based deletion |
| OOM killer | journalctl -k shows "Killed process" | Increase memory limits |
| Disk space exhaustion | df -h shows 100% | Log rotation, cleanup |
| Manual intervention | Shell history shows pm2 commands | Document all manual actions |
| Concurrent deployments | Multiple workflows at same time | Implement deployment locks (see sketch below) |
| Workflow caching issue | Old workflow version executed | Force workflow refresh |
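
For the deployment-lock prevention in the table above, one common approach is a flock-based mutex around the deploy steps. A minimal sketch (the lock file path is an arbitrary choice):

#!/usr/bin/env bash
# deploy-with-lock.sh - serialize deployments with flock (sketch)
set -euo pipefail

LOCKFILE=/tmp/flyer-crawler-deploy.lock

exec 200>"$LOCKFILE"
if ! flock -n 200; then
  echo "Another deployment is in progress; aborting." >&2
  exit 1
fi

# ... deployment steps run here while the lock is held ...
# The lock is released automatically when the script exits.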

Communication Templates

Incident Notification (Internal)

Subject: [P1 INCIDENT] PM2 Process Isolation Failure - Multiple Apps Down

Status: INVESTIGATING
Time Detected: YYYY-MM-DD HH:MM UTC
Affected Systems: [flyer-crawler-prod, stock-alert-prod, ...]

Summary:
All PM2 processes on projectium.com server were terminated unexpectedly.
Multiple production applications are currently down.

Impact:
- flyer-crawler.projectium.com: DOWN
- stock-alert.projectium.com: DOWN
- [other affected apps]

Current Actions:
- Restoring critical production processes
- Investigating root cause

Next Update: In 15 minutes or upon status change

Incident Commander: [Name]

Status Update Template

Subject: [P1 INCIDENT] PM2 Process Isolation Failure - UPDATE #N

Status: [INVESTIGATING | IDENTIFIED | RESTORING | RESOLVED]
Time: YYYY-MM-DD HH:MM UTC

Progress Since Last Update:
- [Action taken]
- [Discovery made]
- [Process restored]

Current State:
- flyer-crawler.projectium.com: [UP|DOWN]
- stock-alert.projectium.com: [UP|DOWN]

Root Cause: [If identified]

Next Steps:
- [Planned action]

ETA to Resolution: [If known]

Next Update: In [X] minutes

Resolution Notification

Subject: [RESOLVED] PM2 Process Isolation Failure

Status: RESOLVED
Time Resolved: YYYY-MM-DD HH:MM UTC
Total Downtime: X minutes

Summary:
All PM2 processes have been restored. Services are operating normally.

Root Cause:
[Brief description of what caused the incident]

Impact Summary:
- flyer-crawler.projectium.com: Down for X minutes
- stock-alert.projectium.com: Down for X minutes
- Estimated user impact: [description]

Immediate Actions Taken:
1. [Action]
2. [Action]

Follow-up Actions:
1. [ ] [Preventive measure] - Owner: [Name] - Due: [Date]
2. [ ] Post-incident review scheduled for [Date]

Post-Incident Review: [Link or scheduled time]

Prevention Measures

Pre-Deployment Checklist

Before triggering any deployment:

  • Review workflow file for PM2 commands
  • Confirm no pm2 stop all, pm2 delete all, or pm2 restart all (a grep check is sketched below)
  • Verify process names are explicitly listed
  • Check for concurrent deployment risks
  • Confirm recent workflow changes were reviewed
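
The blanket-command item can be checked mechanically. A sketch, assuming workflows live under .gitea/workflows/:

# Fail if any workflow contains a blanket PM2 command
if grep -rnE 'pm2 (stop|restart|delete) +all' .gitea/workflows/; then
  echo "Dangerous PM2 command found - fix before deploying" >&2
  exit 1
fi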

Workflow Review Checklist

When reviewing deployment workflow changes:

  • All PM2 stop commands use explicit process names
  • All PM2 delete commands filter by process name pattern
  • All PM2 restart commands use explicit process names
  • Test deployments filter by -test suffix
  • Production deployments use whitelist array

Safe Patterns:

// SAFE: Explicit process names (production)
// `list` is the parsed output of `pm2 jlist`
const { execSync } = require('child_process');
const list = JSON.parse(execSync('pm2 jlist').toString());

const prodProcesses = [
  'flyer-crawler-api',
  'flyer-crawler-worker',
  'flyer-crawler-analytics-worker',
];
list.forEach((p) => {
  // Delete only errored/stopped processes that are on the explicit whitelist
  if (
    (p.pm2_env.status === 'errored' || p.pm2_env.status === 'stopped') &&
    prodProcesses.includes(p.name)
  ) {
    execSync(`pm2 delete ${p.pm2_env.pm_id}`);
  }
});

// SAFE: Pattern-based filtering (test)
list.forEach((p) => {
  // Touch only processes whose name carries the -test suffix
  if (p.name && p.name.endsWith('-test')) {
    execSync(`pm2 delete ${p.pm2_env.pm_id}`);
  }
});

Dangerous Patterns (NEVER USE):

# DANGEROUS - affects ALL applications
pm2 stop all
pm2 delete all
pm2 restart all

# DANGEROUS - no name filtering
pm2 delete $(pm2 jlist | jq -r '.[] | select(.pm2_env.status == "errored") | .pm_id')
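
As an extra safety net for interactive sessions, a shell wrapper can refuse blanket commands outright. This is an illustrative sketch (not an existing setup) that could be added to the runner user's ~/.bashrc:

# Guard against accidental "pm2 <verb> all" in interactive shells
pm2() {
  case " $* " in
    *" stop all "* | *" restart all "* | *" delete all "*)
      echo "Refusing 'pm2 ... all' - use explicit process names (see runbook)." >&2
      return 1
      ;;
  esac
  command pm2 "$@"
}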

PM2 Configuration Validation

Before deploying PM2 config changes:

# Test configuration locally
cd /var/www/flyer-crawler.projectium.com
node -e "console.log(JSON.stringify(require('./ecosystem.config.cjs'), null, 2))"

# Verify process names
node -e "require('./ecosystem.config.cjs').apps.forEach(a => console.log(a.name))"

# Expected output should match documented process names
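
To make this check mechanical, the printed names can be diffed against the documented inventory. A sketch, assuming the production process names from the Quick Reference:

# Compare config process names against the documented inventory
expected="flyer-crawler-api
flyer-crawler-worker
flyer-crawler-analytics-worker"

actual=$(node -e "require('./ecosystem.config.cjs').apps.forEach(a => console.log(a.name))")

if [ "$actual" != "$expected" ]; then
  echo "Process names differ from the documented inventory:" >&2
  diff <(echo "$expected") <(echo "$actual") >&2 || true
fi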

Deployment Monitoring

After every deployment:

# Immediate verification
pm2 list

# Check no unexpected processes were affected
pm2 list | grep -v flyer-crawler
# Should still show other apps (e.g., stock-alert)

# Health check
curl -s https://flyer-crawler.projectium.com/api/health/ready | jq '.data.status'

Contact Information

On-Call Escalation

| Role | Contact | When to Escalate |
| ---- | ------- | ---------------- |
| Primary On-Call | [Name/Channel] | First responder |
| Secondary On-Call | [Name/Channel] | If primary unavailable after 10 min |
| Engineering Lead | [Name/Channel] | P1 incidents > 30 min |
| Product Owner | [Name/Channel] | User communication needed |

External Dependencies

| Service | Support Channel | When to Contact |
| ------- | --------------- | --------------- |
| Server Provider | [Contact info] | Hardware/network issues |
| DNS Provider | [Contact info] | DNS resolution failures |
| SSL Certificate | [Contact info] | Certificate issues |

Communication Channels

| Channel | Purpose |
| ------- | ------- |
| #incidents | Real-time incident updates |
| #deployments | Deployment announcements |
| #engineering | Technical discussion |
| Email list | Formal notifications |

Post-Incident Review

Incident Report Template

# Incident Report: [Title]

## Overview

| Field              | Value             |
| ------------------ | ----------------- |
| Date               | YYYY-MM-DD        |
| Duration           | X hours Y minutes |
| Severity           | P1/P2/P3          |
| Incident Commander | [Name]            |
| Status             | Resolved          |

## Timeline

| Time (UTC) | Event               |
| ---------- | ------------------- |
| HH:MM      | [Event description] |
| HH:MM      | [Event description] |

## Impact

- **Users affected**: [Number/description]
- **Revenue impact**: [If applicable]
- **SLA impact**: [If applicable]

## Root Cause

[Detailed technical explanation]

## Resolution

[What was done to resolve the incident]

## Contributing Factors

1. [Factor]
2. [Factor]

## Action Items

| Action   | Owner  | Due Date | Status |
| -------- | ------ | -------- | ------ |
| [Action] | [Name] | [Date]   | [ ]    |

## Lessons Learned

### What Went Well

- [Item]

### What Could Be Improved

- [Item]

## Appendix

- Link to monitoring data
- Link to relevant logs
- Link to workflow runs

Lessons Learned Format

Use the "5 Whys" technique:

Problem: All PM2 processes were killed during deployment

Why 1: The deployment workflow ran `pm2 delete all`
Why 2: The workflow used an outdated version of the script
Why 3: Gitea runner cached the old workflow file
Why 4: No mechanism to verify workflow version before execution
Why 5: Workflow versioning and audit trail not implemented

Root Cause: Lack of workflow versioning and execution verification

Preventive Measure: Implement workflow hash logging and pre-execution verification

Action Items Tracking

Create Gitea issues for each action item:

# Illustrative example (GitHub CLI syntax shown; adapt to your Gitea tooling)
gh issue create --title "Implement PM2 state logging in deployment workflows" \
  --body "Related to incident YYYY-MM-DD. Add pre-deployment PM2 state capture." \
  --label "incident-follow-up,priority:high"
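
Since the repository is hosted on Gitea, the equivalent call against Gitea's REST API is POST /api/v1/repos/{owner}/{repo}/issues. A sketch with placeholder host, owner, repo, and token:

# Create a follow-up issue via the Gitea API (placeholders: host, OWNER, REPO, token)
curl -s -X POST "https://gitea.example.com/api/v1/repos/OWNER/REPO/issues" \
  -H "Authorization: token $GITEA_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"title": "Implement PM2 state logging in deployment workflows", "body": "Related to incident YYYY-MM-DD."}'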

Track action items in a central location:

| Issue # | Action | Owner | Due | Status |
| ------- | ------ | ----- | --- | ------ |
| #123 | Add PM2 state logging | [Name] | [Date] | Open |
| #124 | Implement workflow version hash | [Name] | [Date] | Open |
| #125 | Create deployment lock mechanism | [Name] | [Date] | Open |

Appendix: PM2 Command Reference

Safe Commands

# Status and monitoring
pm2 list
pm2 show <process-name>
pm2 monit
pm2 logs <process-name>

# Restart specific processes
pm2 restart flyer-crawler-api
pm2 restart flyer-crawler-api flyer-crawler-worker flyer-crawler-analytics-worker

# Reload (zero-downtime, cluster mode only)
pm2 reload flyer-crawler-api

# Start from config
pm2 start ecosystem.config.cjs
pm2 start ecosystem.config.cjs --only flyer-crawler-api

Dangerous Commands (Use With Caution)

# CAUTION: These affect ALL processes
pm2 stop all        # Stops every PM2 process
pm2 restart all     # Restarts every PM2 process
pm2 delete all      # Removes every PM2 process

# CAUTION: Modifies saved process list
pm2 save            # Overwrites saved process list
pm2 resurrect       # Restores from saved list

# CAUTION: Affects PM2 daemon
pm2 kill            # Kills PM2 daemon and all processes
pm2 update          # Updates PM2 in place (may cause brief outage)

Revision History

| Date | Author | Change |
| ---- | ------ | ------ |
| 2026-02-17 | Incident Response Team | Initial runbook creation |