# PM2 Incident Response Runbook

**Purpose**: Step-by-step procedures for responding to PM2 process isolation incidents on the projectium.com server.

**Audience**: On-call responders, system administrators, developers with server access.

**Last updated**: 2026-02-17

**Related documentation**:

- [CLAUDE.md - PM2 Process Isolation Rules](../../CLAUDE.md)
- [Incident Report: 2026-02-17](INCIDENT-2026-02-17-PM2-PROCESS-KILL.md)
- [Monitoring Guide](MONITORING.md)
- [Deployment Guide](DEPLOYMENT.md)

---

## Table of Contents

1. [Quick Reference](#quick-reference)
2. [Detection](#detection)
3. [Initial Assessment](#initial-assessment)
4. [Immediate Response](#immediate-response)
5. [Process Restoration](#process-restoration)
6. [Root Cause Investigation](#root-cause-investigation)
7. [Communication Templates](#communication-templates)
8. [Prevention Measures](#prevention-measures)
9. [Contact Information](#contact-information)
10. [Post-Incident Review](#post-incident-review)

---

## Quick Reference

### PM2 Process Inventory

| Application   | Environment | Process Names                                                                                | Config File                 | Directory                                     |
| ------------- | ----------- | -------------------------------------------------------------------------------------------- | --------------------------- | --------------------------------------------- |
| Flyer Crawler | Production  | `flyer-crawler-api`, `flyer-crawler-worker`, `flyer-crawler-analytics-worker`                | `ecosystem.config.cjs`      | `/var/www/flyer-crawler.projectium.com`       |
| Flyer Crawler | Test        | `flyer-crawler-api-test`, `flyer-crawler-worker-test`, `flyer-crawler-analytics-worker-test` | `ecosystem-test.config.cjs` | `/var/www/flyer-crawler-test.projectium.com`  |
| Stock Alert   | Production  | `stock-alert-*`                                                                               | (varies)                    | `/var/www/stock-alert.projectium.com`         |

### Critical Commands

```bash
# Check PM2 status
pm2 list

# Check specific process
pm2 show flyer-crawler-api

# View recent logs
pm2 logs --lines 50

# Restart specific processes (SAFE)
pm2 restart flyer-crawler-api flyer-crawler-worker flyer-crawler-analytics-worker

# DO NOT USE (affects ALL apps)
# pm2 restart all   <-- DANGEROUS
# pm2 stop all      <-- DANGEROUS
# pm2 delete all    <-- DANGEROUS
```

### Severity Classification

| Severity          | Criteria                                      | Response Time       | Example                                         |
| ----------------- | --------------------------------------------- | ------------------- | ----------------------------------------------- |
| **P1 - Critical** | Multiple applications down, production impact | Immediate (< 5 min) | All PM2 processes killed                        |
| **P2 - High**     | Single application down, production impact    | < 15 min            | Flyer Crawler prod down, Stock Alert unaffected |
| **P3 - Medium**   | Test environment only, no production impact   | < 1 hour            | Test processes killed, production unaffected    |

---

## Detection

### How to Identify a PM2 Incident

**Automated Indicators**:

- Health check failures on `/api/health/ready`
- Monitoring alerts (UptimeRobot, etc.)
- Bugsink showing connection errors
- NGINX returning 502 Bad Gateway

**User-Reported Symptoms**:

- "The site is down"
- "I can't log in"
- "Pages are loading slowly then timing out"
- "I see a 502 error"

**Manual Discovery**:

```bash
# SSH to server
ssh gitea-runner@projectium.com

# Check if PM2 is running
pm2 list

# Expected output shows processes
# If empty or all errored = incident
```
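If no external monitor is configured, a lightweight probe can provide the same automated signal. The following is a minimal sketch only, assuming the health endpoints documented above and a cron or systemd timer on a separate host; the alerting step at the end is a placeholder, not part of this project's tooling:

```bash
#!/usr/bin/env bash
# Minimal external health probe (sketch). Run from cron or a systemd timer on
# a host other than the one being monitored.
set -u

ENDPOINTS=(
  "https://flyer-crawler.projectium.com/api/health/ping"
  "https://flyer-crawler-test.projectium.com/api/health/ping"
)

failures=0
for url in "${ENDPOINTS[@]}"; do
  if ! curl -sf --max-time 10 "$url" > /dev/null; then
    echo "$(date -u +%FT%TZ) health check FAILED: $url"
    failures=$((failures + 1))
  fi
done

# Non-zero exit lets cron/systemd surface the failure; replace the echo with
# a call to the real notification channel.
if [ "$failures" -gt 0 ]; then
  echo "ALERT: $failures health check(s) failing - possible PM2 incident"
  exit 1
fi
```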
### Incident Signature: Process Isolation Violation

When a PM2 incident is caused by process isolation failure, you will see:

```text
# Expected state (normal):
+-------------------------------------+----+-------+--------+-----+
| App name                            | id | mode  | status | cpu |
+-------------------------------------+----+-------+--------+-----+
| flyer-crawler-api                   | 0  | clust | online | 0%  |
| flyer-crawler-worker                | 1  | fork  | online | 0%  |
| flyer-crawler-analytics-worker      | 2  | fork  | online | 0%  |
| flyer-crawler-api-test              | 3  | fork  | online | 0%  |
| flyer-crawler-worker-test           | 4  | fork  | online | 0%  |
| flyer-crawler-analytics-worker-test | 5  | fork  | online | 0%  |
| stock-alert-api                     | 6  | fork  | online | 0%  |
+-------------------------------------+----+-------+--------+-----+

# Incident state (isolation violation):
# All processes missing or errored - not just one app
+-------------------------------------+----+-------+--------+-----+
| App name                            | id | mode  | status | cpu |
+-------------------------------------+----+-------+--------+-----+
# (empty or all processes errored/stopped)
+-------------------------------------+----+-------+--------+-----+
```

---

## Initial Assessment

### Step 1: Gather Information (2 minutes)

Run these commands and capture output:

```bash
# 1. Check PM2 status
pm2 list

# 2. Check PM2 daemon status
pm2 ping

# 3. Check recent PM2 logs
pm2 logs --lines 20 --nostream

# 4. Check system status
systemctl status pm2-gitea-runner --no-pager

# 5. Check disk space
df -h /

# 6. Check memory
free -h

# 7. Check recent deployments (in app directory)
cd /var/www/flyer-crawler.projectium.com
git log --oneline -5
```

### Step 2: Determine Scope

| Question                 | Command                                                           | Impact Level                    |
| ------------------------ | ----------------------------------------------------------------- | ------------------------------- |
| How many apps affected?  | `pm2 list`                                                        | Count missing/errored processes |
| Is production down?      | `curl https://flyer-crawler.projectium.com/api/health/ping`       | Yes/No                          |
| Is test down?            | `curl https://flyer-crawler-test.projectium.com/api/health/ping`  | Yes/No                          |
| Are other apps affected? | `pm2 list \| grep stock-alert`                                    | Yes/No                          |

### Step 3: Classify Severity

```text
Decision Tree:

Production app(s) down?
|
+-- YES: Multiple apps affected?
|        |
|        +-- YES --> P1 CRITICAL (all apps down)
|        |
|        +-- NO  --> P2 HIGH (single app down)
|
+-- NO: Test environment only?
         |
         +-- YES --> P3 MEDIUM
         |
         +-- NO  --> Investigate further
```

### Step 4: Document Initial State

Capture this information before making any changes:

```bash
# Save PM2 state to file
pm2 jlist > /tmp/pm2-incident-$(date +%Y%m%d-%H%M%S).json

# Save system state
{
  echo "=== PM2 List ==="
  pm2 list
  echo ""
  echo "=== Disk Space ==="
  df -h
  echo ""
  echo "=== Memory ==="
  free -h
  echo ""
  echo "=== Recent Git Commits ==="
  cd /var/www/flyer-crawler.projectium.com && git log --oneline -5
} > /tmp/incident-state-$(date +%Y%m%d-%H%M%S).txt
```

---

## Immediate Response

### Priority 1: Stop Ongoing Deployments

If a deployment is currently running:

1. Check Gitea Actions for running workflows
2. Cancel any in-progress deployment workflows
3. Do NOT start new deployments until incident resolved

### Priority 2: Assess Which Processes Are Down

```bash
# Get list of processes and their status
pm2 list

# Check which processes exist but are errored/stopped
pm2 jlist | jq '.[] | {name, status: .pm2_env.status}'
```

### Priority 3: Establish Order of Restoration

Restore in this order (production first, critical path first):

| Priority | Process                               | Rationale                            |
| -------- | ------------------------------------- | ------------------------------------ |
| 1        | `flyer-crawler-api`                   | Production API - highest user impact |
| 2        | `flyer-crawler-worker`                | Production background jobs           |
| 3        | `flyer-crawler-analytics-worker`      | Production analytics                 |
| 4        | `stock-alert-*`                       | Other production apps                |
| 5        | `flyer-crawler-api-test`              | Test environment                     |
| 6        | `flyer-crawler-worker-test`           | Test background jobs                 |
| 7        | `flyer-crawler-analytics-worker-test` | Test analytics                       |
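The scenarios in the next section give the manual commands for each case. As a sketch only (paths, ports, and config names follow the inventory table above, and the health gate mirrors Scenario D), the same order can be scripted:

```bash
#!/usr/bin/env bash
# Sketch: restore applications in the documented priority order, gating on the
# production health check before moving on. Adjust paths if the layout differs.
set -u

# 1. Flyer Crawler production first (highest user impact)
cd /var/www/flyer-crawler.projectium.com
pm2 start ecosystem.config.cjs

# Wait up to ~60s for production to report healthy before touching anything else
status=""
for _ in $(seq 1 12); do
  status=$(curl -s http://localhost:3001/api/health/ready | jq -r '.data.status')
  if [ "$status" = "healthy" ]; then break; fi
  sleep 5
done

if [ "$status" != "healthy" ]; then
  echo "Production did not become healthy - stop here and investigate" >&2
  exit 1
fi

# 2. Other production apps, then the test environment
cd /var/www/stock-alert.projectium.com && pm2 start ecosystem.config.cjs
cd /var/www/flyer-crawler-test.projectium.com && pm2 start ecosystem-test.config.cjs

# 3. Persist the restored process list and confirm
pm2 save
pm2 list
```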
---

## Process Restoration

### Scenario A: Flyer Crawler Production Processes Missing

```bash
# Navigate to production directory
cd /var/www/flyer-crawler.projectium.com

# Start production processes
pm2 start ecosystem.config.cjs

# Verify processes started
pm2 list

# Check health endpoint
curl -s http://localhost:3001/api/health/ready | jq .
```

### Scenario B: Flyer Crawler Test Processes Missing

```bash
# Navigate to test directory
cd /var/www/flyer-crawler-test.projectium.com

# Start test processes
pm2 start ecosystem-test.config.cjs

# Verify processes started
pm2 list

# Check health endpoint
curl -s http://localhost:3002/api/health/ready | jq .
```

### Scenario C: Stock Alert Processes Missing

```bash
# Navigate to stock-alert directory
cd /var/www/stock-alert.projectium.com

# Start processes (adjust config file name as needed)
pm2 start ecosystem.config.cjs

# Verify processes started
pm2 list
```

### Scenario D: All Processes Missing

Execute restoration in priority order:

```bash
# 1. Flyer Crawler Production (highest priority)
cd /var/www/flyer-crawler.projectium.com
pm2 start ecosystem.config.cjs

# Verify production is healthy before continuing
curl -s http://localhost:3001/api/health/ready | jq '.data.status'
# Should return "healthy"

# 2. Stock Alert Production
cd /var/www/stock-alert.projectium.com
pm2 start ecosystem.config.cjs

# 3. Flyer Crawler Test (lower priority)
cd /var/www/flyer-crawler-test.projectium.com
pm2 start ecosystem-test.config.cjs

# 4. Save PM2 process list
pm2 save

# 5. Final verification
pm2 list
```
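Before walking through the per-application health endpoints below, a quick scripted status check can confirm that every registered process reports `online`. A minimal sketch, assuming `jq` is installed; note it cannot detect processes that are missing entirely, so still compare against the inventory table:

```bash
# Fail (non-zero jq exit) if any registered PM2 process is not online
if pm2 jlist | jq -e 'all(.[]; .pm2_env.status == "online")' > /dev/null; then
  echo "All registered PM2 processes report online"
else
  echo "Processes not online:"
  pm2 jlist | jq -r '.[] | select(.pm2_env.status != "online") | "\(.name): \(.pm2_env.status)"'
fi
```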
### Health Check Verification

After restoration, verify each application:

**Flyer Crawler Production**:

```bash
# API health
curl -s https://flyer-crawler.projectium.com/api/health/ready | jq '.data.status'
# Expected: "healthy"

# Check all services
curl -s https://flyer-crawler.projectium.com/api/health/ready | jq '.data.services'
```

**Flyer Crawler Test**:

```bash
curl -s https://flyer-crawler-test.projectium.com/api/health/ready | jq '.data.status'
```

**Stock Alert**:

```bash
# Adjust URL as appropriate for stock-alert
curl -s https://stock-alert.projectium.com/api/health/ready | jq '.data.status'
```

### Verification Checklist

After restoration, confirm:

- [ ] `pm2 list` shows all expected processes as `online`
- [ ] Production health check returns `healthy`
- [ ] Test health check returns `healthy` (if applicable)
- [ ] No processes showing high restart count
- [ ] No processes showing `errored` or `stopped` status
- [ ] PM2 process list saved: `pm2 save`

---

## Root Cause Investigation

### Step 1: Check Workflow Execution Logs

```bash
# Find recent Gitea Actions runs
# (Access via Gitea web UI: Repository > Actions > Recent Runs)

# Look for these workflows:
# - deploy-to-prod.yml
# - deploy-to-test.yml
# - manual-deploy-major.yml
# - manual-db-restore.yml
```

### Step 2: Check PM2 Daemon Logs

```bash
# PM2 daemon logs
cat ~/.pm2/pm2.log | tail -100

# PM2 process-specific logs
ls -la ~/.pm2/logs/

# Recent API logs
tail -100 ~/.pm2/logs/flyer-crawler-api-out.log
tail -100 ~/.pm2/logs/flyer-crawler-api-error.log
```

### Step 3: Check System Logs

```bash
# System journal for PM2 service
journalctl -u pm2-gitea-runner -n 100 --no-pager

# Kernel messages (OOM killer, etc.)
journalctl -k -n 50 --no-pager | grep -i "killed\|oom\|memory"

# Authentication logs (unauthorized access)
tail -50 /var/log/auth.log
```

### Step 4: Git History Analysis

```bash
# Recent commits to deployment workflows
cd /var/www/flyer-crawler.projectium.com
git log --oneline -20 -- .gitea/workflows/

# Check what changed in PM2 configs
git log --oneline -10 -- ecosystem.config.cjs ecosystem-test.config.cjs

# Diff against last known good state
git diff -- .gitea/workflows/ ecosystem*.cjs
```

### Step 5: Timing Correlation

Create a timeline:

```text
| Time (UTC) | Event                        | Source              |
|------------|------------------------------|---------------------|
| XX:XX      | Last successful health check | Monitoring          |
| XX:XX      | Deployment workflow started  | Gitea Actions       |
| XX:XX      | First failed health check    | Monitoring          |
| XX:XX      | Incident detected            | User report / Alert |
| XX:XX      | Investigation started        | On-call             |
```

### Common Root Causes

| Root Cause                   | Evidence                               | Prevention                   |
| ---------------------------- | -------------------------------------- | ---------------------------- |
| `pm2 stop all` in workflow   | Workflow logs show "all" command       | Use explicit process names   |
| `pm2 delete all` in workflow | Empty PM2 list after deploy            | Use whitelist-based deletion |
| OOM killer                   | `journalctl -k` shows "Killed process" | Increase memory limits       |
| Disk space exhaustion        | `df -h` shows 100%                     | Log rotation, cleanup        |
| Manual intervention          | Shell history shows pm2 commands       | Document all manual actions  |
| Concurrent deployments       | Multiple workflows at same time        | Implement deployment locks   |
| Workflow caching issue       | Old workflow version executed          | Force workflow refresh       |
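For the concurrent-deployments case, a simple mutual-exclusion lock around the PM2 steps of each workflow prevents two deployments from modifying processes at the same time. A minimal sketch using `flock`; the lock file path and timeout are illustrative assumptions, not existing project configuration:

```bash
#!/usr/bin/env bash
# Sketch: serialize deployments on the server with a file lock.
# /tmp/pm2-deploy.lock and the 10-minute timeout are illustrative choices.
set -euo pipefail

exec 200>/tmp/pm2-deploy.lock

if ! flock --wait 600 200; then
  echo "Another deployment is still running - aborting" >&2
  exit 1
fi

# ... deployment steps that touch PM2 go here, e.g.:
# cd /var/www/flyer-crawler.projectium.com
# pm2 restart flyer-crawler-api flyer-crawler-worker flyer-crawler-analytics-worker

# The lock is released automatically when the script (and fd 200) exits
```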
---

## Communication Templates

### Incident Notification (Internal)

```text
Subject: [P1 INCIDENT] PM2 Process Isolation Failure - Multiple Apps Down

Status: INVESTIGATING
Time Detected: YYYY-MM-DD HH:MM UTC
Affected Systems: [flyer-crawler-prod, stock-alert-prod, ...]

Summary:
All PM2 processes on the projectium.com server were terminated unexpectedly.
Multiple production applications are currently down.

Impact:
- flyer-crawler.projectium.com: DOWN
- stock-alert.projectium.com: DOWN
- [other affected apps]

Current Actions:
- Restoring critical production processes
- Investigating root cause

Next Update: In 15 minutes or upon status change

Incident Commander: [Name]
```

### Status Update Template

```text
Subject: [P1 INCIDENT] PM2 Process Isolation Failure - UPDATE #N

Status: [INVESTIGATING | IDENTIFIED | RESTORING | RESOLVED]
Time: YYYY-MM-DD HH:MM UTC

Progress Since Last Update:
- [Action taken]
- [Discovery made]
- [Process restored]

Current State:
- flyer-crawler.projectium.com: [UP|DOWN]
- stock-alert.projectium.com: [UP|DOWN]

Root Cause: [If identified]

Next Steps:
- [Planned action]

ETA to Resolution: [If known]

Next Update: In [X] minutes
```

### Resolution Notification

```text
Subject: [RESOLVED] PM2 Process Isolation Failure

Status: RESOLVED
Time Resolved: YYYY-MM-DD HH:MM UTC
Total Downtime: X minutes

Summary:
All PM2 processes have been restored. Services are operating normally.

Root Cause:
[Brief description of what caused the incident]

Impact Summary:
- flyer-crawler.projectium.com: Down for X minutes
- stock-alert.projectium.com: Down for X minutes
- Estimated user impact: [description]

Immediate Actions Taken:
1. [Action]
2. [Action]

Follow-up Actions:
1. [ ] [Preventive measure] - Owner: [Name] - Due: [Date]
2. [ ] Post-incident review scheduled for [Date]

Post-Incident Review: [Link or scheduled time]
```

---

## Prevention Measures

### Pre-Deployment Checklist

Before triggering any deployment:

- [ ] Review workflow file for PM2 commands
- [ ] Confirm no `pm2 stop all`, `pm2 delete all`, or `pm2 restart all`
- [ ] Verify process names are explicitly listed
- [ ] Check for concurrent deployment risks
- [ ] Confirm recent workflow changes were reviewed

### Workflow Review Checklist

When reviewing deployment workflow changes:

- [ ] All PM2 `stop` commands use explicit process names
- [ ] All PM2 `delete` commands filter by process name pattern
- [ ] All PM2 `restart` commands use explicit process names
- [ ] Test deployments filter by `-test` suffix
- [ ] Production deployments use whitelist array

**Safe Patterns**:

```javascript
// Excerpt from a deployment script: `list` is the parsed output of `pm2 jlist`
// and exec() comes from Node's child_process module.

// SAFE: Explicit process names (production)
const prodProcesses = [
  'flyer-crawler-api',
  'flyer-crawler-worker',
  'flyer-crawler-analytics-worker',
];
list.forEach((p) => {
  if (
    (p.pm2_env.status === 'errored' || p.pm2_env.status === 'stopped') &&
    prodProcesses.includes(p.name)
  ) {
    exec('pm2 delete ' + p.pm2_env.pm_id);
  }
});

// SAFE: Pattern-based filtering (test)
list.forEach((p) => {
  if (p.name && p.name.endsWith('-test')) {
    exec('pm2 delete ' + p.pm2_env.pm_id);
  }
});
```

**Dangerous Patterns** (NEVER USE):

```bash
# DANGEROUS - affects ALL applications
pm2 stop all
pm2 delete all
pm2 restart all

# DANGEROUS - no name filtering
pm2 delete $(pm2 jlist | jq -r '.[] | select(.pm2_env.status == "errored") | .pm_id')
```

### PM2 Configuration Validation

Before deploying PM2 config changes:

```bash
# Test configuration locally
cd /var/www/flyer-crawler.projectium.com
node -e "console.log(JSON.stringify(require('./ecosystem.config.cjs'), null, 2))"

# Verify process names
node -e "require('./ecosystem.config.cjs').apps.forEach(a => console.log(a.name))"

# Expected output should match documented process names
```
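To back up the checklists above with something automatable, the workflow files can be scanned for the dangerous patterns before a deployment is triggered. A minimal sketch; the workflow directory is the `.gitea/workflows/` path referenced under Root Cause Investigation, and the pattern list is illustrative and may need extending:

```bash
#!/usr/bin/env bash
# Sketch: fail if any Gitea workflow contains a PM2 command that targets "all".
# Run from the repository root.
set -u

WORKFLOW_DIR=".gitea/workflows"
PATTERN='pm2 (stop|restart|delete) all'

if grep -rEn "$PATTERN" "$WORKFLOW_DIR"; then
  echo "ERROR: found PM2 commands affecting ALL processes - fix before deploying" >&2
  exit 1
fi

echo "No 'pm2 ... all' commands found in $WORKFLOW_DIR"
```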
### Deployment Monitoring

After every deployment:

```bash
# Immediate verification
pm2 list

# Check no unexpected processes were affected
pm2 list | grep -v flyer-crawler
# Should still show other apps (e.g., stock-alert)

# Health check
curl -s https://flyer-crawler.projectium.com/api/health/ready | jq '.data.status'
```

---

## Contact Information

### On-Call Escalation

| Role              | Contact        | When to Escalate                    |
| ----------------- | -------------- | ----------------------------------- |
| Primary On-Call   | [Name/Channel] | First responder                     |
| Secondary On-Call | [Name/Channel] | If primary unavailable after 10 min |
| Engineering Lead  | [Name/Channel] | P1 incidents > 30 min               |
| Product Owner     | [Name/Channel] | User communication needed           |

### External Dependencies

| Service         | Support Channel | When to Contact         |
| --------------- | --------------- | ----------------------- |
| Server Provider | [Contact info]  | Hardware/network issues |
| DNS Provider    | [Contact info]  | DNS resolution failures |
| SSL Certificate | [Contact info]  | Certificate issues      |

### Communication Channels

| Channel        | Purpose                    |
| -------------- | -------------------------- |
| `#incidents`   | Real-time incident updates |
| `#deployments` | Deployment announcements   |
| `#engineering` | Technical discussion       |
| Email list     | Formal notifications       |

---

## Post-Incident Review

### Incident Report Template

```markdown
# Incident Report: [Title]

## Overview

| Field              | Value             |
| ------------------ | ----------------- |
| Date               | YYYY-MM-DD        |
| Duration           | X hours Y minutes |
| Severity           | P1/P2/P3          |
| Incident Commander | [Name]            |
| Status             | Resolved          |

## Timeline

| Time (UTC) | Event               |
| ---------- | ------------------- |
| HH:MM      | [Event description] |
| HH:MM      | [Event description] |

## Impact

- **Users affected**: [Number/description]
- **Revenue impact**: [If applicable]
- **SLA impact**: [If applicable]

## Root Cause

[Detailed technical explanation]

## Resolution

[What was done to resolve the incident]

## Contributing Factors

1. [Factor]
2. [Factor]

## Action Items

| Action   | Owner  | Due Date | Status |
| -------- | ------ | -------- | ------ |
| [Action] | [Name] | [Date]   | [ ]    |

## Lessons Learned

### What Went Well

- [Item]

### What Could Be Improved

- [Item]

## Appendix

- Link to monitoring data
- Link to relevant logs
- Link to workflow runs
```

### Lessons Learned Format

Use the "5 Whys" technique:

```text
Problem: All PM2 processes were killed during deployment

Why 1: The deployment workflow ran `pm2 delete all`
Why 2: The workflow used an outdated version of the script
Why 3: Gitea runner cached the old workflow file
Why 4: No mechanism to verify workflow version before execution
Why 5: Workflow versioning and audit trail not implemented

Root Cause: Lack of workflow versioning and execution verification

Preventive Measure: Implement workflow hash logging and pre-execution verification
```
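The preventive measure named above (workflow hash logging) can start as simply as recording a checksum of the workflow file at the beginning of every run, so the executed version can later be compared against the committed one. A minimal sketch; the log location and workflow filename are illustrative assumptions:

```bash
#!/usr/bin/env bash
# Sketch: log which workflow version actually executed.
# Intended as the first step of a deployment workflow; the log path is illustrative.
set -u

WORKFLOW_FILE=".gitea/workflows/deploy-to-prod.yml"

{
  echo "run started: $(date -u +%FT%TZ)"
  echo "git commit:  $(git rev-parse HEAD)"
  echo "workflow:    $WORKFLOW_FILE"
  echo "sha256:      $(sha256sum "$WORKFLOW_FILE" | awk '{print $1}')"
} >> /var/log/deploy-workflow-audit.log
```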
### Action Items Tracking

Create Gitea issues for each action item:

```bash
# Example shown with GitHub-style `gh` CLI syntax for illustration;
# adapt to the Gitea API or `tea` CLI used on this server.
gh issue create --title "Implement PM2 state logging in deployment workflows" \
  --body "Related to incident YYYY-MM-DD. Add pre-deployment PM2 state capture." \
  --label "incident-follow-up,priority:high"
```

Track action items in a central location:

| Issue # | Action                           | Owner  | Due    | Status |
| ------- | -------------------------------- | ------ | ------ | ------ |
| #123    | Add PM2 state logging            | [Name] | [Date] | Open   |
| #124    | Implement workflow version hash  | [Name] | [Date] | Open   |
| #125    | Create deployment lock mechanism | [Name] | [Date] | Open   |

---

## Appendix: PM2 Command Reference

### Safe Commands

```bash
# Status and monitoring
pm2 list
pm2 show <process-name>
pm2 monit
pm2 logs

# Restart specific processes
pm2 restart flyer-crawler-api
pm2 restart flyer-crawler-api flyer-crawler-worker flyer-crawler-analytics-worker

# Reload (zero-downtime, cluster mode only)
pm2 reload flyer-crawler-api

# Start from config
pm2 start ecosystem.config.cjs
pm2 start ecosystem.config.cjs --only flyer-crawler-api
```

### Dangerous Commands (Use With Caution)

```bash
# CAUTION: These affect ALL processes
pm2 stop all      # Stops every PM2 process
pm2 restart all   # Restarts every PM2 process
pm2 delete all    # Removes every PM2 process

# CAUTION: Modifies saved process list
pm2 save          # Overwrites saved process list
pm2 resurrect     # Restores from saved list

# CAUTION: Affects PM2 daemon
pm2 kill          # Kills PM2 daemon and all processes
pm2 update        # Updates PM2 in place (may cause brief outage)
```

---

## Revision History

| Date       | Author                 | Change                   |
| ---------- | ---------------------- | ------------------------ |
| 2026-02-17 | Incident Response Team | Initial runbook creation |