PM2 Incident Response Runbook
Purpose: Step-by-step procedures for responding to PM2 process isolation incidents on the projectium.com server.
Audience: On-call responders, system administrators, developers with server access.
Last updated: 2026-02-17
Related documentation:
- CLAUDE.md - PM2 Process Isolation Rules
- Incident Report: 2026-02-17
- Monitoring Guide
- Deployment Guide
Table of Contents
- Quick Reference
- Detection
- Initial Assessment
- Immediate Response
- Process Restoration
- Root Cause Investigation
- Communication Templates
- Prevention Measures
- Contact Information
- Post-Incident Review
Quick Reference
PM2 Process Inventory
| Application | Environment | Process Names | Config File | Directory |
|---|---|---|---|---|
| Flyer Crawler | Production | `flyer-crawler-api`, `flyer-crawler-worker`, `flyer-crawler-analytics-worker` | `ecosystem.config.cjs` | `/var/www/flyer-crawler.projectium.com` |
| Flyer Crawler | Test | `flyer-crawler-api-test`, `flyer-crawler-worker-test`, `flyer-crawler-analytics-worker-test` | `ecosystem-test.config.cjs` | `/var/www/flyer-crawler-test.projectium.com` |
| Stock Alert | Production | `stock-alert-*` | (varies) | `/var/www/stock-alert.projectium.com` |
Critical Commands
IMPORTANT: Every pm2 start, pm2 restart, pm2 stop, or pm2 delete command MUST be immediately followed by pm2 save to persist changes.
# Check PM2 status (read-only, no save needed)
pm2 list
# Check specific process (read-only, no save needed)
pm2 show flyer-crawler-api
# View recent logs (read-only, no save needed)
pm2 logs --lines 50
# ✅ Restart specific processes (SAFE - includes save)
pm2 restart flyer-crawler-api flyer-crawler-worker flyer-crawler-analytics-worker && pm2 save
# ✅ Start processes from config (SAFE - includes save)
pm2 startOrReload /var/www/flyer-crawler.projectium.com/ecosystem.config.cjs --update-env && pm2 save
# ❌ DO NOT USE (affects ALL apps)
# pm2 restart all <-- DANGEROUS
# pm2 stop all <-- DANGEROUS
# pm2 delete all <-- DANGEROUS
# ❌ DO NOT FORGET pm2 save after state changes
# pm2 restart flyer-crawler-api <-- WRONG: Missing save, process becomes ephemeral
Why pm2 save Matters: Without it, processes become ephemeral and disappear on daemon restarts, server reboots, or internal PM2 reconciliation events.
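If it helps to make the save step hard to forget, a thin wrapper can enforce it. This is a minimal sketch of a hypothetical helper (the `pm2-safe` name and script are our own, not part of PM2); it also refuses the blanket `all` target described below.
#!/usr/bin/env bash
# pm2-safe: run a state-changing PM2 command, then persist the process list.
# Hypothetical helper script -- not part of PM2 itself.
set -euo pipefail

verb="${1:?usage: pm2-safe <start|restart|stop|delete|startOrReload> <args...>}"
case "$verb" in
  start|restart|stop|delete|startOrReload) ;;
  *) echo "pm2-safe: refusing to wrap '$verb'" >&2; exit 1 ;;
esac

# Refuse the blanket 'all' target outright (see the isolation rules above).
for arg in "$@"; do
  if [ "$arg" = "all" ]; then
    echo "pm2-safe: 'all' targets are forbidden" >&2
    exit 1
  fi
done

pm2 "$@"
pm2 save
Used as `pm2-safe restart flyer-crawler-api`, the trailing `pm2 save` can no longer be skipped by accident.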
Severity Classification
| Severity | Criteria | Response Time | Example |
|---|---|---|---|
| P1 - Critical | Multiple applications down, production impact | Immediate (< 5 min) | All PM2 processes killed |
| P2 - High | Single application down, production impact | < 15 min | Flyer Crawler prod down, Stock Alert unaffected |
| P3 - Medium | Test environment only, no production impact | < 1 hour | Test processes killed, production unaffected |
Detection
How to Identify a PM2 Incident
Automated Indicators:
- Health check failures on `/api/health/ready`
- Monitoring alerts (UptimeRobot, etc.)
- Bugsink showing connection errors
- NGINX returning 502 Bad Gateway
User-Reported Symptoms:
- "The site is down"
- "I can't log in"
- "Pages are loading slowly then timing out"
- "I see a 502 error"
Manual Discovery:
# SSH to server
ssh gitea-runner@projectium.com
# Check if PM2 is running
pm2 list
# Expected output shows processes
# If empty or all errored = incident
Incident Signature: Process Isolation Violation
When a PM2 incident is caused by process isolation failure, you will see:
# Expected state (normal):
+-----------------------------------+----+-----+---------+-------+
| App name | id |mode | status | cpu |
+-----------------------------------+----+-----+---------+-------+
| flyer-crawler-api | 0 |clust| online | 0% |
| flyer-crawler-worker | 1 |fork | online | 0% |
| flyer-crawler-analytics-worker | 2 |fork | online | 0% |
| flyer-crawler-api-test | 3 |fork | online | 0% |
| flyer-crawler-worker-test | 4 |fork | online | 0% |
| flyer-crawler-analytics-worker-test| 5 |fork | online | 0% |
| stock-alert-api | 6 |fork | online | 0% |
+-----------------------------------+----+-----+---------+-------+
# Incident state (isolation violation):
# All processes missing or errored - not just one app
+-----------------------------------+----+-----+---------+-------+
| App name | id |mode | status | cpu |
+-----------------------------------+----+-----+---------+-------+
# (empty or all processes errored/stopped)
+-----------------------------------+----+-----+---------+-------+
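As a rough automated check, this signature can be detected by counting online processes against the documented inventory. A minimal sketch; the threshold of 7 is taken from the PM2 Process Inventory table above and must be adjusted whenever that inventory changes.
#!/usr/bin/env bash
# Rough isolation-violation check: alert when fewer processes are online than expected.
set -euo pipefail

EXPECTED_ONLINE=7   # from the PM2 Process Inventory table; adjust if the inventory changes

online_count=$(pm2 jlist | jq '[.[] | select(.pm2_env.status == "online")] | length')

if [ "$online_count" -lt "$EXPECTED_ONLINE" ]; then
  echo "POSSIBLE PM2 INCIDENT: only ${online_count}/${EXPECTED_ONLINE} processes online" >&2
  pm2 list >&2
  exit 1
fi
echo "PM2 looks healthy: ${online_count} processes online"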
Initial Assessment
Step 1: Gather Information (2 minutes)
Run these commands and capture output:
# 1. Check PM2 status
pm2 list
# 2. Check PM2 daemon status
pm2 ping
# 3. Check recent PM2 logs
pm2 logs --lines 20 --nostream
# 4. Check system status
systemctl status pm2-gitea-runner --no-pager
# 5. Check disk space
df -h /
# 6. Check memory
free -h
# 7. Check recent deployments (in app directory)
cd /var/www/flyer-crawler.projectium.com
git log --oneline -5
Step 2: Determine Scope
| Question | Command | Impact Level |
|---|---|---|
| How many apps affected? | `pm2 list` | Count missing/errored processes |
| Is production down? | `curl https://flyer-crawler.projectium.com/api/health/ping` | Yes/No |
| Is test down? | `curl https://flyer-crawler-test.projectium.com/api/health/ping` | Yes/No |
| Are other apps affected? | `pm2 list \| grep stock-alert` | Yes/No |
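The scope questions above can be answered in one pass. A minimal sketch using the health endpoints and process names documented in this runbook:
#!/usr/bin/env bash
# Quick scope check: production, test, and other apps in one pass.
# Deliberately no 'set -e' so one failed check does not abort the rest.

check_url() {
  local label="$1" url="$2"
  if curl -fsS --max-time 5 "$url" > /dev/null; then
    echo "$label: UP"
  else
    echo "$label: DOWN"
  fi
}

check_url "flyer-crawler prod" "https://flyer-crawler.projectium.com/api/health/ping"
check_url "flyer-crawler test" "https://flyer-crawler-test.projectium.com/api/health/ping"

echo "Other apps still registered with PM2:"
pm2 list | grep stock-alert || echo "  (no stock-alert processes found)"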
Step 3: Classify Severity
Decision Tree:
Production app(s) down?
|
+-- YES: Multiple apps affected?
| |
| +-- YES --> P1 CRITICAL (all apps down)
| |
| +-- NO --> P2 HIGH (single app down)
|
+-- NO: Test environment only?
|
+-- YES --> P3 MEDIUM
|
+-- NO --> Investigate further
Step 4: Document Initial State
Capture this information before making any changes:
# Save PM2 state to file
pm2 jlist > /tmp/pm2-incident-$(date +%Y%m%d-%H%M%S).json
# Save system state
{
echo "=== PM2 List ==="
pm2 list
echo ""
echo "=== Disk Space ==="
df -h
echo ""
echo "=== Memory ==="
free -h
echo ""
echo "=== Recent Git Commits ==="
cd /var/www/flyer-crawler.projectium.com && git log --oneline -5
} > /tmp/incident-state-$(date +%Y%m%d-%H%M%S).txt
Immediate Response
Priority 1: Stop Ongoing Deployments
If a deployment is currently running:
- Check Gitea Actions for running workflows
- Cancel any in-progress deployment workflows
- Do NOT start new deployments until the incident is resolved
Priority 2: Assess Which Processes Are Down
# Get list of processes and their status
pm2 list
# Check which processes exist but are errored/stopped
pm2 jlist | jq '.[] | {name, status: .pm2_env.status}'
Priority 3: Establish Order of Restoration
Restore in this order (production first, critical path first):
| Priority | Process | Rationale |
|---|---|---|
| 1 | `flyer-crawler-api` | Production API - highest user impact |
| 2 | `flyer-crawler-worker` | Production background jobs |
| 3 | `flyer-crawler-analytics-worker` | Production analytics |
| 4 | `stock-alert-*` | Other production apps |
| 5 | `flyer-crawler-api-test` | Test environment |
| 6 | `flyer-crawler-worker-test` | Test background jobs |
| 7 | `flyer-crawler-analytics-worker-test` | Test analytics |
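Where it helps, this order can be encoded so restoration stops at the first production failure instead of blindly continuing. A minimal sketch, assuming the directories and config files from the inventory table; it is a convenience wrapper around the per-scenario steps below, not a replacement for them.
#!/usr/bin/env bash
# Restore applications in priority order, production first.
set -euo pipefail

# "<app directory>:<config file>" pairs, taken from the PM2 Process Inventory table
restore_order=(
  "/var/www/flyer-crawler.projectium.com:ecosystem.config.cjs"
  "/var/www/stock-alert.projectium.com:ecosystem.config.cjs"
  "/var/www/flyer-crawler-test.projectium.com:ecosystem-test.config.cjs"
)

for entry in "${restore_order[@]}"; do
  dir="${entry%%:*}"
  config="${entry##*:}"
  echo "Restoring ${dir} (${config})"
  (cd "$dir" && pm2 start "$config")   # set -e stops here if a start fails
done

pm2 save
pm2 list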
Process Restoration
Scenario A: Flyer Crawler Production Processes Missing
# Navigate to production directory
cd /var/www/flyer-crawler.projectium.com
# Start production processes
pm2 start ecosystem.config.cjs && pm2 save
# Verify processes started
pm2 list
# Check health endpoint
curl -s http://localhost:3001/api/health/ready | jq .
Scenario B: Flyer Crawler Test Processes Missing
# Navigate to test directory
cd /var/www/flyer-crawler-test.projectium.com
# Start test processes
pm2 start ecosystem-test.config.cjs && pm2 save
# Verify processes started
pm2 list
# Check health endpoint
curl -s http://localhost:3002/api/health/ready | jq .
Scenario C: Stock Alert Processes Missing
# Navigate to stock-alert directory
cd /var/www/stock-alert.projectium.com
# Start processes (adjust config file name as needed)
pm2 start ecosystem.config.cjs && pm2 save
# Verify processes started
pm2 list
Scenario D: All Processes Missing
Execute restoration in priority order:
# 1. Flyer Crawler Production (highest priority)
cd /var/www/flyer-crawler.projectium.com
pm2 start ecosystem.config.cjs
# Verify production is healthy before continuing
curl -s http://localhost:3001/api/health/ready | jq '.data.status'
# Should return "healthy"
# 2. Stock Alert Production
cd /var/www/stock-alert.projectium.com
pm2 start ecosystem.config.cjs
# 3. Flyer Crawler Test (lower priority)
cd /var/www/flyer-crawler-test.projectium.com
pm2 start ecosystem-test.config.cjs
# 4. Save PM2 process list
pm2 save
# 5. Final verification
pm2 list
Health Check Verification
After restoration, verify each application:
Flyer Crawler Production:
# API health
curl -s https://flyer-crawler.projectium.com/api/health/ready | jq '.data.status'
# Expected: "healthy"
# Check all services
curl -s https://flyer-crawler.projectium.com/api/health/ready | jq '.data.services'
Flyer Crawler Test:
curl -s https://flyer-crawler-test.projectium.com/api/health/ready | jq '.data.status'
Stock Alert:
# Adjust URL as appropriate for stock-alert
curl -s https://stock-alert.projectium.com/api/health/ready | jq '.data.status'
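A loop over the three ready endpoints keeps this step quick. A minimal sketch; it assumes the stock-alert health URL follows the same pattern, which (as noted above) should be verified before relying on the result.
#!/usr/bin/env bash
# Print the reported status from each application's ready endpoint.
for url in \
  "https://flyer-crawler.projectium.com/api/health/ready" \
  "https://flyer-crawler-test.projectium.com/api/health/ready" \
  "https://stock-alert.projectium.com/api/health/ready"; do
  # An unreachable endpoint produces no output, which falls through to "unreachable".
  status=$(curl -fsS --max-time 5 "$url" | jq -r '.data.status // "unknown"')
  echo "${url}: ${status:-unreachable}"
done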
Verification Checklist
After restoration, confirm:
- `pm2 list` shows all expected processes as `online`
- Production health check returns `healthy`
- Test health check returns `healthy` (if applicable)
- No processes showing high restart count
- No processes showing `errored` or `stopped` status
- PM2 process list saved: `pm2 save`
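The PM2-side items in the checklist above can also be scripted. A minimal sketch, assuming the `restart_time` counter PM2 exposes in `pm2 jlist`; the restart threshold of 5 is an arbitrary illustration, not an established limit.
#!/usr/bin/env bash
# Post-restoration PM2 checks: no errored/stopped processes, no restart storms.
set -euo pipefail

problems=$(pm2 jlist | jq -r '
  .[]
  | select(.pm2_env.status != "online" or .pm2_env.restart_time > 5)
  | "\(.name): status=\(.pm2_env.status) restarts=\(.pm2_env.restart_time)"')

if [ -n "$problems" ]; then
  echo "Verification FAILED:"
  echo "$problems"
  exit 1
fi

echo "All processes online with low restart counts."
pm2 save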
Root Cause Investigation
Step 1: Check Workflow Execution Logs
# Find recent Gitea Actions runs
# (Access via Gitea web UI: Repository > Actions > Recent Runs)
# Look for these workflows:
# - deploy-to-prod.yml
# - deploy-to-test.yml
# - manual-deploy-major.yml
# - manual-db-restore.yml
Step 2: Check PM2 Daemon Logs
# PM2 daemon logs
cat ~/.pm2/pm2.log | tail -100
# PM2 process-specific logs
ls -la ~/.pm2/logs/
# Recent API logs
tail -100 ~/.pm2/logs/flyer-crawler-api-out.log
tail -100 ~/.pm2/logs/flyer-crawler-api-error.log
Step 3: Check System Logs
# System journal for PM2 service
journalctl -u pm2-gitea-runner -n 100 --no-pager
# Kernel messages (OOM killer, etc.)
journalctl -k -n 50 --no-pager | grep -i "killed\|oom\|memory"
# Authentication logs (unauthorized access)
tail -50 /var/log/auth.log
Step 4: Git History Analysis
# Recent commits to deployment workflows
cd /var/www/flyer-crawler.projectium.com
git log --oneline -20 -- .gitea/workflows/
# Check what changed in PM2 configs
git log --oneline -10 -- ecosystem.config.cjs ecosystem-test.config.cjs
# Diff against last known good state
git diff <last-good-commit> -- .gitea/workflows/ ecosystem*.cjs
Step 5: Timing Correlation
Create a timeline:
| Time (UTC) | Event | Source |
|------------|-------|--------|
| XX:XX | Last successful health check | Monitoring |
| XX:XX | Deployment workflow started | Gitea Actions |
| XX:XX | First failed health check | Monitoring |
| XX:XX | Incident detected | User report / Alert |
| XX:XX | Investigation started | On-call |
Common Root Causes
| Root Cause | Evidence | Prevention |
|---|---|---|
| `pm2 stop all` in workflow | Workflow logs show "all" command | Use explicit process names |
| `pm2 delete all` in workflow | Empty PM2 list after deploy | Use whitelist-based deletion |
| OOM killer | `journalctl -k` shows "Killed process" | Increase memory limits |
| Disk space exhaustion | `df -h` shows 100% | Log rotation, cleanup |
| Manual intervention | Shell history shows `pm2` commands | Document all manual actions |
| Concurrent deployments | Multiple workflows at same time | Implement deployment locks |
| Workflow caching issue | Old workflow version executed | Force workflow refresh |
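The evidence column above can be gathered in one sweep. A minimal triage sketch using the same log locations as the investigation steps; interpret the output manually rather than treating it as a verdict.
#!/usr/bin/env bash
# Gather the usual root-cause evidence in one pass (OOM, disk, manual pm2 usage).
echo "=== OOM killer activity ==="
journalctl -k -n 200 --no-pager | grep -i "killed process\|out of memory" || echo "none found"

echo "=== Disk usage ==="
df -h /

echo "=== Recent pm2 commands in shell history ==="
tail -200 ~/.bash_history 2>/dev/null | grep "pm2 " || echo "none found"

echo "=== Blanket 'all' targets in deployment workflows ==="
grep -rnE "pm2 (stop|restart|delete) all" /var/www/flyer-crawler.projectium.com/.gitea/workflows/ 2>/dev/null || echo "none found"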
Communication Templates
Incident Notification (Internal)
Subject: [P1 INCIDENT] PM2 Process Isolation Failure - Multiple Apps Down
Status: INVESTIGATING
Time Detected: YYYY-MM-DD HH:MM UTC
Affected Systems: [flyer-crawler-prod, stock-alert-prod, ...]
Summary:
All PM2 processes on projectium.com server were terminated unexpectedly.
Multiple production applications are currently down.
Impact:
- flyer-crawler.projectium.com: DOWN
- stock-alert.projectium.com: DOWN
- [other affected apps]
Current Actions:
- Restoring critical production processes
- Investigating root cause
Next Update: In 15 minutes or upon status change
Incident Commander: [Name]
Status Update Template
Subject: [P1 INCIDENT] PM2 Process Isolation Failure - UPDATE #N
Status: [INVESTIGATING | IDENTIFIED | RESTORING | RESOLVED]
Time: YYYY-MM-DD HH:MM UTC
Progress Since Last Update:
- [Action taken]
- [Discovery made]
- [Process restored]
Current State:
- flyer-crawler.projectium.com: [UP|DOWN]
- stock-alert.projectium.com: [UP|DOWN]
Root Cause: [If identified]
Next Steps:
- [Planned action]
ETA to Resolution: [If known]
Next Update: In [X] minutes
Resolution Notification
Subject: [RESOLVED] PM2 Process Isolation Failure
Status: RESOLVED
Time Resolved: YYYY-MM-DD HH:MM UTC
Total Downtime: X minutes
Summary:
All PM2 processes have been restored. Services are operating normally.
Root Cause:
[Brief description of what caused the incident]
Impact Summary:
- flyer-crawler.projectium.com: Down for X minutes
- stock-alert.projectium.com: Down for X minutes
- Estimated user impact: [description]
Immediate Actions Taken:
1. [Action]
2. [Action]
Follow-up Actions:
1. [ ] [Preventive measure] - Owner: [Name] - Due: [Date]
2. [ ] Post-incident review scheduled for [Date]
Post-Incident Review: [Link or scheduled time]
Prevention Measures
Pre-Deployment Checklist
Before triggering any deployment:
- Review workflow file for PM2 commands
- Confirm no `pm2 stop all`, `pm2 delete all`, or `pm2 restart all`
- Verify process names are explicitly listed
- Check for concurrent deployment risks
- Confirm recent workflow changes were reviewed
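The first two checklist items lend themselves to an automated scan. A minimal sketch that greps the workflow directory referenced elsewhere in this runbook for blanket PM2 targets; a clean result is necessary but not sufficient, so still review the diff.
#!/usr/bin/env bash
# Pre-deployment scan: fail if any workflow file uses a blanket PM2 target.
set -euo pipefail

workflow_dir="/var/www/flyer-crawler.projectium.com/.gitea/workflows"

if grep -rnE "pm2 (stop|restart|delete) all" "$workflow_dir"; then
  echo "BLOCK DEPLOYMENT: blanket PM2 command found above" >&2
  exit 1
fi
echo "No blanket PM2 commands found in ${workflow_dir}"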
Workflow Review Checklist
When reviewing deployment workflow changes:
- All PM2 `stop` commands use explicit process names
- All PM2 `delete` commands filter by process name pattern
- All PM2 `restart` commands use explicit process names
- Test deployments filter by `-test` suffix
- Production deployments use whitelist array
Safe Patterns:
// Context assumed by both snippets: `list` is the parsed output of `pm2 jlist`
// and `exec` runs a shell command synchronously (e.g. inside a deploy script).
const { execSync } = require('child_process');
const exec = (cmd) => execSync(cmd, { stdio: 'inherit' });
const list = JSON.parse(execSync('pm2 jlist', { encoding: 'utf8' }));

// SAFE: Explicit process names (production)
const prodProcesses = [
'flyer-crawler-api',
'flyer-crawler-worker',
'flyer-crawler-analytics-worker',
];
list.forEach((p) => {
if (
(p.pm2_env.status === 'errored' || p.pm2_env.status === 'stopped') &&
prodProcesses.includes(p.name)
) {
exec('pm2 delete ' + p.pm2_env.pm_id);
}
});
// SAFE: Pattern-based filtering (test)
list.forEach((p) => {
if (p.name && p.name.endsWith('-test')) {
exec('pm2 delete ' + p.pm2_env.pm_id);
}
});
Dangerous Patterns (NEVER USE):
# DANGEROUS - affects ALL applications
pm2 stop all
pm2 delete all
pm2 restart all
# DANGEROUS - no name filtering
pm2 delete $(pm2 jlist | jq -r '.[] | select(.pm2_env.status == "errored") | .pm_id')
PM2 Configuration Validation
Before deploying PM2 config changes:
# Test configuration locally
cd /var/www/flyer-crawler.projectium.com
node -e "console.log(JSON.stringify(require('./ecosystem.config.cjs'), null, 2))"
# Verify process names
node -e "require('./ecosystem.config.cjs').apps.forEach(a => console.log(a.name))"
# Expected output should match documented process names
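The one-liners above can be extended into a check against the documented inventory. A minimal sketch, assuming the production process names from the Quick Reference; it compares names only, not the rest of the config.
#!/usr/bin/env bash
# Compare process names in ecosystem.config.cjs against the documented inventory.
set -euo pipefail
cd /var/www/flyer-crawler.projectium.com

expected="flyer-crawler-analytics-worker
flyer-crawler-api
flyer-crawler-worker"   # sorted; taken from the Quick Reference inventory

actual=$(node -e "require('./ecosystem.config.cjs').apps.forEach(a => console.log(a.name))" | sort)

if [ "$expected" = "$actual" ]; then
  echo "Process names match the documented inventory."
else
  echo "MISMATCH between config and documented inventory:" >&2
  diff <(echo "$expected") <(echo "$actual") >&2 || true
  exit 1
fi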
Deployment Monitoring
After every deployment:
# Immediate verification
pm2 list
# Check no unexpected processes were affected
pm2 list | grep -v flyer-crawler
# Should still show other apps (e.g., stock-alert)
# Health check
curl -s https://flyer-crawler.projectium.com/api/health/ready | jq '.data.status'
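A deployment that disturbs other applications is exactly the failure mode this runbook exists for, so the post-deployment check is worth scripting too. A minimal sketch, assuming `stock-alert` is the only other app family on the server:
#!/usr/bin/env bash
# Post-deployment check: deployed app healthy, other apps untouched.
set -euo pipefail

# 1. Deployed app is healthy
status=$(curl -fsS https://flyer-crawler.projectium.com/api/health/ready | jq -r '.data.status')
if [ "$status" != "healthy" ]; then
  echo "Deployed app is not healthy (status: ${status})" >&2
  exit 1
fi
echo "flyer-crawler prod: ${status}"

# 2. Unrelated apps are still registered and online
if ! pm2 jlist | jq -e '[.[] | select(.name | startswith("stock-alert"))
                             | select(.pm2_env.status == "online")] | length > 0' > /dev/null; then
  echo "WARNING: no online stock-alert processes found after deployment" >&2
  exit 1
fi
echo "stock-alert processes unaffected."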
Contact Information
On-Call Escalation
| Role | Contact | When to Escalate |
|---|---|---|
| Primary On-Call | [Name/Channel] | First responder |
| Secondary On-Call | [Name/Channel] | If primary unavailable after 10 min |
| Engineering Lead | [Name/Channel] | P1 incidents > 30 min |
| Product Owner | [Name/Channel] | User communication needed |
External Dependencies
| Service | Support Channel | When to Contact |
|---|---|---|
| Server Provider | [Contact info] | Hardware/network issues |
| DNS Provider | [Contact info] | DNS resolution failures |
| SSL Certificate | [Contact info] | Certificate issues |
Communication Channels
| Channel | Purpose |
|---|---|
| `#incidents` | Real-time incident updates |
| `#deployments` | Deployment announcements |
| `#engineering` | Technical discussion |
| Email list | Formal notifications |
Post-Incident Review
Incident Report Template
# Incident Report: [Title]
## Overview
| Field | Value |
| ------------------ | ----------------- |
| Date | YYYY-MM-DD |
| Duration | X hours Y minutes |
| Severity | P1/P2/P3 |
| Incident Commander | [Name] |
| Status | Resolved |
## Timeline
| Time (UTC) | Event |
| ---------- | ------------------- |
| HH:MM | [Event description] |
| HH:MM | [Event description] |
## Impact
- **Users affected**: [Number/description]
- **Revenue impact**: [If applicable]
- **SLA impact**: [If applicable]
## Root Cause
[Detailed technical explanation]
## Resolution
[What was done to resolve the incident]
## Contributing Factors
1. [Factor]
2. [Factor]
## Action Items
| Action | Owner | Due Date | Status |
| -------- | ------ | -------- | ------ |
| [Action] | [Name] | [Date] | [ ] |
## Lessons Learned
### What Went Well
- [Item]
### What Could Be Improved
- [Item]
## Appendix
- Link to monitoring data
- Link to relevant logs
- Link to workflow runs
Lessons Learned Format
Use "5 Whys" technique:
Problem: All PM2 processes were killed during deployment
Why 1: The deployment workflow ran `pm2 delete all`
Why 2: The workflow used an outdated version of the script
Why 3: Gitea runner cached the old workflow file
Why 4: No mechanism to verify workflow version before execution
Why 5: Workflow versioning and audit trail not implemented
Root Cause: Lack of workflow versioning and execution verification
Preventive Measure: Implement workflow hash logging and pre-execution verification
Action Items Tracking
Create Gitea issues for each action item:
# Example (GitHub CLI syntax shown for illustration; adapt for the Gitea API or tea CLI)
gh issue create --title "Implement PM2 state logging in deployment workflows" \
--body "Related to incident YYYY-MM-DD. Add pre-deployment PM2 state capture." \
--label "incident-follow-up,priority:high"
Track action items in a central location:
| Issue # | Action | Owner | Due | Status |
|---|---|---|---|---|
| #123 | Add PM2 state logging | [Name] | [Date] | Open |
| #124 | Implement workflow version hash | [Name] | [Date] | Open |
| #125 | Create deployment lock mechanism | [Name] | [Date] | Open |
Appendix: PM2 Command Reference
Safe Commands
# Status and monitoring
pm2 list
pm2 show <process-name>
pm2 monit
pm2 logs <process-name>
# Restart specific processes
pm2 restart flyer-crawler-api
pm2 restart flyer-crawler-api flyer-crawler-worker flyer-crawler-analytics-worker
# Reload (zero-downtime, cluster mode only)
pm2 reload flyer-crawler-api
# Start from config
pm2 start ecosystem.config.cjs
pm2 start ecosystem.config.cjs --only flyer-crawler-api
Dangerous Commands (Use With Caution)
# CAUTION: These affect ALL processes
pm2 stop all # Stops every PM2 process
pm2 restart all # Restarts every PM2 process
pm2 delete all # Removes every PM2 process
# CAUTION: Modifies saved process list
pm2 save # Overwrites saved process list
pm2 resurrect # Restores from saved list
# CAUTION: Affects PM2 daemon
pm2 kill # Kills PM2 daemon and all processes
pm2 update # Updates PM2 in place (may cause brief outage)
Revision History
| Date | Author | Change |
|---|---|---|
| 2026-02-17 | Incident Response Team | Initial runbook creation |