# PM2 Incident Response Runbook

**Purpose**: Step-by-step procedures for responding to PM2 process isolation incidents on the projectium.com server.

**Audience**: On-call responders, system administrators, developers with server access.

**Last updated**: 2026-02-17

**Related documentation**:

- [CLAUDE.md - PM2 Process Isolation Rules](../../CLAUDE.md)
- [Incident Report: 2026-02-17](INCIDENT-2026-02-17-PM2-PROCESS-KILL.md)
- [Monitoring Guide](MONITORING.md)
- [Deployment Guide](DEPLOYMENT.md)

---

## Table of Contents

1. [Quick Reference](#quick-reference)
2. [Detection](#detection)
3. [Initial Assessment](#initial-assessment)
4. [Immediate Response](#immediate-response)
5. [Process Restoration](#process-restoration)
6. [Root Cause Investigation](#root-cause-investigation)
7. [Communication Templates](#communication-templates)
8. [Prevention Measures](#prevention-measures)
9. [Contact Information](#contact-information)
10. [Post-Incident Review](#post-incident-review)

---

## Quick Reference

### PM2 Process Inventory

| Application   | Environment | Process Names                                                                                | Config File                 | Directory                                     |
| ------------- | ----------- | -------------------------------------------------------------------------------------------- | --------------------------- | --------------------------------------------- |
| Flyer Crawler | Production  | `flyer-crawler-api`, `flyer-crawler-worker`, `flyer-crawler-analytics-worker`                | `ecosystem.config.cjs`      | `/var/www/flyer-crawler.projectium.com`       |
| Flyer Crawler | Test        | `flyer-crawler-api-test`, `flyer-crawler-worker-test`, `flyer-crawler-analytics-worker-test` | `ecosystem-test.config.cjs` | `/var/www/flyer-crawler-test.projectium.com`  |
| Stock Alert   | Production  | `stock-alert-*`                                                                               | (varies)                    | `/var/www/stock-alert.projectium.com`         |

### Critical Commands

```bash
# Check PM2 status
pm2 list

# Check specific process
pm2 show flyer-crawler-api

# View recent logs
pm2 logs --lines 50

# Restart specific processes (SAFE)
pm2 restart flyer-crawler-api flyer-crawler-worker flyer-crawler-analytics-worker

# DO NOT USE (affects ALL apps)
# pm2 restart all   <-- DANGEROUS
# pm2 stop all      <-- DANGEROUS
# pm2 delete all    <-- DANGEROUS
```

### Severity Classification

| Severity          | Criteria                                      | Response Time       | Example                                         |
| ----------------- | --------------------------------------------- | ------------------- | ----------------------------------------------- |
| **P1 - Critical** | Multiple applications down, production impact | Immediate (< 5 min) | All PM2 processes killed                        |
| **P2 - High**     | Single application down, production impact    | < 15 min            | Flyer Crawler prod down, Stock Alert unaffected |
| **P3 - Medium**   | Test environment only, no production impact   | < 1 hour            | Test processes killed, production unaffected    |

---

## Detection

### How to Identify a PM2 Incident

**Automated Indicators**:

- Health check failures on `/api/health/ready`
- Monitoring alerts (UptimeRobot, etc.)
- Bugsink showing connection errors
- NGINX returning 502 Bad Gateway

**User-Reported Symptoms**:

- "The site is down"
- "I can't log in"
- "Pages are loading slowly then timing out"
- "I see a 502 error"

**Manual Discovery**:

```bash
# SSH to server
ssh gitea-runner@projectium.com

# Check if PM2 is running
pm2 list

# Expected output shows processes
# If empty or all errored = incident
```
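If no external monitor is configured, a lightweight probe can provide the same automated signal. The following is a minimal sketch only, assuming the health endpoints documented above and a cron or systemd timer on a separate host; the alerting step at the end is a placeholder, not part of this project's tooling:

```bash
#!/usr/bin/env bash
# Minimal external health probe (sketch). Run from cron or a systemd timer on
# a host other than the one being monitored.
set -u

ENDPOINTS=(
  "https://flyer-crawler.projectium.com/api/health/ping"
  "https://flyer-crawler-test.projectium.com/api/health/ping"
)

failures=0
for url in "${ENDPOINTS[@]}"; do
  if ! curl -sf --max-time 10 "$url" > /dev/null; then
    echo "$(date -u +%FT%TZ) health check FAILED: $url"
    failures=$((failures + 1))
  fi
done

# Non-zero exit lets cron/systemd surface the failure; replace the echo with
# a call to the real notification channel.
if [ "$failures" -gt 0 ]; then
  echo "ALERT: $failures health check(s) failing - possible PM2 incident"
  exit 1
fi
```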
### Incident Signature: Process Isolation Violation

When a PM2 incident is caused by process isolation failure, you will see:

```text
# Expected state (normal):
+-------------------------------------+----+-------+--------+-----+
| App name                            | id | mode  | status | cpu |
+-------------------------------------+----+-------+--------+-----+
| flyer-crawler-api                   | 0  | clust | online | 0%  |
| flyer-crawler-worker                | 1  | fork  | online | 0%  |
| flyer-crawler-analytics-worker      | 2  | fork  | online | 0%  |
| flyer-crawler-api-test              | 3  | fork  | online | 0%  |
| flyer-crawler-worker-test           | 4  | fork  | online | 0%  |
| flyer-crawler-analytics-worker-test | 5  | fork  | online | 0%  |
| stock-alert-api                     | 6  | fork  | online | 0%  |
+-------------------------------------+----+-------+--------+-----+

# Incident state (isolation violation):
# All processes missing or errored - not just one app
+-------------------------------------+----+-------+--------+-----+
| App name                            | id | mode  | status | cpu |
+-------------------------------------+----+-------+--------+-----+
# (empty or all processes errored/stopped)
+-------------------------------------+----+-------+--------+-----+
```

---

## Initial Assessment

### Step 1: Gather Information (2 minutes)

Run these commands and capture output:

```bash
# 1. Check PM2 status
pm2 list

# 2. Check PM2 daemon status
pm2 ping

# 3. Check recent PM2 logs
pm2 logs --lines 20 --nostream

# 4. Check system status
systemctl status pm2-gitea-runner --no-pager

# 5. Check disk space
df -h /

# 6. Check memory
free -h

# 7. Check recent deployments (in app directory)
cd /var/www/flyer-crawler.projectium.com
git log --oneline -5
```

### Step 2: Determine Scope

| Question                 | Command                                                           | Impact Level                    |
| ------------------------ | ----------------------------------------------------------------- | ------------------------------- |
| How many apps affected?  | `pm2 list`                                                        | Count missing/errored processes |
| Is production down?      | `curl https://flyer-crawler.projectium.com/api/health/ping`       | Yes/No                          |
| Is test down?            | `curl https://flyer-crawler-test.projectium.com/api/health/ping`  | Yes/No                          |
| Are other apps affected? | `pm2 list \| grep stock-alert`                                    | Yes/No                          |

### Step 3: Classify Severity

```text
Decision Tree:

Production app(s) down?
|
+-- YES: Multiple apps affected?
|        |
|        +-- YES --> P1 CRITICAL (all apps down)
|        |
|        +-- NO  --> P2 HIGH (single app down)
|
+-- NO: Test environment only?
         |
         +-- YES --> P3 MEDIUM
         |
         +-- NO  --> Investigate further
```

### Step 4: Document Initial State

Capture this information before making any changes:

```bash
# Save PM2 state to file
pm2 jlist > /tmp/pm2-incident-$(date +%Y%m%d-%H%M%S).json

# Save system state
{
  echo "=== PM2 List ==="
  pm2 list
  echo ""
  echo "=== Disk Space ==="
  df -h
  echo ""
  echo "=== Memory ==="
  free -h
  echo ""
  echo "=== Recent Git Commits ==="
  cd /var/www/flyer-crawler.projectium.com && git log --oneline -5
} > /tmp/incident-state-$(date +%Y%m%d-%H%M%S).txt
```

---

## Immediate Response

### Priority 1: Stop Ongoing Deployments

If a deployment is currently running:

1. Check Gitea Actions for running workflows
2. Cancel any in-progress deployment workflows
3. Do NOT start new deployments until incident resolved

### Priority 2: Assess Which Processes Are Down

```bash
# Get list of processes and their status
pm2 list

# Check which processes exist but are errored/stopped
pm2 jlist | jq '.[] | {name, status: .pm2_env.status}'
```

### Priority 3: Establish Order of Restoration

Restore in this order (production first, critical path first):

| Priority | Process                               | Rationale                            |
| -------- | ------------------------------------- | ------------------------------------ |
| 1        | `flyer-crawler-api`                   | Production API - highest user impact |
| 2        | `flyer-crawler-worker`                | Production background jobs           |
| 3        | `flyer-crawler-analytics-worker`      | Production analytics                 |
| 4        | `stock-alert-*`                       | Other production apps                |
| 5        | `flyer-crawler-api-test`              | Test environment                     |
| 6        | `flyer-crawler-worker-test`           | Test background jobs                 |
| 7        | `flyer-crawler-analytics-worker-test` | Test analytics                       |
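The scenarios in the next section give the manual commands for each case. As a sketch only (paths, ports, and config names follow the inventory table above, and the health gate mirrors Scenario D), the same order can be scripted:

```bash
#!/usr/bin/env bash
# Sketch: restore applications in the documented priority order, gating on the
# production health check before moving on. Adjust paths if the layout differs.
set -u

# 1. Flyer Crawler production first (highest user impact)
cd /var/www/flyer-crawler.projectium.com
pm2 start ecosystem.config.cjs

# Wait up to ~60s for production to report healthy before touching anything else
status=""
for _ in $(seq 1 12); do
  status=$(curl -s http://localhost:3001/api/health/ready | jq -r '.data.status')
  if [ "$status" = "healthy" ]; then break; fi
  sleep 5
done

if [ "$status" != "healthy" ]; then
  echo "Production did not become healthy - stop here and investigate" >&2
  exit 1
fi

# 2. Other production apps, then the test environment
cd /var/www/stock-alert.projectium.com && pm2 start ecosystem.config.cjs
cd /var/www/flyer-crawler-test.projectium.com && pm2 start ecosystem-test.config.cjs

# 3. Persist the restored process list and confirm
pm2 save
pm2 list
```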
---

## Process Restoration

### Scenario A: Flyer Crawler Production Processes Missing

```bash
# Navigate to production directory
cd /var/www/flyer-crawler.projectium.com

# Start production processes
pm2 start ecosystem.config.cjs

# Verify processes started
pm2 list

# Check health endpoint
curl -s http://localhost:3001/api/health/ready | jq .
```

### Scenario B: Flyer Crawler Test Processes Missing

```bash
# Navigate to test directory
cd /var/www/flyer-crawler-test.projectium.com

# Start test processes
pm2 start ecosystem-test.config.cjs

# Verify processes started
pm2 list

# Check health endpoint
curl -s http://localhost:3002/api/health/ready | jq .
```

### Scenario C: Stock Alert Processes Missing

```bash
# Navigate to stock-alert directory
cd /var/www/stock-alert.projectium.com

# Start processes (adjust config file name as needed)
pm2 start ecosystem.config.cjs

# Verify processes started
pm2 list
```

### Scenario D: All Processes Missing

Execute restoration in priority order:

```bash
# 1. Flyer Crawler Production (highest priority)
cd /var/www/flyer-crawler.projectium.com
pm2 start ecosystem.config.cjs

# Verify production is healthy before continuing
curl -s http://localhost:3001/api/health/ready | jq '.data.status'
# Should return "healthy"

# 2. Stock Alert Production
cd /var/www/stock-alert.projectium.com
pm2 start ecosystem.config.cjs

# 3. Flyer Crawler Test (lower priority)
cd /var/www/flyer-crawler-test.projectium.com
pm2 start ecosystem-test.config.cjs

# 4. Save PM2 process list
pm2 save

# 5. Final verification
pm2 list
```
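Before walking through the per-application health endpoints below, a quick scripted status check can confirm that every registered process reports `online`. A minimal sketch, assuming `jq` is installed; note it cannot detect processes that are missing entirely, so still compare against the inventory table:

```bash
# Fail (non-zero jq exit) if any registered PM2 process is not online
if pm2 jlist | jq -e 'all(.[]; .pm2_env.status == "online")' > /dev/null; then
  echo "All registered PM2 processes report online"
else
  echo "Processes not online:"
  pm2 jlist | jq -r '.[] | select(.pm2_env.status != "online") | "\(.name): \(.pm2_env.status)"'
fi
```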
### Health Check Verification

After restoration, verify each application:

**Flyer Crawler Production**:

```bash
# API health
curl -s https://flyer-crawler.projectium.com/api/health/ready | jq '.data.status'
# Expected: "healthy"

# Check all services
curl -s https://flyer-crawler.projectium.com/api/health/ready | jq '.data.services'
```

**Flyer Crawler Test**:

```bash
curl -s https://flyer-crawler-test.projectium.com/api/health/ready | jq '.data.status'
```

**Stock Alert**:

```bash
# Adjust URL as appropriate for stock-alert
curl -s https://stock-alert.projectium.com/api/health/ready | jq '.data.status'
```

### Verification Checklist

After restoration, confirm:

- [ ] `pm2 list` shows all expected processes as `online`
- [ ] Production health check returns `healthy`
- [ ] Test health check returns `healthy` (if applicable)
- [ ] No processes showing high restart count
- [ ] No processes showing `errored` or `stopped` status
- [ ] PM2 process list saved: `pm2 save`

---

## Root Cause Investigation

### Step 1: Check Workflow Execution Logs

```bash
# Find recent Gitea Actions runs
# (Access via Gitea web UI: Repository > Actions > Recent Runs)

# Look for these workflows:
# - deploy-to-prod.yml
# - deploy-to-test.yml
# - manual-deploy-major.yml
# - manual-db-restore.yml
```

### Step 2: Check PM2 Daemon Logs

```bash
# PM2 daemon logs
cat ~/.pm2/pm2.log | tail -100

# PM2 process-specific logs
ls -la ~/.pm2/logs/

# Recent API logs
tail -100 ~/.pm2/logs/flyer-crawler-api-out.log
tail -100 ~/.pm2/logs/flyer-crawler-api-error.log
```

### Step 3: Check System Logs

```bash
# System journal for PM2 service
journalctl -u pm2-gitea-runner -n 100 --no-pager

# Kernel messages (OOM killer, etc.)
journalctl -k -n 50 --no-pager | grep -i "killed\|oom\|memory"

# Authentication logs (unauthorized access)
tail -50 /var/log/auth.log
```

### Step 4: Git History Analysis

```bash
# Recent commits to deployment workflows
cd /var/www/flyer-crawler.projectium.com
git log --oneline -20 -- .gitea/workflows/

# Check what changed in PM2 configs
git log --oneline -10 -- ecosystem.config.cjs ecosystem-test.config.cjs

# Diff against last known good state
git diff -- .gitea/workflows/ ecosystem*.cjs
```

### Step 5: Timing Correlation

Create a timeline:

```text
| Time (UTC) | Event                        | Source              |
|------------|------------------------------|---------------------|
| XX:XX      | Last successful health check | Monitoring          |
| XX:XX      | Deployment workflow started  | Gitea Actions       |
| XX:XX      | First failed health check    | Monitoring          |
| XX:XX      | Incident detected            | User report / Alert |
| XX:XX      | Investigation started        | On-call             |
```

### Common Root Causes

| Root Cause                   | Evidence                               | Prevention                   |
| ---------------------------- | -------------------------------------- | ---------------------------- |
| `pm2 stop all` in workflow   | Workflow logs show "all" command       | Use explicit process names   |
| `pm2 delete all` in workflow | Empty PM2 list after deploy            | Use whitelist-based deletion |
| OOM killer                   | `journalctl -k` shows "Killed process" | Increase memory limits       |
| Disk space exhaustion        | `df -h` shows 100%                     | Log rotation, cleanup        |
| Manual intervention          | Shell history shows pm2 commands       | Document all manual actions  |
| Concurrent deployments       | Multiple workflows at same time        | Implement deployment locks   |
| Workflow caching issue       | Old workflow version executed          | Force workflow refresh       |
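For the concurrent-deployments case, a simple mutual-exclusion lock around the PM2 steps of each workflow prevents two deployments from modifying processes at the same time. A minimal sketch using `flock`; the lock file path and timeout are illustrative assumptions, not existing project configuration:

```bash
#!/usr/bin/env bash
# Sketch: serialize deployments on the server with a file lock.
# /tmp/pm2-deploy.lock and the 10-minute timeout are illustrative choices.
set -euo pipefail

exec 200>/tmp/pm2-deploy.lock

if ! flock --wait 600 200; then
  echo "Another deployment is still running - aborting" >&2
  exit 1
fi

# ... deployment steps that touch PM2 go here, e.g.:
# cd /var/www/flyer-crawler.projectium.com
# pm2 restart flyer-crawler-api flyer-crawler-worker flyer-crawler-analytics-worker

# The lock is released automatically when the script (and fd 200) exits
```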
---

## Communication Templates

### Incident Notification (Internal)

```text
Subject: [P1 INCIDENT] PM2 Process Isolation Failure - Multiple Apps Down

Status: INVESTIGATING
Time Detected: YYYY-MM-DD HH:MM UTC
Affected Systems: [flyer-crawler-prod, stock-alert-prod, ...]

Summary:
All PM2 processes on the projectium.com server were terminated unexpectedly.
Multiple production applications are currently down.

Impact:
- flyer-crawler.projectium.com: DOWN
- stock-alert.projectium.com: DOWN
- [other affected apps]

Current Actions:
- Restoring critical production processes
- Investigating root cause

Next Update: In 15 minutes or upon status change

Incident Commander: [Name]
```

### Status Update Template

```text
Subject: [P1 INCIDENT] PM2 Process Isolation Failure - UPDATE #N

Status: [INVESTIGATING | IDENTIFIED | RESTORING | RESOLVED]
Time: YYYY-MM-DD HH:MM UTC

Progress Since Last Update:
- [Action taken]
- [Discovery made]
- [Process restored]

Current State:
- flyer-crawler.projectium.com: [UP|DOWN]
- stock-alert.projectium.com: [UP|DOWN]

Root Cause: [If identified]

Next Steps:
- [Planned action]

ETA to Resolution: [If known]

Next Update: In [X] minutes
```

### Resolution Notification

```text
Subject: [RESOLVED] PM2 Process Isolation Failure

Status: RESOLVED
Time Resolved: YYYY-MM-DD HH:MM UTC
Total Downtime: X minutes

Summary:
All PM2 processes have been restored. Services are operating normally.

Root Cause:
[Brief description of what caused the incident]

Impact Summary:
- flyer-crawler.projectium.com: Down for X minutes
- stock-alert.projectium.com: Down for X minutes
- Estimated user impact: [description]

Immediate Actions Taken:
1. [Action]
2. [Action]

Follow-up Actions:
1. [ ] [Preventive measure] - Owner: [Name] - Due: [Date]
2. [ ] Post-incident review scheduled for [Date]

Post-Incident Review: [Link or scheduled time]
```

---

## Prevention Measures

### Pre-Deployment Checklist

Before triggering any deployment:

- [ ] Review workflow file for PM2 commands
- [ ] Confirm no `pm2 stop all`, `pm2 delete all`, or `pm2 restart all`
- [ ] Verify process names are explicitly listed
- [ ] Check for concurrent deployment risks
- [ ] Confirm recent workflow changes were reviewed

### Workflow Review Checklist

When reviewing deployment workflow changes:

- [ ] All PM2 `stop` commands use explicit process names
- [ ] All PM2 `delete` commands filter by process name pattern
- [ ] All PM2 `restart` commands use explicit process names
- [ ] Test deployments filter by `-test` suffix
- [ ] Production deployments use whitelist array

**Safe Patterns**:

```javascript
// Excerpt from a deployment script: `list` is the parsed output of `pm2 jlist`
// and exec() comes from Node's child_process module.

// SAFE: Explicit process names (production)
const prodProcesses = [
  'flyer-crawler-api',
  'flyer-crawler-worker',
  'flyer-crawler-analytics-worker',
];
list.forEach((p) => {
  if (
    (p.pm2_env.status === 'errored' || p.pm2_env.status === 'stopped') &&
    prodProcesses.includes(p.name)
  ) {
    exec('pm2 delete ' + p.pm2_env.pm_id);
  }
});

// SAFE: Pattern-based filtering (test)
list.forEach((p) => {
  if (p.name && p.name.endsWith('-test')) {
    exec('pm2 delete ' + p.pm2_env.pm_id);
  }
});
```

**Dangerous Patterns** (NEVER USE):

```bash
# DANGEROUS - affects ALL applications
pm2 stop all
pm2 delete all
pm2 restart all

# DANGEROUS - no name filtering
pm2 delete $(pm2 jlist | jq -r '.[] | select(.pm2_env.status == "errored") | .pm_id')
```

### PM2 Configuration Validation

Before deploying PM2 config changes:

```bash
# Test configuration locally
cd /var/www/flyer-crawler.projectium.com
node -e "console.log(JSON.stringify(require('./ecosystem.config.cjs'), null, 2))"

# Verify process names
node -e "require('./ecosystem.config.cjs').apps.forEach(a => console.log(a.name))"

# Expected output should match documented process names
```
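To back up the checklists above with something automatable, the workflow files can be scanned for the dangerous patterns before a deployment is triggered. A minimal sketch; the workflow directory is the `.gitea/workflows/` path referenced under Root Cause Investigation, and the pattern list is illustrative and may need extending:

```bash
#!/usr/bin/env bash
# Sketch: fail if any Gitea workflow contains a PM2 command that targets "all".
# Run from the repository root.
set -u

WORKFLOW_DIR=".gitea/workflows"
PATTERN='pm2 (stop|restart|delete) all'

if grep -rEn "$PATTERN" "$WORKFLOW_DIR"; then
  echo "ERROR: found PM2 commands affecting ALL processes - fix before deploying" >&2
  exit 1
fi

echo "No 'pm2 ... all' commands found in $WORKFLOW_DIR"
```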
### Deployment Monitoring

After every deployment:

```bash
# Immediate verification
pm2 list

# Check no unexpected processes were affected
pm2 list | grep -v flyer-crawler
# Should still show other apps (e.g., stock-alert)

# Health check
curl -s https://flyer-crawler.projectium.com/api/health/ready | jq '.data.status'
```

---

## Contact Information

### On-Call Escalation

| Role              | Contact        | When to Escalate                    |
| ----------------- | -------------- | ----------------------------------- |
| Primary On-Call   | [Name/Channel] | First responder                     |
| Secondary On-Call | [Name/Channel] | If primary unavailable after 10 min |
| Engineering Lead  | [Name/Channel] | P1 incidents > 30 min               |
| Product Owner     | [Name/Channel] | User communication needed           |

### External Dependencies

| Service         | Support Channel | When to Contact         |
| --------------- | --------------- | ----------------------- |
| Server Provider | [Contact info]  | Hardware/network issues |
| DNS Provider    | [Contact info]  | DNS resolution failures |
| SSL Certificate | [Contact info]  | Certificate issues      |

### Communication Channels

| Channel        | Purpose                    |
| -------------- | -------------------------- |
| `#incidents`   | Real-time incident updates |
| `#deployments` | Deployment announcements   |
| `#engineering` | Technical discussion       |
| Email list     | Formal notifications       |

---

## Post-Incident Review

### Incident Report Template

```markdown
# Incident Report: [Title]

## Overview

| Field              | Value             |
| ------------------ | ----------------- |
| Date               | YYYY-MM-DD        |
| Duration           | X hours Y minutes |
| Severity           | P1/P2/P3          |
| Incident Commander | [Name]            |
| Status             | Resolved          |

## Timeline

| Time (UTC) | Event               |
| ---------- | ------------------- |
| HH:MM      | [Event description] |
| HH:MM      | [Event description] |

## Impact

- **Users affected**: [Number/description]
- **Revenue impact**: [If applicable]
- **SLA impact**: [If applicable]

## Root Cause

[Detailed technical explanation]

## Resolution

[What was done to resolve the incident]

## Contributing Factors

1. [Factor]
2. [Factor]

## Action Items

| Action   | Owner  | Due Date | Status |
| -------- | ------ | -------- | ------ |
| [Action] | [Name] | [Date]   | [ ]    |

## Lessons Learned

### What Went Well

- [Item]

### What Could Be Improved

- [Item]

## Appendix

- Link to monitoring data
- Link to relevant logs
- Link to workflow runs
```

### Lessons Learned Format

Use the "5 Whys" technique:

```text
Problem: All PM2 processes were killed during deployment

Why 1: The deployment workflow ran `pm2 delete all`
Why 2: The workflow used an outdated version of the script
Why 3: Gitea runner cached the old workflow file
Why 4: No mechanism to verify workflow version before execution
Why 5: Workflow versioning and audit trail not implemented

Root Cause: Lack of workflow versioning and execution verification

Preventive Measure: Implement workflow hash logging and pre-execution verification
```
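The preventive measure named above (workflow hash logging) can start as simply as recording a checksum of the workflow file at the beginning of every run, so the executed version can later be compared against the committed one. A minimal sketch; the log location and workflow filename are illustrative assumptions:

```bash
#!/usr/bin/env bash
# Sketch: log which workflow version actually executed.
# Intended as the first step of a deployment workflow; the log path is illustrative.
set -u

WORKFLOW_FILE=".gitea/workflows/deploy-to-prod.yml"

{
  echo "run started: $(date -u +%FT%TZ)"
  echo "git commit:  $(git rev-parse HEAD)"
  echo "workflow:    $WORKFLOW_FILE"
  echo "sha256:      $(sha256sum "$WORKFLOW_FILE" | awk '{print $1}')"
} >> /var/log/deploy-workflow-audit.log
```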
### Action Items Tracking

Create Gitea issues for each action item:

```bash
# Example shown with GitHub-style `gh` CLI syntax for illustration;
# adapt to the Gitea API or `tea` CLI used on this server.
gh issue create --title "Implement PM2 state logging in deployment workflows" \
  --body "Related to incident YYYY-MM-DD. Add pre-deployment PM2 state capture." \
  --label "incident-follow-up,priority:high"
```

Track action items in a central location:

| Issue # | Action                           | Owner  | Due    | Status |
| ------- | -------------------------------- | ------ | ------ | ------ |
| #123    | Add PM2 state logging            | [Name] | [Date] | Open   |
| #124    | Implement workflow version hash  | [Name] | [Date] | Open   |
| #125    | Create deployment lock mechanism | [Name] | [Date] | Open   |

---

## Appendix: PM2 Command Reference

### Safe Commands

```bash
# Status and monitoring
pm2 list
pm2 show <process-name>
pm2 monit
pm2 logs

# Restart specific processes
pm2 restart flyer-crawler-api
pm2 restart flyer-crawler-api flyer-crawler-worker flyer-crawler-analytics-worker

# Reload (zero-downtime, cluster mode only)
pm2 reload flyer-crawler-api

# Start from config
pm2 start ecosystem.config.cjs
pm2 start ecosystem.config.cjs --only flyer-crawler-api
```

### Dangerous Commands (Use With Caution)

```bash
# CAUTION: These affect ALL processes
pm2 stop all      # Stops every PM2 process
pm2 restart all   # Restarts every PM2 process
pm2 delete all    # Removes every PM2 process

# CAUTION: Modifies saved process list
pm2 save          # Overwrites saved process list
pm2 resurrect     # Restores from saved list

# CAUTION: Affects PM2 daemon
pm2 kill          # Kills PM2 daemon and all processes
pm2 update        # Updates PM2 in place (may cause brief outage)
```

---

## Revision History

| Date       | Author                 | Change                   |
| ---------- | ---------------------- | ------------------------ |
| 2026-02-17 | Incident Response Team | Initial runbook creation |