PM2 Incident Response Runbook

Purpose: Step-by-step procedures for responding to PM2 process isolation incidents on the projectium.com server.

Audience: On-call responders, system administrators, developers with server access.

Last updated: 2026-02-17

Related documentation:


Table of Contents

  1. Quick Reference
  2. Detection
  3. Initial Assessment
  4. Immediate Response
  5. Process Restoration
  6. Root Cause Investigation
  7. Communication Templates
  8. Prevention Measures
  9. Contact Information
  10. Post-Incident Review

Quick Reference

PM2 Process Inventory

| Application | Environment | Process Names | Config File | Directory |
| ----------- | ----------- | ------------- | ----------- | --------- |
| Flyer Crawler | Production | flyer-crawler-api, flyer-crawler-worker, flyer-crawler-analytics-worker | ecosystem.config.cjs | /var/www/flyer-crawler.projectium.com |
| Flyer Crawler | Test | flyer-crawler-api-test, flyer-crawler-worker-test, flyer-crawler-analytics-worker-test | ecosystem-test.config.cjs | /var/www/flyer-crawler-test.projectium.com |
| Stock Alert | Production | stock-alert-* (varies) | (varies) | /var/www/stock-alert.projectium.com |

Critical Commands

# Check PM2 status
pm2 list

# Check specific process
pm2 show flyer-crawler-api

# View recent logs
pm2 logs --lines 50

# Restart specific processes (SAFE)
pm2 restart flyer-crawler-api flyer-crawler-worker flyer-crawler-analytics-worker

# DO NOT USE (affects ALL apps)
# pm2 restart all    <-- DANGEROUS
# pm2 stop all       <-- DANGEROUS
# pm2 delete all     <-- DANGEROUS

Severity Classification

| Severity | Criteria | Response Time | Example |
| -------- | -------- | ------------- | ------- |
| P1 - Critical | Multiple applications down, production impact | Immediate (< 5 min) | All PM2 processes killed |
| P2 - High | Single application down, production impact | < 15 min | Flyer Crawler prod down, Stock Alert unaffected |
| P3 - Medium | Test environment only, no production impact | < 1 hour | Test processes killed, production unaffected |

Detection

How to Identify a PM2 Incident

Automated Indicators:

  • Health check failures on /api/health/ready
  • Monitoring alerts (UptimeRobot, etc.)
  • Bugsink showing connection errors
  • NGINX returning 502 Bad Gateway

User-Reported Symptoms:

  • "The site is down"
  • "I can't log in"
  • "Pages are loading slowly then timing out"
  • "I see a 502 error"

Manual Discovery:

# SSH to server
ssh gitea-runner@projectium.com

# Check if PM2 is running
pm2 list

# Expected output shows processes
# If empty or all errored = incident
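
Detection can also be automated with a lightweight watchdog run from cron. The following is a minimal sketch, assuming the production process names from the Quick Reference inventory; the echo lines are placeholders for a real alerting integration.

#!/usr/bin/env bash
# pm2-watchdog.sh - cron-friendly detection sketch
EXPECTED="flyer-crawler-api flyer-crawler-worker flyer-crawler-analytics-worker"

if ! json=$(pm2 jlist 2>/dev/null); then
  echo "ALERT: PM2 daemon is not responding" >&2
  exit 1
fi

for name in $EXPECTED; do
  status=$(echo "$json" | jq -r --arg n "$name" '.[] | select(.name == $n) | .pm2_env.status')
  if [ "$status" != "online" ]; then
    echo "ALERT: $name is '${status:-missing}'" >&2
  fi
done

# Cross-check the public health endpoint as well
if ! curl -sf https://flyer-crawler.projectium.com/api/health/ready > /dev/null; then
  echo "ALERT: production /api/health/ready is failing" >&2
fi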

Incident Signature: Process Isolation Violation

When a PM2 incident is caused by process isolation failure, you will see:

# Expected state (normal):
+-----------------------------------+----+-----+---------+-------+
| App name                          | id |mode | status  | cpu   |
+-----------------------------------+----+-----+---------+-------+
| flyer-crawler-api                 | 0  |clust| online  | 0%    |
| flyer-crawler-worker              | 1  |fork | online  | 0%    |
| flyer-crawler-analytics-worker    | 2  |fork | online  | 0%    |
| flyer-crawler-api-test            | 3  |fork | online  | 0%    |
| flyer-crawler-worker-test         | 4  |fork | online  | 0%    |
| flyer-crawler-analytics-worker-test| 5 |fork | online  | 0%    |
| stock-alert-api                   | 6  |fork | online  | 0%    |
+-----------------------------------+----+-----+---------+-------+

# Incident state (isolation violation):
# All processes missing or errored - not just one app
+-----------------------------------+----+-----+---------+-------+
| App name                          | id |mode | status  | cpu   |
+-----------------------------------+----+-----+---------+-------+
# (empty or all processes errored/stopped)
+-----------------------------------+----+-----+---------+-------+

Initial Assessment

Step 1: Gather Information (2 minutes)

Run these commands and capture output:

# 1. Check PM2 status
pm2 list

# 2. Check PM2 daemon status
pm2 ping

# 3. Check recent PM2 logs
pm2 logs --lines 20 --nostream

# 4. Check system status
systemctl status pm2-gitea-runner --no-pager

# 5. Check disk space
df -h /

# 6. Check memory
free -h

# 7. Check recent deployments (in app directory)
cd /var/www/flyer-crawler.projectium.com
git log --oneline -5

Step 2: Determine Scope

| Question | Command | Result |
| -------- | ------- | ------ |
| How many apps affected? | pm2 list | Count missing/errored processes |
| Is production down? | curl https://flyer-crawler.projectium.com/api/health/ping | Yes/No |
| Is test down? | curl https://flyer-crawler-test.projectium.com/api/health/ping | Yes/No |
| Are other apps affected? | pm2 list \| grep stock-alert | Yes/No |

Step 3: Classify Severity

Decision Tree:

Production app(s) down?
    |
    +-- YES: Multiple apps affected?
    |       |
    |       +-- YES --> P1 CRITICAL (all apps down)
    |       |
    |       +-- NO --> P2 HIGH (single app down)
    |
    +-- NO: Test environment only?
            |
            +-- YES --> P3 MEDIUM
            |
            +-- NO --> Investigate further
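
The decision tree can be scripted for faster triage. A minimal sketch, using the health endpoints from Step 2 and inferring Stock Alert status from pm2 jlist; the exact checks are assumptions to adapt:

#!/usr/bin/env bash
# triage.sh - suggest a severity per the decision tree (sketch)
prod_up=false; test_up=false; others_up=false

curl -sf https://flyer-crawler.projectium.com/api/health/ping > /dev/null && prod_up=true
curl -sf https://flyer-crawler-test.projectium.com/api/health/ping > /dev/null && test_up=true
pm2 jlist | jq -e '[.[] | select(.name | startswith("stock-alert"))
  | select(.pm2_env.status == "online")] | length > 0' > /dev/null && others_up=true

if ! $prod_up && ! $others_up; then
  echo "P1 CRITICAL: multiple production apps down"
elif ! $prod_up || ! $others_up; then
  echo "P2 HIGH: single production app down"
elif ! $test_up; then
  echo "P3 MEDIUM: test environment only"
else
  echo "No outage detected by these checks - investigate further"
fi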

Step 4: Document Initial State

Capture this information before making any changes:

# Save PM2 state to file
pm2 jlist > /tmp/pm2-incident-$(date +%Y%m%d-%H%M%S).json

# Save system state
{
  echo "=== PM2 List ==="
  pm2 list
  echo ""
  echo "=== Disk Space ==="
  df -h
  echo ""
  echo "=== Memory ==="
  free -h
  echo ""
  echo "=== Recent Git Commits ==="
  cd /var/www/flyer-crawler.projectium.com && git log --oneline -5
} > /tmp/incident-state-$(date +%Y%m%d-%H%M%S).txt

Immediate Response

Priority 1: Stop Ongoing Deployments

If a deployment is currently running:

  1. Check Gitea Actions for running workflows
  2. Cancel any in-progress deployment workflows
  3. Do NOT start new deployments until the incident is resolved

Priority 2: Assess Which Processes Are Down

# Get list of processes and their status
pm2 list

# Check which processes exist but are errored/stopped
pm2 jlist | jq '.[] | {name, status: .pm2_env.status}'
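
To surface only the processes that need attention, a jq filter along these lines may help (field names per standard pm2 jlist output):

# List only processes that are not online
pm2 jlist | jq -r '.[] | select(.pm2_env.status != "online") | "\(.name): \(.pm2_env.status)"'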

Priority 3: Establish Order of Restoration

Restore in this order (production first, critical path first):

| Priority | Process | Rationale |
| -------- | ------- | --------- |
| 1 | flyer-crawler-api | Production API - highest user impact |
| 2 | flyer-crawler-worker | Production background jobs |
| 3 | flyer-crawler-analytics-worker | Production analytics |
| 4 | stock-alert-* | Other production apps |
| 5 | flyer-crawler-api-test | Test environment |
| 6 | flyer-crawler-worker-test | Test background jobs |
| 7 | flyer-crawler-analytics-worker-test | Test analytics |

Process Restoration

Scenario A: Flyer Crawler Production Processes Missing

# Navigate to production directory
cd /var/www/flyer-crawler.projectium.com

# Start production processes
pm2 start ecosystem.config.cjs

# Verify processes started
pm2 list

# Check health endpoint
curl -s http://localhost:3001/api/health/ready | jq .

Scenario B: Flyer Crawler Test Processes Missing

# Navigate to test directory
cd /var/www/flyer-crawler-test.projectium.com

# Start test processes
pm2 start ecosystem-test.config.cjs

# Verify processes started
pm2 list

# Check health endpoint
curl -s http://localhost:3002/api/health/ready | jq .

Scenario C: Stock Alert Processes Missing

# Navigate to stock-alert directory
cd /var/www/stock-alert.projectium.com

# Start processes (adjust config file name as needed)
pm2 start ecosystem.config.cjs

# Verify processes started
pm2 list

Scenario D: All Processes Missing

Execute restoration in priority order:

# 1. Flyer Crawler Production (highest priority)
cd /var/www/flyer-crawler.projectium.com
pm2 start ecosystem.config.cjs

# Verify production is healthy before continuing
curl -s http://localhost:3001/api/health/ready | jq '.data.status'
# Should return "healthy"

# 2. Stock Alert Production
cd /var/www/stock-alert.projectium.com
pm2 start ecosystem.config.cjs

# 3. Flyer Crawler Test (lower priority)
cd /var/www/flyer-crawler-test.projectium.com
pm2 start ecosystem-test.config.cjs

# 4. Save PM2 process list
pm2 save

# 5. Final verification
pm2 list

Health Check Verification

After restoration, verify each application:

Flyer Crawler Production:

# API health
curl -s https://flyer-crawler.projectium.com/api/health/ready | jq '.data.status'
# Expected: "healthy"

# Check all services
curl -s https://flyer-crawler.projectium.com/api/health/ready | jq '.data.services'

Flyer Crawler Test:

curl -s https://flyer-crawler-test.projectium.com/api/health/ready | jq '.data.status'

Stock Alert:

# Adjust URL as appropriate for stock-alert
curl -s https://stock-alert.projectium.com/api/health/ready | jq '.data.status'

Verification Checklist

After restoration, confirm:

  • pm2 list shows all expected processes as online
  • Production health check returns healthy
  • Test health check returns healthy (if applicable)
  • No processes showing a high restart count (see the combined check below)
  • No processes showing errored or stopped status
  • PM2 process list saved: pm2 save
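
The status and restart-count checks can be combined in a single pass. A sketch; the threshold of 5 restarts is an arbitrary illustration, not an established limit:

# Flag processes with unexpected status or a high restart count
pm2 jlist | jq -r '.[] | select(.pm2_env.status != "online" or .pm2_env.restart_time > 5)
  | "\(.name): status=\(.pm2_env.status) restarts=\(.pm2_env.restart_time)"'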

Root Cause Investigation

Step 1: Check Workflow Execution Logs

# Find recent Gitea Actions runs
# (Access via Gitea web UI: Repository > Actions > Recent Runs)

# Look for these workflows:
# - deploy-to-prod.yml
# - deploy-to-test.yml
# - manual-deploy-major.yml
# - manual-db-restore.yml

Step 2: Check PM2 Daemon Logs

# PM2 daemon logs
cat ~/.pm2/pm2.log | tail -100

# PM2 process-specific logs
ls -la ~/.pm2/logs/

# Recent API logs
tail -100 ~/.pm2/logs/flyer-crawler-api-out.log
tail -100 ~/.pm2/logs/flyer-crawler-api-error.log

Step 3: Check System Logs

# System journal for PM2 service
journalctl -u pm2-gitea-runner -n 100 --no-pager

# Kernel messages (OOM killer, etc.)
journalctl -k -n 50 --no-pager | grep -i "killed\|oom\|memory"

# Authentication logs (unauthorized access)
tail -50 /var/log/auth.log

Step 4: Git History Analysis

# Recent commits to deployment workflows
cd /var/www/flyer-crawler.projectium.com
git log --oneline -20 -- .gitea/workflows/

# Check what changed in PM2 configs
git log --oneline -10 -- ecosystem.config.cjs ecosystem-test.config.cjs

# Diff against last known good state
git diff <last-good-commit> -- .gitea/workflows/ ecosystem*.cjs

Step 5: Timing Correlation

Create a timeline:

| Time (UTC) | Event | Source |
|------------|-------|--------|
| XX:XX | Last successful health check | Monitoring |
| XX:XX | Deployment workflow started | Gitea Actions |
| XX:XX | First failed health check | Monitoring |
| XX:XX | Incident detected | User report / Alert |
| XX:XX | Investigation started | On-call |

Common Root Causes

| Root Cause | Evidence | Prevention |
| ---------- | -------- | ---------- |
| pm2 stop all in workflow | Workflow logs show "all" command | Use explicit process names |
| pm2 delete all in workflow | Empty PM2 list after deploy | Use whitelist-based deletion |
| OOM killer | journalctl -k shows "Killed process" | Increase memory limits |
| Disk space exhaustion | df -h shows 100% | Log rotation, cleanup |
| Manual intervention | Shell history shows pm2 commands | Document all manual actions |
| Concurrent deployments | Multiple workflows at same time | Implement deployment locks (see sketch below) |
| Workflow caching issue | Old workflow version executed | Force workflow refresh |
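
For the deployment-lock prevention in the table above, one common approach is a flock-based mutex around the deploy steps. A minimal sketch (the lock file path is an arbitrary choice):

#!/usr/bin/env bash
# deploy-with-lock.sh - serialize deployments with flock (sketch)
set -euo pipefail

LOCKFILE=/tmp/flyer-crawler-deploy.lock

exec 200>"$LOCKFILE"
if ! flock -n 200; then
  echo "Another deployment is in progress; aborting." >&2
  exit 1
fi

# ... deployment steps run here while the lock is held ...
# The lock is released automatically when the script exits.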

Communication Templates

Incident Notification (Internal)

Subject: [P1 INCIDENT] PM2 Process Isolation Failure - Multiple Apps Down

Status: INVESTIGATING
Time Detected: YYYY-MM-DD HH:MM UTC
Affected Systems: [flyer-crawler-prod, stock-alert-prod, ...]

Summary:
All PM2 processes on projectium.com server were terminated unexpectedly.
Multiple production applications are currently down.

Impact:
- flyer-crawler.projectium.com: DOWN
- stock-alert.projectium.com: DOWN
- [other affected apps]

Current Actions:
- Restoring critical production processes
- Investigating root cause

Next Update: In 15 minutes or upon status change

Incident Commander: [Name]

Status Update Template

Subject: [P1 INCIDENT] PM2 Process Isolation Failure - UPDATE #N

Status: [INVESTIGATING | IDENTIFIED | RESTORING | RESOLVED]
Time: YYYY-MM-DD HH:MM UTC

Progress Since Last Update:
- [Action taken]
- [Discovery made]
- [Process restored]

Current State:
- flyer-crawler.projectium.com: [UP|DOWN]
- stock-alert.projectium.com: [UP|DOWN]

Root Cause: [If identified]

Next Steps:
- [Planned action]

ETA to Resolution: [If known]

Next Update: In [X] minutes

Resolution Notification

Subject: [RESOLVED] PM2 Process Isolation Failure

Status: RESOLVED
Time Resolved: YYYY-MM-DD HH:MM UTC
Total Downtime: X minutes

Summary:
All PM2 processes have been restored. Services are operating normally.

Root Cause:
[Brief description of what caused the incident]

Impact Summary:
- flyer-crawler.projectium.com: Down for X minutes
- stock-alert.projectium.com: Down for X minutes
- Estimated user impact: [description]

Immediate Actions Taken:
1. [Action]
2. [Action]

Follow-up Actions:
1. [ ] [Preventive measure] - Owner: [Name] - Due: [Date]
2. [ ] Post-incident review scheduled for [Date]

Post-Incident Review: [Link or scheduled time]

Prevention Measures

Pre-Deployment Checklist

Before triggering any deployment:

  • Review workflow file for PM2 commands
  • Confirm no pm2 stop all, pm2 delete all, or pm2 restart all (a grep check is sketched below)
  • Verify process names are explicitly listed
  • Check for concurrent deployment risks
  • Confirm recent workflow changes were reviewed
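
The blanket-command item can be checked mechanically. A sketch, assuming workflows live under .gitea/workflows/:

# Fail if any workflow contains a blanket PM2 command
if grep -rnE 'pm2 (stop|restart|delete) +all' .gitea/workflows/; then
  echo "Dangerous PM2 command found - fix before deploying" >&2
  exit 1
fi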

Workflow Review Checklist

When reviewing deployment workflow changes:

  • All PM2 stop commands use explicit process names
  • All PM2 delete commands filter by process name pattern
  • All PM2 restart commands use explicit process names
  • Test deployments filter by -test suffix
  • Production deployments use whitelist array

Safe Patterns:

// SAFE: Explicit process names (production)
// `list` is the parsed output of `pm2 jlist`
const { execSync } = require('child_process');
const list = JSON.parse(execSync('pm2 jlist').toString());

const prodProcesses = [
  'flyer-crawler-api',
  'flyer-crawler-worker',
  'flyer-crawler-analytics-worker',
];
list.forEach((p) => {
  // Delete only errored/stopped processes that are on the explicit whitelist
  if (
    (p.pm2_env.status === 'errored' || p.pm2_env.status === 'stopped') &&
    prodProcesses.includes(p.name)
  ) {
    execSync(`pm2 delete ${p.pm2_env.pm_id}`);
  }
});

// SAFE: Pattern-based filtering (test)
list.forEach((p) => {
  // Touch only processes whose name carries the -test suffix
  if (p.name && p.name.endsWith('-test')) {
    execSync(`pm2 delete ${p.pm2_env.pm_id}`);
  }
});

Dangerous Patterns (NEVER USE):

# DANGEROUS - affects ALL applications
pm2 stop all
pm2 delete all
pm2 restart all

# DANGEROUS - no name filtering
pm2 delete $(pm2 jlist | jq -r '.[] | select(.pm2_env.status == "errored") | .pm_id')
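
As an extra safety net for interactive sessions, a shell wrapper can refuse blanket commands outright. This is an illustrative sketch (not an existing setup) that could be added to the runner user's ~/.bashrc:

# Guard against accidental "pm2 <verb> all" in interactive shells
pm2() {
  case " $* " in
    *" stop all "* | *" restart all "* | *" delete all "*)
      echo "Refusing 'pm2 ... all' - use explicit process names (see runbook)." >&2
      return 1
      ;;
  esac
  command pm2 "$@"
}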

PM2 Configuration Validation

Before deploying PM2 config changes:

# Test configuration locally
cd /var/www/flyer-crawler.projectium.com
node -e "console.log(JSON.stringify(require('./ecosystem.config.cjs'), null, 2))"

# Verify process names
node -e "require('./ecosystem.config.cjs').apps.forEach(a => console.log(a.name))"

# Expected output should match documented process names
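
To make this check mechanical, the printed names can be diffed against the documented inventory. A sketch, assuming the production process names from the Quick Reference:

# Compare config process names against the documented inventory
expected="flyer-crawler-api
flyer-crawler-worker
flyer-crawler-analytics-worker"

actual=$(node -e "require('./ecosystem.config.cjs').apps.forEach(a => console.log(a.name))")

if [ "$actual" != "$expected" ]; then
  echo "Process names differ from the documented inventory:" >&2
  diff <(echo "$expected") <(echo "$actual") >&2 || true
fi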

Deployment Monitoring

After every deployment:

# Immediate verification
pm2 list

# Check no unexpected processes were affected
pm2 list | grep -v flyer-crawler
# Should still show other apps (e.g., stock-alert)

# Health check
curl -s https://flyer-crawler.projectium.com/api/health/ready | jq '.data.status'

Contact Information

On-Call Escalation

| Role | Contact | When to Escalate |
| ---- | ------- | ---------------- |
| Primary On-Call | [Name/Channel] | First responder |
| Secondary On-Call | [Name/Channel] | If primary unavailable after 10 min |
| Engineering Lead | [Name/Channel] | P1 incidents > 30 min |
| Product Owner | [Name/Channel] | User communication needed |

External Dependencies

| Service | Support Channel | When to Contact |
| ------- | --------------- | --------------- |
| Server Provider | [Contact info] | Hardware/network issues |
| DNS Provider | [Contact info] | DNS resolution failures |
| SSL Certificate | [Contact info] | Certificate issues |

Communication Channels

| Channel | Purpose |
| ------- | ------- |
| #incidents | Real-time incident updates |
| #deployments | Deployment announcements |
| #engineering | Technical discussion |
| Email list | Formal notifications |

Post-Incident Review

Incident Report Template

# Incident Report: [Title]

## Overview

| Field              | Value             |
| ------------------ | ----------------- |
| Date               | YYYY-MM-DD        |
| Duration           | X hours Y minutes |
| Severity           | P1/P2/P3          |
| Incident Commander | [Name]            |
| Status             | Resolved          |

## Timeline

| Time (UTC) | Event               |
| ---------- | ------------------- |
| HH:MM      | [Event description] |
| HH:MM      | [Event description] |

## Impact

- **Users affected**: [Number/description]
- **Revenue impact**: [If applicable]
- **SLA impact**: [If applicable]

## Root Cause

[Detailed technical explanation]

## Resolution

[What was done to resolve the incident]

## Contributing Factors

1. [Factor]
2. [Factor]

## Action Items

| Action   | Owner  | Due Date | Status |
| -------- | ------ | -------- | ------ |
| [Action] | [Name] | [Date]   | [ ]    |

## Lessons Learned

### What Went Well

- [Item]

### What Could Be Improved

- [Item]

## Appendix

- Link to monitoring data
- Link to relevant logs
- Link to workflow runs

Lessons Learned Format

Use the "5 Whys" technique:

Problem: All PM2 processes were killed during deployment

Why 1: The deployment workflow ran `pm2 delete all`
Why 2: The workflow used an outdated version of the script
Why 3: Gitea runner cached the old workflow file
Why 4: No mechanism to verify workflow version before execution
Why 5: Workflow versioning and audit trail not implemented

Root Cause: Lack of workflow versioning and execution verification

Preventive Measure: Implement workflow hash logging and pre-execution verification

Action Items Tracking

Create Gitea issues for each action item:

# Illustrative example (GitHub CLI syntax shown; adapt to your Gitea tooling)
gh issue create --title "Implement PM2 state logging in deployment workflows" \
  --body "Related to incident YYYY-MM-DD. Add pre-deployment PM2 state capture." \
  --label "incident-follow-up,priority:high"
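
Since the repository is hosted on Gitea, the equivalent call against Gitea's REST API is POST /api/v1/repos/{owner}/{repo}/issues. A sketch with placeholder host, owner, repo, and token:

# Create a follow-up issue via the Gitea API (placeholders: host, OWNER, REPO, token)
curl -s -X POST "https://gitea.example.com/api/v1/repos/OWNER/REPO/issues" \
  -H "Authorization: token $GITEA_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"title": "Implement PM2 state logging in deployment workflows", "body": "Related to incident YYYY-MM-DD."}'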

Track action items in a central location:

| Issue # | Action | Owner | Due | Status |
| ------- | ------ | ----- | --- | ------ |
| #123 | Add PM2 state logging | [Name] | [Date] | Open |
| #124 | Implement workflow version hash | [Name] | [Date] | Open |
| #125 | Create deployment lock mechanism | [Name] | [Date] | Open |

Appendix: PM2 Command Reference

Safe Commands

# Status and monitoring
pm2 list
pm2 show <process-name>
pm2 monit
pm2 logs <process-name>

# Restart specific processes
pm2 restart flyer-crawler-api
pm2 restart flyer-crawler-api flyer-crawler-worker flyer-crawler-analytics-worker

# Reload (zero-downtime, cluster mode only)
pm2 reload flyer-crawler-api

# Start from config
pm2 start ecosystem.config.cjs
pm2 start ecosystem.config.cjs --only flyer-crawler-api

Dangerous Commands (Use With Caution)

# CAUTION: These affect ALL processes
pm2 stop all        # Stops every PM2 process
pm2 restart all     # Restarts every PM2 process
pm2 delete all      # Removes every PM2 process

# CAUTION: Modifies saved process list
pm2 save            # Overwrites saved process list
pm2 resurrect       # Restores from saved list

# CAUTION: Affects PM2 daemon
pm2 kill            # Kills PM2 daemon and all processes
pm2 update          # Updates PM2 in place (may cause brief outage)

Revision History

| Date | Author | Change |
| ---- | ------ | ------ |
| 2026-02-17 | Incident Response Team | Initial runbook creation |