# PM2 Incident Response Runbook

**Purpose**: Step-by-step procedures for responding to PM2 process isolation incidents on the projectium.com server.

**Audience**: On-call responders, system administrators, developers with server access.

**Last updated**: 2026-02-17

**Related documentation**:

- [CLAUDE.md - PM2 Process Isolation Rules](../../CLAUDE.md)
- [Incident Report: 2026-02-17](INCIDENT-2026-02-17-PM2-PROCESS-KILL.md)
- [Monitoring Guide](MONITORING.md)
- [Deployment Guide](DEPLOYMENT.md)

---

## Table of Contents

1. [Quick Reference](#quick-reference)
2. [Detection](#detection)
3. [Initial Assessment](#initial-assessment)
4. [Immediate Response](#immediate-response)
5. [Process Restoration](#process-restoration)
6. [Root Cause Investigation](#root-cause-investigation)
7. [Communication Templates](#communication-templates)
8. [Prevention Measures](#prevention-measures)
9. [Contact Information](#contact-information)
10. [Post-Incident Review](#post-incident-review)

---

## Quick Reference

### PM2 Process Inventory

| Application   | Environment | Process Names                                                                                | Config File                 | Directory                                    |
| ------------- | ----------- | -------------------------------------------------------------------------------------------- | --------------------------- | -------------------------------------------- |
| Flyer Crawler | Production  | `flyer-crawler-api`, `flyer-crawler-worker`, `flyer-crawler-analytics-worker`                | `ecosystem.config.cjs`      | `/var/www/flyer-crawler.projectium.com`      |
| Flyer Crawler | Test        | `flyer-crawler-api-test`, `flyer-crawler-worker-test`, `flyer-crawler-analytics-worker-test` | `ecosystem-test.config.cjs` | `/var/www/flyer-crawler-test.projectium.com` |
| Stock Alert   | Production  | `stock-alert-*`                                                                              | (varies)                    | `/var/www/stock-alert.projectium.com`        |

### Critical Commands

```bash
# Check PM2 status
pm2 list

# Check specific process
pm2 show flyer-crawler-api

# View recent logs
pm2 logs --lines 50

# Restart specific processes (SAFE)
pm2 restart flyer-crawler-api flyer-crawler-worker flyer-crawler-analytics-worker

# DO NOT USE (affects ALL apps)
# pm2 restart all  <-- DANGEROUS
# pm2 stop all     <-- DANGEROUS
# pm2 delete all   <-- DANGEROUS
```

### Severity Classification

| Severity          | Criteria                                      | Response Time       | Example                                         |
| ----------------- | --------------------------------------------- | ------------------- | ----------------------------------------------- |
| **P1 - Critical** | Multiple applications down, production impact | Immediate (< 5 min) | All PM2 processes killed                        |
| **P2 - High**     | Single application down, production impact    | < 15 min            | Flyer Crawler prod down, Stock Alert unaffected |
| **P3 - Medium**   | Test environment only, no production impact   | < 1 hour            | Test processes killed, production unaffected    |

---

## Detection

### How to Identify a PM2 Incident

**Automated Indicators**:

- Health check failures on `/api/health/ready`
- Monitoring alerts (UptimeRobot, etc.)
- Bugsink showing connection errors
- NGINX returning 502 Bad Gateway

**User-Reported Symptoms**:

- "The site is down"
- "I can't log in"
- "Pages are loading slowly then timing out"
- "I see a 502 error"

**Manual Discovery**:

```bash
# SSH to server
ssh gitea-runner@projectium.com

# Check if PM2 is running
pm2 list

# Expected output shows processes
# If empty or all errored = incident
```
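
The manual check can be made less error-prone by comparing the running process names against the inventory. The sketch below is one hypothetical way to do that: `missing_processes` is an illustrative name, not an existing tool, and the example invocation assumes `jq` is installed on the server.

```bash
# Hypothetical helper: print every expected process name (given as arguments)
# that is absent from the list of online names supplied on stdin,
# one name per line.
missing_processes() {
  local online name
  online="$(cat)"
  for name in "$@"; do
    # -x matches the whole line, -F treats the name as a literal string,
    # so flyer-crawler-api-test does not count as flyer-crawler-api
    if ! printf '%s\n' "$online" | grep -qxF -- "$name"; then
      printf '%s\n' "$name"
    fi
  done
}

# Example (run on the server, requires jq):
#   pm2 jlist \
#     | jq -r '.[] | select(.pm2_env.status == "online") | .name' \
#     | missing_processes flyer-crawler-api flyer-crawler-worker flyer-crawler-analytics-worker
# Any name printed is an expected process that is not online.
```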

### Incident Signature: Process Isolation Violation

When a PM2 incident is caused by process isolation failure, you will see:

```text
# Expected state (normal):
+-------------------------------------+----+-------+---------+-----+
| App name                            | id | mode  | status  | cpu |
+-------------------------------------+----+-------+---------+-----+
| flyer-crawler-api                   | 0  | clust | online  | 0%  |
| flyer-crawler-worker                | 1  | fork  | online  | 0%  |
| flyer-crawler-analytics-worker      | 2  | fork  | online  | 0%  |
| flyer-crawler-api-test              | 3  | fork  | online  | 0%  |
| flyer-crawler-worker-test           | 4  | fork  | online  | 0%  |
| flyer-crawler-analytics-worker-test | 5  | fork  | online  | 0%  |
| stock-alert-api                     | 6  | fork  | online  | 0%  |
+-------------------------------------+----+-------+---------+-----+

# Incident state (isolation violation):
# All processes missing or errored - not just one app
+-------------------------------------+----+-------+---------+-----+
| App name                            | id | mode  | status  | cpu |
+-------------------------------------+----+-------+---------+-----+
# (empty or all processes errored/stopped)
+-------------------------------------+----+-------+---------+-----+
```

---

## Initial Assessment

### Step 1: Gather Information (2 minutes)

Run these commands and capture output:

```bash
# 1. Check PM2 status
pm2 list

# 2. Check PM2 daemon status
pm2 ping

# 3. Check recent PM2 logs
pm2 logs --lines 20 --nostream

# 4. Check system status
systemctl status pm2-gitea-runner --no-pager

# 5. Check disk space
df -h /

# 6. Check memory
free -h

# 7. Check recent deployments (in app directory)
cd /var/www/flyer-crawler.projectium.com
git log --oneline -5
```

### Step 2: Determine Scope

| Question                 | Command                                                          | Impact Level                    |
| ------------------------ | ---------------------------------------------------------------- | ------------------------------- |
| How many apps affected?  | `pm2 list`                                                       | Count missing/errored processes |
| Is production down?      | `curl https://flyer-crawler.projectium.com/api/health/ping`      | Yes/No                          |
| Is test down?            | `curl https://flyer-crawler-test.projectium.com/api/health/ping` | Yes/No                          |
| Are other apps affected? | `pm2 list \| grep stock-alert` | Yes/No |
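
To answer the production and test questions consistently, the HTTP status code can be mapped to an up/down verdict. This is a minimal sketch: `health_state` and `probe` are hypothetical helper names, and treating only 2xx responses as up is an assumption about how the health endpoints behave.

```bash
# Hypothetical helper: map an HTTP status code to UP or DOWN.
# curl reports 000 when it cannot connect at all.
health_state() {
  case "$1" in
    2??) echo "UP" ;;
    *)   echo "DOWN" ;;
  esac
}

# Probe one URL and print "<state> <code> <url>".
probe() {
  local url="$1" code
  # -o /dev/null discards the body, -w prints only the status code,
  # -m 10 gives up after 10 seconds
  code="$(curl -s -o /dev/null -m 10 -w '%{http_code}' "$url")" || true
  echo "$(health_state "$code") ${code:-000} $url"
}

# Example:
#   probe https://flyer-crawler.projectium.com/api/health/ping
#   probe https://flyer-crawler-test.projectium.com/api/health/ping
```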

### Step 3: Classify Severity

```text
Decision Tree:

Production app(s) down?
    |
    +-- YES: Multiple apps affected?
    |       |
    |       +-- YES --> P1 CRITICAL (all apps down)
    |       |
    |       +-- NO --> P2 HIGH (single app down)
    |
    +-- NO: Test environment only?
            |
            +-- YES --> P3 MEDIUM
            |
            +-- NO --> Investigate further
```
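
The decision tree can also be expressed as a small shell function, which may be handy in alerting scripts. `classify_severity` and its yes/no arguments are illustrative assumptions, not part of any existing tooling.

```bash
# Hypothetical helper mirroring the decision tree.
# Arguments are "yes" or "no" answers:
#   $1 = is any production app down?
#   $2 = is more than one application affected?
#   $3 = is the test environment down?
classify_severity() {
  local prod_down="$1" multiple_apps="$2" test_down="$3"
  if [ "$prod_down" = "yes" ]; then
    if [ "$multiple_apps" = "yes" ]; then
      echo "P1"
    else
      echo "P2"
    fi
  elif [ "$test_down" = "yes" ]; then
    echo "P3"
  else
    echo "INVESTIGATE"
  fi
}

# Example: production down, several apps affected
#   classify_severity yes yes no
```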

### Step 4: Document Initial State

Capture this information before making any changes:

```bash
# Save PM2 state to file
pm2 jlist > /tmp/pm2-incident-$(date +%Y%m%d-%H%M%S).json

# Save system state
{
  echo "=== PM2 List ==="
  pm2 list
  echo ""
  echo "=== Disk Space ==="
  df -h
  echo ""
  echo "=== Memory ==="
  free -h
  echo ""
  echo "=== Recent Git Commits ==="
  # git -C avoids changing the working directory of the current shell
  git -C /var/www/flyer-crawler.projectium.com log --oneline -5
} > /tmp/incident-state-$(date +%Y%m%d-%H%M%S).txt
```

---

## Immediate Response

### Priority 1: Stop Ongoing Deployments

If a deployment is currently running:

1. Check Gitea Actions for running workflows
2. Cancel any in-progress deployment workflows
3. Do NOT start new deployments until the incident is resolved

### Priority 2: Assess Which Processes Are Down

```bash
# Get list of processes and their status
pm2 list

# Check which processes exist but are errored/stopped
pm2 jlist | jq '.[] | {name, status: .pm2_env.status}'
```

### Priority 3: Establish Order of Restoration

Restore in this order (production first, critical path first):

| Priority | Process                               | Rationale                            |
| -------- | ------------------------------------- | ------------------------------------ |
| 1        | `flyer-crawler-api`                   | Production API - highest user impact |
| 2        | `flyer-crawler-worker`                | Production background jobs           |
| 3        | `flyer-crawler-analytics-worker`      | Production analytics                 |
| 4        | `stock-alert-*`                       | Other production apps                |
| 5        | `flyer-crawler-api-test`              | Test environment                     |
| 6        | `flyer-crawler-worker-test`           | Test background jobs                 |
| 7        | `flyer-crawler-analytics-worker-test` | Test analytics                       |

---

## Process Restoration

### Scenario A: Flyer Crawler Production Processes Missing

```bash
# Navigate to production directory
cd /var/www/flyer-crawler.projectium.com

# Start production processes
pm2 start ecosystem.config.cjs

# Verify processes started
pm2 list

# Check health endpoint
curl -s http://localhost:3001/api/health/ready | jq .
```

### Scenario B: Flyer Crawler Test Processes Missing

```bash
# Navigate to test directory
cd /var/www/flyer-crawler-test.projectium.com

# Start test processes
pm2 start ecosystem-test.config.cjs

# Verify processes started
pm2 list

# Check health endpoint
curl -s http://localhost:3002/api/health/ready | jq .
```

### Scenario C: Stock Alert Processes Missing

```bash
# Navigate to stock-alert directory
cd /var/www/stock-alert.projectium.com

# Start processes (adjust config file name as needed)
pm2 start ecosystem.config.cjs

# Verify processes started
pm2 list
```

### Scenario D: All Processes Missing

Execute restoration in priority order:

```bash
# 1. Flyer Crawler Production (highest priority)
cd /var/www/flyer-crawler.projectium.com
pm2 start ecosystem.config.cjs

# Verify production is healthy before continuing
curl -s http://localhost:3001/api/health/ready | jq '.data.status'
# Should return "healthy"

# 2. Stock Alert Production
cd /var/www/stock-alert.projectium.com
pm2 start ecosystem.config.cjs

# 3. Flyer Crawler Test (lower priority)
cd /var/www/flyer-crawler-test.projectium.com
pm2 start ecosystem-test.config.cjs

# 4. Save PM2 process list
pm2 save

# 5. Final verification
pm2 list
```
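
A freshly started API may need a few seconds before its health endpoint responds, so the "verify before continuing" step in the sequence above can be wrapped in a retry loop. A minimal sketch, assuming `jq` is installed: `retry_until` and `prod_is_healthy` are hypothetical names, and the retry count and delay are arbitrary.

```bash
# Hypothetical helper: run a command up to MAX times, sleeping DELAY seconds
# between attempts. Returns 0 on the first success, 1 if every attempt fails.
retry_until() {
  local max="$1" delay="$2" attempt=1
  shift 2
  while [ "$attempt" -le "$max" ]; do
    if "$@"; then
      return 0
    fi
    # Skip the sleep after the final attempt
    if [ "$attempt" -lt "$max" ]; then
      sleep "$delay"
    fi
    attempt=$((attempt + 1))
  done
  return 1
}

# Assumed probe: succeeds only when the production endpoint reports "healthy".
prod_is_healthy() {
  [ "$(curl -s -m 5 http://localhost:3001/api/health/ready | jq -r '.data.status')" = "healthy" ]
}

# Example: wait up to about a minute before starting the next application.
#   retry_until 12 5 prod_is_healthy || echo "production API still unhealthy" >&2
```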

### Health Check Verification

After restoration, verify each application:

**Flyer Crawler Production**:

```bash
# API health
curl -s https://flyer-crawler.projectium.com/api/health/ready | jq '.data.status'
# Expected: "healthy"

# Check all services
curl -s https://flyer-crawler.projectium.com/api/health/ready | jq '.data.services'
```

**Flyer Crawler Test**:

```bash
curl -s https://flyer-crawler-test.projectium.com/api/health/ready | jq '.data.status'
```

**Stock Alert**:

```bash
# Adjust URL as appropriate for stock-alert
curl -s https://stock-alert.projectium.com/api/health/ready | jq '.data.status'
```

### Verification Checklist

After restoration, confirm:

- [ ] `pm2 list` shows all expected processes as `online`
- [ ] Production health check returns `healthy`
- [ ] Test health check returns `healthy` (if applicable)
- [ ] No processes showing high restart count
- [ ] No processes showing `errored` or `stopped` status
- [ ] PM2 process list saved: `pm2 save`
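
Parts of this checklist could be automated. The sketch below flags processes that are not `online` or that have restarted more often than a threshold. `check_process_table` is a hypothetical helper, the threshold of 5 is arbitrary, and the example invocation assumes `jq` and the `restart_time` field reported by `pm2 jlist`.

```bash
# Hypothetical helper: read "name status restarts" lines on stdin and report
# any process that is not online or has restarted more than MAX times.
# MAX defaults to 5, an arbitrary threshold chosen for illustration.
# Returns 0 when every process passes, 1 otherwise.
check_process_table() {
  local max="${1:-5}"
  awk -v max="$max" '
    $2 != "online" { printf "NOT ONLINE: %s (%s)\n", $1, $2; bad = 1 }
    $3 + 0 > max   { printf "HIGH RESTARTS: %s (%d)\n", $1, $3; bad = 1 }
    END            { exit bad }
  '
}

# Example (requires jq):
#   pm2 jlist \
#     | jq -r '.[] | "\(.name) \(.pm2_env.status) \(.pm2_env.restart_time)"' \
#     | check_process_table 5
```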

---

## Root Cause Investigation

### Step 1: Check Workflow Execution Logs

```bash
# Find recent Gitea Actions runs
# (Access via Gitea web UI: Repository > Actions > Recent Runs)

# Look for these workflows:
# - deploy-to-prod.yml
# - deploy-to-test.yml
# - manual-deploy-major.yml
# - manual-db-restore.yml
```

### Step 2: Check PM2 Daemon Logs

```bash
# PM2 daemon logs
tail -n 100 ~/.pm2/pm2.log

# PM2 process-specific logs
ls -la ~/.pm2/logs/

# Recent API logs
tail -n 100 ~/.pm2/logs/flyer-crawler-api-out.log
tail -n 100 ~/.pm2/logs/flyer-crawler-api-error.log
```

### Step 3: Check System Logs

```bash
# System journal for PM2 service
journalctl -u pm2-gitea-runner -n 100 --no-pager

# Kernel messages (OOM killer, etc.)
journalctl -k -n 50 --no-pager | grep -i "killed\|oom\|memory"

# Authentication logs (unauthorized access)
tail -50 /var/log/auth.log
```

### Step 4: Git History Analysis

```bash
# Recent commits to deployment workflows
cd /var/www/flyer-crawler.projectium.com
git log --oneline -20 -- .gitea/workflows/

# Check what changed in PM2 configs
git log --oneline -10 -- ecosystem.config.cjs ecosystem-test.config.cjs

# Diff against last known good state
git diff <last-good-commit> -- .gitea/workflows/ ecosystem*.cjs
```

### Step 5: Timing Correlation

Create a timeline:

```text
| Time (UTC) | Event | Source |
|------------|-------|--------|
| XX:XX | Last successful health check | Monitoring |
| XX:XX | Deployment workflow started | Gitea Actions |
| XX:XX | First failed health check | Monitoring |
| XX:XX | Incident detected | User report / Alert |
| XX:XX | Investigation started | On-call |
```

### Common Root Causes

| Root Cause                   | Evidence                               | Prevention                   |
| ---------------------------- | -------------------------------------- | ---------------------------- |
| `pm2 stop all` in workflow   | Workflow logs show "all" command       | Use explicit process names   |
| `pm2 delete all` in workflow | Empty PM2 list after deploy            | Use whitelist-based deletion |
| OOM killer                   | `journalctl -k` shows "Killed process" | Increase memory limits       |
| Disk space exhaustion        | `df -h` shows 100%                     | Log rotation, cleanup        |
| Manual intervention          | Shell history shows pm2 commands       | Document all manual actions  |
| Concurrent deployments       | Multiple workflows at same time        | Implement deployment locks   |
| Workflow caching issue | Old workflow version executed | Force workflow refresh |
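
For the concurrent-deployment case, one possible form of deployment lock is a directory created with `mkdir`, which fails atomically if the directory already exists. This is only a sketch: the function names and the lock path are assumptions, and a crashed deployment would leave a stale lock that has to be removed by hand.

```bash
# Hypothetical deployment lock built on mkdir, which either creates the
# directory or fails atomically if it already exists.
# The lock path is an assumption; pick a location every workflow can reach.
DEPLOY_LOCK_DIR="${DEPLOY_LOCK_DIR:-/tmp/pm2-deploy.lock}"

acquire_deploy_lock() {
  if mkdir "$DEPLOY_LOCK_DIR" 2>/dev/null; then
    # Record the holder to help whoever finds a stale lock
    echo "$$ $(date -u +%Y-%m-%dT%H:%M:%SZ)" > "$DEPLOY_LOCK_DIR/owner"
    return 0
  fi
  echo "Another deployment holds $DEPLOY_LOCK_DIR" >&2
  return 1
}

release_deploy_lock() {
  rm -rf "$DEPLOY_LOCK_DIR"
}

# Example, at the top of a deployment script:
#   acquire_deploy_lock || exit 1
#   trap release_deploy_lock EXIT
```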

---

## Communication Templates

### Incident Notification (Internal)

```text
Subject: [P1 INCIDENT] PM2 Process Isolation Failure - Multiple Apps Down

Status: INVESTIGATING
Time Detected: YYYY-MM-DD HH:MM UTC
Affected Systems: [flyer-crawler-prod, stock-alert-prod, ...]

Summary:
All PM2 processes on projectium.com server were terminated unexpectedly.
Multiple production applications are currently down.

Impact:
- flyer-crawler.projectium.com: DOWN
- stock-alert.projectium.com: DOWN
- [other affected apps]

Current Actions:
- Restoring critical production processes
- Investigating root cause

Next Update: In 15 minutes or upon status change

Incident Commander: [Name]
```

### Status Update Template

```text
Subject: [P1 INCIDENT] PM2 Process Isolation Failure - UPDATE #N

Status: [INVESTIGATING | IDENTIFIED | RESTORING | RESOLVED]
Time: YYYY-MM-DD HH:MM UTC

Progress Since Last Update:
- [Action taken]
- [Discovery made]
- [Process restored]

Current State:
- flyer-crawler.projectium.com: [UP|DOWN]
- stock-alert.projectium.com: [UP|DOWN]

Root Cause: [If identified]

Next Steps:
- [Planned action]

ETA to Resolution: [If known]

Next Update: In [X] minutes
```

### Resolution Notification

```text
Subject: [RESOLVED] PM2 Process Isolation Failure

Status: RESOLVED
Time Resolved: YYYY-MM-DD HH:MM UTC
Total Downtime: X minutes

Summary:
All PM2 processes have been restored. Services are operating normally.

Root Cause:
[Brief description of what caused the incident]

Impact Summary:
- flyer-crawler.projectium.com: Down for X minutes
- stock-alert.projectium.com: Down for X minutes
- Estimated user impact: [description]

Immediate Actions Taken:
1. [Action]
2. [Action]

Follow-up Actions:
1. [ ] [Preventive measure] - Owner: [Name] - Due: [Date]
2. [ ] Post-incident review scheduled for [Date]

Post-Incident Review: [Link or scheduled time]
```

---

## Prevention Measures

### Pre-Deployment Checklist

Before triggering any deployment:

- [ ] Review workflow file for PM2 commands
- [ ] Confirm no `pm2 stop all`, `pm2 delete all`, or `pm2 restart all`
- [ ] Verify process names are explicitly listed
- [ ] Check for concurrent deployment risks
- [ ] Confirm recent workflow changes were reviewed
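
The check for `all` commands can be partly automated with `grep`. The sketch below is one hypothetical approach: `find_dangerous_pm2` is an illustrative name, and it only catches the literal `pm2 stop|delete|restart all` forms, not commands assembled from variables.

```bash
# Hypothetical pre-deployment check: list every line in the given files or
# directories that runs "pm2 stop|delete|restart all". Lines that are
# entirely comments are ignored.
# Returns 0 when nothing dangerous is found, 1 otherwise.
find_dangerous_pm2() {
  local hits
  # First grep: -r recurses, -n adds line numbers, -H always prints the file.
  # Second grep: drops matches where the line content starts with "#".
  hits="$(grep -rnHE 'pm2[[:space:]]+(stop|delete|restart)[[:space:]]+all([^[:alnum:]_-]|$)' "$@" \
    | grep -vE '^[^:]*:[0-9]+:[[:space:]]*#')" || true
  if [ -n "$hits" ]; then
    printf '%s\n' "$hits"
    return 1
  fi
  return 0
}

# Example, from the repository root:
#   find_dangerous_pm2 .gitea/workflows/ || echo "Review the lines above before deploying" >&2
```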

### Workflow Review Checklist

When reviewing deployment workflow changes:

- [ ] All PM2 `stop` commands use explicit process names
- [ ] All PM2 `delete` commands filter by process name pattern
- [ ] All PM2 `restart` commands use explicit process names
- [ ] Test deployments filter by `-test` suffix
- [ ] Production deployments use whitelist array

**Safe Patterns**:

```javascript
// Assumed setup (not part of the original excerpt): load the PM2 process list
const { exec, execSync } = require('child_process');
const list = JSON.parse(execSync('pm2 jlist').toString());

// SAFE: Explicit process names (production)
const prodProcesses = [
  'flyer-crawler-api',
  'flyer-crawler-worker',
  'flyer-crawler-analytics-worker',
];
list.forEach((p) => {
  if (
    (p.pm2_env.status === 'errored' || p.pm2_env.status === 'stopped') &&
    prodProcesses.includes(p.name)
  ) {
    exec('pm2 delete ' + p.pm2_env.pm_id);
  }
});

// SAFE: Pattern-based filtering (test)
list.forEach((p) => {
  if (p.name && p.name.endsWith('-test')) {
    exec('pm2 delete ' + p.pm2_env.pm_id);
  }
});
```

**Dangerous Patterns** (NEVER USE):

```bash
# DANGEROUS - affects ALL applications
pm2 stop all
pm2 delete all
pm2 restart all

# DANGEROUS - no name filtering
pm2 delete $(pm2 jlist | jq -r '.[] | select(.pm2_env.status == "errored") | .pm_id')
```
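
A shell counterpart to the name-filtered JavaScript pattern might look like the following. It is a sketch under assumptions: `select_deletable_ids` is a hypothetical helper, `jq` 1.5 or later must be installed, and the allow-list is passed as arguments.

```bash
# Hypothetical helper: read `pm2 jlist` JSON on stdin and print the pm_id of
# every errored or stopped process whose name is in the allow-list given as
# arguments. Processes belonging to other applications are never selected.
select_deletable_ids() {
  local allowed
  # Build a JSON array from the arguments, e.g. ["a","b"]
  allowed="$(printf '%s\n' "$@" | jq -R . | jq -s -c .)"
  jq -r --argjson allowed "$allowed" '
    .[]
    | select(.pm2_env.status == "errored" or .pm2_env.status == "stopped")
    | select(.name as $n | any($allowed[]; . == $n))
    | .pm_id
  '
}

# Example:
#   ids="$(pm2 jlist | select_deletable_ids \
#     flyer-crawler-api flyer-crawler-worker flyer-crawler-analytics-worker)"
#   if [ -n "$ids" ]; then pm2 delete $ids; fi
```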

### PM2 Configuration Validation

Before deploying PM2 config changes:

```bash
# Test configuration locally
cd /var/www/flyer-crawler.projectium.com
node -e "console.log(JSON.stringify(require('./ecosystem.config.cjs'), null, 2))"

# Verify process names
node -e "require('./ecosystem.config.cjs').apps.forEach(a => console.log(a.name))"

# Expected output should match documented process names
```

### Deployment Monitoring

After every deployment:

```bash
# Immediate verification
pm2 list

# Check no unexpected processes were affected
pm2 list | grep -v flyer-crawler
# Should still show other apps (e.g., stock-alert)

# Health check
curl -s https://flyer-crawler.projectium.com/api/health/ready | jq '.data.status'
```
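
Comparing the process list before and after a deployment makes an isolation violation visible immediately. A minimal sketch, assuming the two snapshots are plain text files with one process name per line: `lost_processes` and the `/tmp` paths are hypothetical, and the example invocation assumes `jq` is installed.

```bash
# Hypothetical helper: given two files that each hold one process name per
# line (captured before and after a deployment), print the names that
# disappeared. Returns 1 if any process went missing, 0 otherwise.
lost_processes() {
  local before="$1" after="$2" lost
  # -v inverts the match, -x matches whole lines, -F treats names literally,
  # -f reads the names to exclude from the "after" snapshot
  lost="$(grep -vxFf "$after" "$before")" || true
  if [ -n "$lost" ]; then
    printf '%s\n' "$lost"
    return 1
  fi
  return 0
}

# Example (requires jq):
#   pm2 jlist | jq -r '.[] | select(.pm2_env.status == "online") | .name' > /tmp/pm2-before.txt
#   ... run the deployment ...
#   pm2 jlist | jq -r '.[] | select(.pm2_env.status == "online") | .name' > /tmp/pm2-after.txt
#   lost_processes /tmp/pm2-before.txt /tmp/pm2-after.txt
```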

---

## Contact Information

### On-Call Escalation

| Role              | Contact        | When to Escalate                    |
| ----------------- | -------------- | ----------------------------------- |
| Primary On-Call   | [Name/Channel] | First responder                     |
| Secondary On-Call | [Name/Channel] | If primary unavailable after 10 min |
| Engineering Lead  | [Name/Channel] | P1 incidents > 30 min               |
| Product Owner     | [Name/Channel] | User communication needed           |

### External Dependencies

| Service         | Support Channel | When to Contact         |
| --------------- | --------------- | ----------------------- |
| Server Provider | [Contact info]  | Hardware/network issues |
| DNS Provider    | [Contact info]  | DNS resolution failures |
| SSL Certificate | [Contact info]  | Certificate issues      |

### Communication Channels

| Channel        | Purpose                    |
| -------------- | -------------------------- |
| `#incidents`   | Real-time incident updates |
| `#deployments` | Deployment announcements   |
| `#engineering` | Technical discussion       |
| Email list     | Formal notifications       |

---

## Post-Incident Review

### Incident Report Template

```markdown
# Incident Report: [Title]

## Overview

| Field              | Value             |
| ------------------ | ----------------- |
| Date               | YYYY-MM-DD        |
| Duration           | X hours Y minutes |
| Severity           | P1/P2/P3          |
| Incident Commander | [Name]            |
| Status             | Resolved          |

## Timeline

| Time (UTC) | Event               |
| ---------- | ------------------- |
| HH:MM      | [Event description] |
| HH:MM      | [Event description] |

## Impact

- **Users affected**: [Number/description]
- **Revenue impact**: [If applicable]
- **SLA impact**: [If applicable]

## Root Cause

[Detailed technical explanation]

## Resolution

[What was done to resolve the incident]

## Contributing Factors

1. [Factor]
2. [Factor]

## Action Items

| Action   | Owner  | Due Date | Status |
| -------- | ------ | -------- | ------ |
| [Action] | [Name] | [Date]   | [ ]    |

## Lessons Learned

### What Went Well

- [Item]

### What Could Be Improved

- [Item]

## Appendix

- Link to monitoring data
- Link to relevant logs
- Link to workflow runs
```

### Lessons Learned Format

Use the "5 Whys" technique:

```text
Problem: All PM2 processes were killed during deployment

Why 1: The deployment workflow ran `pm2 delete all`
Why 2: The workflow used an outdated version of the script
Why 3: Gitea runner cached the old workflow file
Why 4: No mechanism to verify workflow version before execution
Why 5: Workflow versioning and audit trail not implemented

Root Cause: Lack of workflow versioning and execution verification

Preventive Measure: Implement workflow hash logging and pre-execution verification
```
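
The preventive measure in this example, workflow hash logging, could be sketched as follows. The helper names are hypothetical, `sha256sum` (GNU coreutils) is assumed to be available on the runner, and where the expected hash comes from is left open.

```bash
# Hypothetical helper: print "<sha256>  <file>" for each workflow file so the
# deployment log records exactly which version ran.
log_workflow_hashes() {
  local f
  for f in "$@"; do
    if [ -f "$f" ]; then
      sha256sum "$f"
    else
      echo "MISSING  $f" >&2
    fi
  done
}

# Hypothetical check: succeed only if FILE currently hashes to EXPECTED.
verify_workflow_hash() {
  local expected="$1" file="$2" actual
  actual="$(sha256sum "$file" | cut -d' ' -f1)"
  [ "$actual" = "$expected" ]
}

# Example, as an early step in a deployment workflow:
#   log_workflow_hashes .gitea/workflows/deploy-to-prod.yml
#   verify_workflow_hash "$EXPECTED_HASH" .gitea/workflows/deploy-to-prod.yml || exit 1
```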

### Action Items Tracking

Create Gitea issues for each action item:

```bash
# Example using the Gitea REST API. <gitea-host>, <owner>, <repo>, and
# GITEA_TOKEN are placeholders. The API takes label IDs rather than names,
# so add the incident-follow-up and priority:high labels afterwards.
curl -s -X POST "https://<gitea-host>/api/v1/repos/<owner>/<repo>/issues" \
  -H "Authorization: token $GITEA_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"title": "Implement PM2 state logging in deployment workflows", "body": "Related to incident YYYY-MM-DD. Add pre-deployment PM2 state capture."}'
```

Track action items in a central location:

| Issue # | Action                           | Owner  | Due    | Status |
| ------- | -------------------------------- | ------ | ------ | ------ |
| #123    | Add PM2 state logging            | [Name] | [Date] | Open   |
| #124    | Implement workflow version hash  | [Name] | [Date] | Open   |
| #125    | Create deployment lock mechanism | [Name] | [Date] | Open   |

---

## Appendix: PM2 Command Reference

### Safe Commands

```bash
# Status and monitoring
pm2 list
pm2 show <process-name>
pm2 monit
pm2 logs <process-name>

# Restart specific processes
pm2 restart flyer-crawler-api
pm2 restart flyer-crawler-api flyer-crawler-worker flyer-crawler-analytics-worker

# Reload (zero-downtime, cluster mode only)
pm2 reload flyer-crawler-api

# Start from config
pm2 start ecosystem.config.cjs
pm2 start ecosystem.config.cjs --only flyer-crawler-api
```

### Dangerous Commands (Use With Caution)

```bash
# CAUTION: These affect ALL processes
pm2 stop all      # Stops every PM2 process
pm2 restart all   # Restarts every PM2 process
pm2 delete all    # Removes every PM2 process

# CAUTION: Modifies saved process list
pm2 save          # Overwrites saved process list
pm2 resurrect     # Restores from saved list

# CAUTION: Affects PM2 daemon
pm2 kill          # Kills PM2 daemon and all processes
pm2 update        # Restarts the PM2 daemon on the installed version (restarts all processes, brief outage)
```

---

## Revision History

| Date       | Author                 | Change                   |
| ---------- | ---------------------- | ------------------------ |
| 2026-02-17 | Incident Response Team | Initial runbook creation |