debug: add PM2 crash debugging tools
Some checks failed
Deploy to Test Environment / deploy-to-test (push) Has been cancelled
278
docs/operations/PM2-CRASH-DEBUGGING.md
Normal file
@@ -0,0 +1,278 @@
# PM2 Crash Debugging Guide

## Overview

This guide helps diagnose PM2 daemon crashes and identify which project is causing the issue.

## Common Symptoms

1. **PM2 processes disappear** between deployments
2. **`ENOENT: no such file or directory, uv_cwd`** errors in PM2 logs
3. **Processes require `pm2 resurrect`** after deployments
4. **PM2 daemon restarts** unexpectedly

## Root Cause

PM2 processes crash when their working directory (CWD) is deleted or replaced while they are running. Node.js can then no longer resolve the current directory, which surfaces as the `uv_cwd` ENOENT error listed under Common Symptoms. This typically happens when:

1. **rsync --delete** removes/recreates directories while processes are active
2. **npm install** modifies node_modules while processes are using them
3. **Deployments** don't stop processes before file operations

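
The failure mode can be reproduced without PM2. The following is a minimal sketch, not part of the repository's tooling: it creates a throwaway temp directory, enters it in a subshell, and removes it, with `rmdir` standing in for `rsync --delete`. The function names `cwd_state` and `simulate_deleted_cwd` are made up for this illustration.

```bash
#!/usr/bin/env bash
# Reproduce the "working directory deleted underneath a process" state
# in a throwaway temp directory. Nothing here touches PM2 or /var/www.

cwd_state() {
  # Report whether the path recorded in $PWD still exists on disk.
  if [ -d "$PWD" ]; then
    echo "cwd ok"
  else
    echo "cwd missing"
  fi
}

simulate_deleted_cwd() {
  local dir
  dir=$(mktemp -d) || return 1
  (
    # Subshell: the caller's own working directory is never changed.
    cd "$dir" || exit 1
    cwd_state        # before removal
    rmdir "$dir"     # stands in for rsync --delete removing the directory
    cwd_state        # after removal
  )
}

simulate_deleted_cwd
```

The subshell keeps running after its directory is gone, but the path it started in no longer resolves. A Node.js process in the same state hits the `uv_cwd` ENOENT error as soon as it asks for its current directory.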
## Debugging Tools

### 1. PM2 Diagnostics Workflow

Run the comprehensive diagnostics workflow:

```bash
# In Gitea Actions UI:
# 1. Go to Actions → "PM2 Diagnostics"
# 2. Click "Run workflow"
# 3. Choose monitoring duration (default: 60s)
```

This workflow captures:

- Current PM2 state
- Working directory validation
- PM2 daemon logs
- All PM2-managed projects
- Crash patterns
- Deployment script analysis

### 2. PM2 Crash Analysis Script

Run the crash analysis script on the server:

```bash
# SSH to server
ssh gitea-runner@projectium.com

# Run analysis
cd /var/www/flyer-crawler.projectium.com
bash scripts/analyze-pm2-crashes.sh

# Or save to file
bash scripts/analyze-pm2-crashes.sh > pm2-crash-report.txt
```

### 3. Manual PM2 Inspection

Quick manual checks:

```bash
# Current PM2 state
pm2 list

# Detailed JSON state
pm2 jlist | jq '.'

# Check for missing CWDs (tab-separated so names and paths may contain ":" or spaces)
pm2 jlist | jq -r '.[] | [.name, .pm2_env.pm_cwd] | @tsv' | while IFS=$'\t' read -r PROC CWD; do
  [ -d "$CWD" ] && echo "✅ $PROC" || echo "❌ $PROC (CWD missing: $CWD)"
done

# View PM2 daemon log
tail -100 ~/.pm2/pm2.log

# Search for ENOENT errors
grep -iE "ENOENT|uv_cwd" ~/.pm2/pm2.log
```

## Identifying the Problematic Project

### Check Which Projects Share PM2 Daemon

```bash
pm2 list

# List unique process names (leading lowercase letters and hyphens only),
# which makes it easier to see which projects are registered
pm2 jlist | jq -r '.[] | .name' | grep -oE "^[a-z-]+" | sort -u
```

**Projects on projectium.com:**

- `flyer-crawler` (production, test)
- `stock-alert` (production, test)
- Possibly others (not yet confirmed)

### Check Deployment Timing

1. Review PM2 daemon restart times:

   ```bash
   grep "New PM2 Daemon started" ~/.pm2/pm2.log
   ```

2. Compare with deployment times in Gitea Actions

3. Identify which deployment triggered the crash

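
To make the comparison with Gitea Actions run times easier, the restart entries can be reduced to bare timestamps. This is a sketch under one assumption: each line in `pm2.log` starts with a timestamp followed by `: ` (for example `2026-02-17T10:15:02: PM2 log: ...`, a made-up sample). If the log on the server uses a different prefix, the `sed` pattern needs adjusting. The function name `daemon_start_times` is hypothetical.

```bash
#!/usr/bin/env bash
# Print the timestamp of every "New PM2 Daemon started" entry in a PM2 log.
# Assumption: each log line starts with a timestamp followed by ": ".

daemon_start_times() {
  # Default to the standard daemon log location; pass a path to override.
  local log_file=${1:-"$HOME/.pm2/pm2.log"}
  # Keep only the leading timestamp of each matching line.
  grep "New PM2 Daemon started" "$log_file" | sed -E 's/^([^ ]+): .*/\1/'
}
```

Calling `daemon_start_times` with no argument reads `~/.pm2/pm2.log`; passing a path (for example a copy of the log pulled from the server) analyzes that file instead.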
### Check Deployment Scripts

For each project, check whether the deployment stops its PM2 processes before running rsync:

```bash
# Flyer-crawler
grep -B5 -A5 "rsync.*--delete" /var/www/flyer-crawler.projectium.com/.gitea/workflows/deploy-to-prod.yml

# Stock-alert
grep -B5 -A5 "rsync.*--delete" /var/www/stock-alert.projectium.com/.gitea/workflows/deploy-to-prod.yml
```

**Look for:**

- ❌ `rsync --delete` **before** `pm2 stop`
- ✅ `pm2 stop` **before** `rsync --delete`

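
The ordering check can also be scripted. The sketch below is a line-order heuristic only: it compares the line number of the first `pm2 stop` with the first `rsync ... --delete` in a file. It does not parse YAML, does not understand job or step boundaries, and will also match commented-out lines, so treat its output as a hint for manual review. The function name `check_deploy_order` and its result strings are made up for this example.

```bash
#!/usr/bin/env bash
# Report whether a workflow file mentions `pm2 stop` before its first
# `rsync ... --delete`. Heuristic based on line order, not a YAML parser.

check_deploy_order() {
  local file=$1 stop_line rsync_line
  # Line number of the first matching line; empty when nothing matches.
  stop_line=$(grep -n -m1 'pm2 stop' "$file" | cut -d: -f1)
  rsync_line=$(grep -n -m1 -e 'rsync.*--delete' "$file" | cut -d: -f1)

  if [ -z "$rsync_line" ]; then
    echo "no-rsync-delete"
  elif [ -z "$stop_line" ]; then
    echo "unsafe: no pm2 stop found"
  elif [ "$stop_line" -lt "$rsync_line" ]; then
    echo "ok"
  else
    echo "unsafe: rsync --delete before pm2 stop"
  fi
}
```

It takes a single path, for example `check_deploy_order .gitea/workflows/deploy-to-prod.yml`, and prints one of the four result strings.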
## Common Culprits

### 1. Flyer-Crawler Deployments

**Before Fix:**

```yaml
# ❌ BAD - Deploys files while processes running
- name: Deploy Application
  run: |
    rsync --delete ./ /var/www/...
    pm2 restart ...
```

**After Fix:**

```yaml
# ✅ GOOD - Stops processes first
- name: Deploy Application
  run: |
    pm2 stop flyer-crawler-api flyer-crawler-worker
    rsync --delete ./ /var/www/...
    pm2 startOrReload ...
```

### 2. Stock-Alert Deployments

Check whether stock-alert follows the same pattern. If it deploys files without first stopping its PM2 processes, it could crash the shared PM2 daemon.

### 3. Cross-Project Interference

If multiple projects share one PM2 daemon:

- One project's deployment can crash another project's processes
- The crashed project's processes lose their CWD
- The PM2 daemon may restart, clearing all processes

## Solutions

### Immediate Fix (Manual)

```bash
# Restore processes from dump file
pm2 resurrect

# Verify all processes are running
pm2 list
```

### Permanent Fix

1. **Update deployment workflows** to stop PM2 before file operations
2. **Isolate PM2 daemons** by user or namespace
3. **Monitor deployments** to ensure proper sequencing

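
One way to isolate daemons without creating separate system users is PM2's `PM2_HOME` environment variable: PM2 keeps its socket, logs, and dump file under that directory, so commands run with different `PM2_HOME` values talk to different daemons. The sketch below illustrates the idea and is not the setup currently used on the server. The `~/.pm2-projects` layout, the `PM2_BASE_DIR` variable, and the helper names `pm2_home_for` and `pm2_for` are all assumptions for this example.

```bash
#!/usr/bin/env bash
# Sketch: give each project its own PM2 daemon by pointing PM2_HOME at a
# per-project directory. The directory layout below is hypothetical.

pm2_home_for() {
  # Map a project name to a dedicated PM2 home directory.
  local project=$1
  echo "${PM2_BASE_DIR:-$HOME/.pm2-projects}/$project"
}

pm2_for() {
  # Run a pm2 command against the daemon that belongs to one project.
  local project=$1
  shift
  PM2_HOME="$(pm2_home_for "$project")" pm2 "$@"
}
```

With these helpers, `pm2_for flyer-crawler list` and `pm2_for stock-alert list` would each show only their own processes. Note that `pm2 save` and `pm2 resurrect` also become per-daemon, so every deployment and startup script has to use the same `PM2_HOME` for its project.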
## Deployment Workflow Template

**Correct sequence:**

```yaml
- name: Deploy Application
  run: |
    # 1. STOP PROCESSES FIRST
    # (|| true so a first deploy, when the processes do not exist yet, does not abort the step)
    pm2 stop my-api my-worker || true

    # 2. THEN deploy files
    rsync -avz --delete ./ /var/www/my-app/

    # 3. Install dependencies (safe, no processes running)
    cd /var/www/my-app
    npm install --omit=dev

    # 4. Clean up errored processes
    pm2 delete my-api my-worker || true

    # 5. START processes
    pm2 startOrReload ecosystem.config.cjs
    pm2 save
```

## Monitoring & Prevention

### Enable Verbose Logging

Enhanced deployment logging (already implemented in flyer-crawler):

```yaml
- name: Deploy Application
  run: |
    set -x  # Command tracing
    echo "Step 1: Stopping PM2..."
    pm2 stop ...
    pm2 list  # Verify stopped

    echo "Step 2: Deploying files..."
    rsync --delete ...

    echo "Step 3: Starting PM2..."
    pm2 start ...
    pm2 list  # Verify started
```

### Regular Health Checks

```bash
# Add to cron or monitoring system
*/5 * * * * pm2 jlist | jq -r '.[] | select(.pm2_env.status != "online") | "ALERT: \(.name) is \(.pm2_env.status)"'
```

## Troubleshooting Decision Tree

```
PM2 processes missing?
├─ YES → Run `pm2 resurrect`
│   └─ Check PM2 daemon log for ENOENT errors
│       ├─ ENOENT found → Working directory deleted during deployment
│       │   └─ Fix: Add `pm2 stop` before rsync
│       └─ No ENOENT → Check other error patterns
│
└─ NO → Processes running but unstable?
    └─ Check restart counts: `pm2 jlist | jq '.[].pm2_env.restart_time'`
        └─ High restarts → Application-level issue (not PM2 crash)
```

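
The log-checking branch of the tree can be sketched as a small script. It only matches the `ENOENT` and `uv_cwd` patterns already used elsewhere in this guide and cannot tell which deployment caused them. The function name `classify_pm2_log` and its result strings are made up for this example.

```bash
#!/usr/bin/env bash
# ENOENT branch of the decision tree: does the PM2 daemon log contain
# ENOENT / uv_cwd errors? Pattern matching only.

classify_pm2_log() {
  # Default to the standard daemon log location; pass a path to override.
  local log_file=${1:-"$HOME/.pm2/pm2.log"}

  if [ ! -r "$log_file" ]; then
    echo "log-unreadable: $log_file"
    return 1
  fi

  if grep -qiE 'ENOENT|uv_cwd' "$log_file"; then
    echo "enoent-found: working directory deleted during deployment; add pm2 stop before rsync"
  else
    echo "no-enoent: check other error patterns"
  fi
}
```

Run it as `classify_pm2_log` after `pm2 resurrect`, or pass the path of a saved log copy as the first argument.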
## Related Documentation

- [PM2 Process Isolation Requirements](../../CLAUDE.md#pm2-process-isolation-productiontest-servers)
- [PM2 Incident Response Runbook](./PM2-INCIDENT-RESPONSE.md)
- [Incident Report 2026-02-17](./INCIDENT-2026-02-17-PM2-PROCESS-KILL.md)

## Quick Reference Commands

```bash
# Diagnose
pm2 list                                # Current state
pm2 jlist | jq '.'                      # Detailed JSON
tail -100 ~/.pm2/pm2.log                # Recent logs
grep ENOENT ~/.pm2/pm2.log              # Find crashes

# Fix
pm2 resurrect                           # Restore from dump
pm2 restart all                         # Restart everything
pm2 save                                # Save current state

# Analyze
bash scripts/analyze-pm2-crashes.sh     # Run analysis script
pm2 jlist | jq -r '.[].pm2_env.pm_cwd'  # Check working dirs
```