debug: add PM2 crash debugging tools
2026-02-18 09:43:11 -08:00
parent cf2cc5b832
commit cd8ee92813
4 changed files with 605 additions and 6 deletions

# PM2 Crash Debugging Guide
## Overview
This guide helps diagnose PM2 daemon crashes and identify which project is causing the issue.
## Common Symptoms
1. **PM2 processes disappear** between deployments
2. **`ENOENT: no such file or directory, uv_cwd`** errors in PM2 logs
3. **Processes require `pm2 resurrect`** after deployments
4. **PM2 daemon restarts** unexpectedly
## Root Cause
PM2 processes crash when their working directory (CWD) is deleted or modified while they're running. This typically happens when:
1. **rsync --delete** removes/recreates directories while processes are active
2. **npm install** modifies node_modules while processes are using them
3. **Deployments** don't stop processes before file operations
## Debugging Tools
### 1. PM2 Diagnostics Workflow
Run the comprehensive diagnostics workflow:
```bash
# In Gitea Actions UI:
# 1. Go to Actions → "PM2 Diagnostics"
# 2. Click "Run workflow"
# 3. Choose monitoring duration (default: 60s)
```
This workflow captures:
- Current PM2 state
- Working directory validation
- PM2 daemon logs
- All PM2-managed projects
- Crash patterns
- Deployment script analysis
### 2. PM2 Crash Analysis Script
Run the crash analysis script on the server:
```bash
# SSH to server
ssh gitea-runner@projectium.com
# Run analysis
cd /var/www/flyer-crawler.projectium.com
bash scripts/analyze-pm2-crashes.sh
# Or save to file
bash scripts/analyze-pm2-crashes.sh > pm2-crash-report.txt
```
### 3. Manual PM2 Inspection
Quick manual checks:
```bash
# Current PM2 state
pm2 list
# Detailed JSON state
pm2 jlist | jq '.'
# Check for missing CWDs
pm2 jlist | jq -r '.[] | "\(.name): \(.pm2_env.pm_cwd)"' | while IFS= read -r line; do
  PROC=$(echo "$line" | cut -d: -f1)
  CWD=$(echo "$line" | cut -d: -f2- | xargs)
  [ -d "$CWD" ] && echo "$PROC" || echo "$PROC (CWD missing: $CWD)"
done
# View PM2 daemon log
tail -100 ~/.pm2/pm2.log
# Search for ENOENT errors
grep -i "ENOENT\|uv_cwd" ~/.pm2/pm2.log
```
## Identifying the Problematic Project
### Check Which Projects Share PM2 Daemon
```bash
pm2 list
# Group by project
pm2 jlist | jq -r '.[].name' | sed -E 's/-(api|worker)$//' | sort -u  # strip role suffixes to group by project
```
**Projects on projectium.com:**
- `flyer-crawler` (production, test)
- `stock-alert` (production, test)
- Any other apps whose processes appear in `pm2 list`
### Check Deployment Timing
1. Review PM2 daemon restart times:
```bash
grep "New PM2 Daemon started" ~/.pm2/pm2.log
```
2. Compare with deployment times in Gitea Actions
3. Identify which deployment triggered the crash
### Check Deployment Scripts
For each project, check if deployment stops PM2 before rsync:
```bash
# Flyer-crawler
cat /var/www/flyer-crawler.projectium.com/.gitea/workflows/deploy-to-prod.yml | grep -B5 -A5 "rsync.*--delete"
# Stock-alert
cat /var/www/stock-alert.projectium.com/.gitea/workflows/deploy-to-prod.yml | grep -B5 -A5 "rsync.*--delete"
```
**Look for:**
- ❌ `rsync --delete` **before** `pm2 stop`
- ✅ `pm2 stop` **before** `rsync --delete`
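This ordering check can be scripted. A sketch, assuming the `/var/www/*/.gitea/workflows/` layout used above (the glob and the `check_order` helper are assumptions):

```shell
# check_order FILE: report whether `pm2 stop` appears before the first
# `rsync --delete` in a workflow file.
check_order() {
  local wf="$1" stop rsync
  stop=$(grep -n 'pm2 stop' "$wf" | head -1 | cut -d: -f1)
  rsync=$(grep -n 'rsync.*--delete' "$wf" | head -1 | cut -d: -f1)
  if [ -n "$stop" ] && [ -n "$rsync" ] && [ "$stop" -lt "$rsync" ]; then
    echo "OK: $wf stops PM2 before rsync"
  else
    echo "BAD: $wf may deploy files over running processes"
  fi
}

# Check every deployed project (glob path is an assumption).
for wf in /var/www/*/.gitea/workflows/deploy-to-prod.yml; do
  [ -f "$wf" ] && check_order "$wf"
done
```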
## Common Culprits
### 1. Flyer-Crawler Deployments
**Before Fix:**
```yaml
# ❌ BAD - deploys files while processes are still running
- name: Deploy Application
  run: |
    rsync --delete ./ /var/www/...
    pm2 restart ...
```
**After Fix:**
```yaml
# ✅ GOOD - stops processes first
- name: Deploy Application
  run: |
    pm2 stop flyer-crawler-api flyer-crawler-worker
    rsync --delete ./ /var/www/...
    pm2 startOrReload ...
```
### 2. Stock-Alert Deployments
Check if stock-alert follows the same pattern. If it deploys without stopping PM2, it could crash the shared PM2 daemon.
### 3. Cross-Project Interference
If multiple projects share PM2:
- One project's deployment can crash another project's processes
- The crashed project's processes lose their CWD
- PM2 daemon may restart, clearing all processes
## Solutions
### Immediate Fix (Manual)
```bash
# Restore processes from dump file
pm2 resurrect
# Verify all processes are running
pm2 list
```
### Permanent Fix
1. **Update deployment workflows** to stop PM2 before file operations
2. **Isolate PM2 daemons** by user or namespace
3. **Monitor deployments** to ensure proper sequencing
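For option 2, PM2 honors the `PM2_HOME` environment variable when locating its daemon socket and dump file, so each project (or user) can get a private daemon. A sketch (the directory name is illustrative):

```shell
# Point PM2 at a per-project home so one project's deployment cannot
# crash another project's daemon (the directory name is illustrative).
export PM2_HOME="$HOME/.pm2-my-app"
mkdir -p "$PM2_HOME"
# Every pm2 command now talks to (or spawns) a daemon private to this home.
command -v pm2 >/dev/null && pm2 list
echo "PM2_HOME=$PM2_HOME"
```

Deployment workflows must export the same `PM2_HOME`, or they will address the default shared daemon.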
## Deployment Workflow Template
**Correct sequence:**
```yaml
- name: Deploy Application
  run: |
    # 1. STOP PROCESSES FIRST
    pm2 stop my-api my-worker
    # 2. THEN deploy files
    rsync -avz --delete ./ /var/www/my-app/
    # 3. Install dependencies (safe, no processes running)
    cd /var/www/my-app
    npm install --omit=dev
    # 4. Clean up errored processes
    pm2 delete my-api my-worker || true
    # 5. START processes
    pm2 startOrReload ecosystem.config.cjs
    pm2 save
```
## Monitoring & Prevention
### Enable Verbose Logging
Enhanced deployment logging (already implemented in flyer-crawler):
```yaml
- name: Deploy Application
  run: |
    set -x  # command tracing
    echo "Step 1: Stopping PM2..."
    pm2 stop ...
    pm2 list   # verify stopped
    echo "Step 2: Deploying files..."
    rsync --delete ...
    echo "Step 3: Starting PM2..."
    pm2 start ...
    pm2 list   # verify started
```
### Regular Health Checks
```bash
# Add to cron or monitoring system
*/5 * * * * pm2 jlist | jq -r '.[] | select(.pm2_env.status != "online") | "ALERT: \(.name) is \(.pm2_env.status)"'
```
## Troubleshooting Decision Tree
```
PM2 processes missing?
├─ YES → Run `pm2 resurrect`
│   └─ Check PM2 daemon log for ENOENT errors
│       ├─ ENOENT found → working directory deleted during deployment
│       │   └─ Fix: add `pm2 stop` before rsync
│       └─ No ENOENT → check other error patterns
└─ NO → Processes running but unstable?
    └─ Check restart counts: `pm2 jlist | jq '.[].pm2_env.restart_time'`
        └─ High restarts → application-level issue (not a PM2 crash)
```
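For the restart-count branch, sorting processes by restart count makes the outlier obvious. A sketch with the jq filter factored out so it works on any `pm2 jlist`-shaped JSON:

```shell
# Highest restart counts first; a sudden outlier points at the app,
# not at a PM2 daemon crash.
filter='.[] | "\(.pm2_env.restart_time)\t\(.name)"'
pm2 jlist 2>/dev/null | jq -r "$filter" | sort -rn
```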
## Related Documentation
- [PM2 Process Isolation Requirements](../../CLAUDE.md#pm2-process-isolation-productiontest-servers)
- [PM2 Incident Response Runbook](./PM2-INCIDENT-RESPONSE.md)
- [Incident Report 2026-02-17](./INCIDENT-2026-02-17-PM2-PROCESS-KILL.md)
## Quick Reference Commands
```bash
# Diagnose
pm2 list # Current state
pm2 jlist | jq '.' # Detailed JSON
tail -100 ~/.pm2/pm2.log # Recent logs
grep ENOENT ~/.pm2/pm2.log # Find crashes
# Fix
pm2 resurrect # Restore from dump
pm2 restart all # Restart everything
pm2 save # Save current state
# Analyze
bash scripts/analyze-pm2-crashes.sh # Run analysis script
pm2 jlist | jq -r '.[].pm2_env.pm_cwd' # Check working dirs
```