PM2 Crash Debugging Guide

Overview

This guide helps diagnose PM2 daemon crashes and identify which project is causing the issue.

Common Symptoms

  1. PM2 processes disappear between deployments
  2. ENOENT: no such file or directory, uv_cwd errors in PM2 logs
  3. Processes require pm2 resurrect after deployments
  4. PM2 daemon restarts unexpectedly

Root Cause

PM2 processes crash when their working directory (CWD) is deleted or modified while they're running (a short reproduction follows this list). This typically happens when:

  1. rsync --delete removes/recreates directories while processes are active
  2. npm install modifies node_modules while processes are using them
  3. Deployments don't stop processes before file operations
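
The ENOENT / uv_cwd error comes from Node itself: once a process's working directory has been unlinked, process.cwd() fails. A minimal sketch that reproduces the error outside PM2 (the /tmp/cwd-demo path is just a throwaway example):

# Start a process, then delete its working directory out from under it
mkdir -p /tmp/cwd-demo && cd /tmp/cwd-demo
node -e "setInterval(() => console.log(process.cwd()), 1000)" &
sleep 2
rm -rf /tmp/cwd-demo
# The next process.cwd() call throws:
#   Error: ENOENT: no such file or directory, uv_cwd
# which is the same error the PM2 daemon logs when a deployment deletes a
# managed process's CWD.
cd /tmp   # the shell's own CWD is gone as well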

Debugging Tools

1. PM2 Diagnostics Workflow

Run the comprehensive diagnostics workflow:

# In Gitea Actions UI:
# 1. Go to Actions → "PM2 Diagnostics"
# 2. Click "Run workflow"
# 3. Choose monitoring duration (default: 60s)

This workflow captures:

  • Current PM2 state
  • Working directory validation
  • PM2 daemon logs
  • All PM2-managed projects
  • Crash patterns
  • Deployment script analysis

2. PM2 Crash Analysis Script

Run the crash analysis script on the server:

# SSH to server
ssh gitea-runner@projectium.com

# Run analysis
cd /var/www/flyer-crawler.projectium.com
bash scripts/analyze-pm2-crashes.sh

# Or save to file
bash scripts/analyze-pm2-crashes.sh > pm2-crash-report.txt

3. Manual PM2 Inspection

Quick manual checks:

# Current PM2 state
pm2 list

# Detailed JSON state
pm2 jlist | jq '.'

# Check for missing CWDs
pm2 jlist | jq -r '.[] | "\(.name): \(.pm2_env.pm_cwd)"' | while read -r line; do
    PROC=$(echo "$line" | cut -d: -f1)
    CWD=$(echo "$line" | cut -d: -f2- | xargs)
    [ -d "$CWD" ] && echo "✅ $PROC" || echo "❌ $PROC (CWD missing: $CWD)"
done

# View PM2 daemon log
tail -100 ~/.pm2/pm2.log

# Search for ENOENT errors
grep -i "ENOENT\|uv_cwd" ~/.pm2/pm2.log

Identifying the Problematic Project

Check Which Projects Share PM2 Daemon

pm2 list

# Group by project (assumes process names look like <project>-api / <project>-worker)
pm2 jlist | jq -r '.[].name' | sed -E 's/-(api|worker).*$//' | sort -u
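
To confirm that these projects really do share one daemon, look at the daemon process itself; its process title includes the PM2_HOME it serves (the exact title varies by PM2 version, so treat this as a sketch):

# One daemon serving every project shows up as a single "God Daemon" process
pgrep -af "God Daemon"
# e.g.  12345 PM2 v5.x.x: God Daemon (/home/gitea-runner/.pm2)

# Every app listed under that PM2_HOME depends on this one daemon
pm2 jlist | jq -r '.[].name'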

Projects on projectium.com:

  • flyer-crawler (production, test)
  • stock-alert (production, test)
  • Others?

Check Deployment Timing

  1. Review PM2 daemon restart times:

    grep "New PM2 Daemon started" ~/.pm2/pm2.log
    
  2. Compare with deployment times in Gitea Actions

  3. Identify which deployment triggered the crash
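
A quick way to do the comparison from the server side (a sketch; it assumes PM2's default timestamp prefix on log lines, and the paths are examples):

# Daemon restart timestamps from the PM2 log
grep "New PM2 Daemon started" ~/.pm2/pm2.log | awk '{print $1}'

# Rough deployment times: when each deployed tree was last written to
stat -c '%y %n' /var/www/flyer-crawler.projectium.com /var/www/stock-alert.projectium.com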

Check Deployment Scripts

For each project, check if deployment stops PM2 before rsync:

# Flyer-crawler
cat /var/www/flyer-crawler.projectium.com/.gitea/workflows/deploy-to-prod.yml | grep -B5 -A5 "rsync.*--delete"

# Stock-alert
cat /var/www/stock-alert.projectium.com/.gitea/workflows/deploy-to-prod.yml | grep -B5 -A5 "rsync.*--delete"

Look for:

  • ❌ rsync --delete before pm2 stop (the ordering that crashes processes)
  • ✅ pm2 stop before rsync --delete (the ordering you want)

Common Culprits

1. Flyer-Crawler Deployments

Before Fix:

# ❌ BAD - Deploys files while processes running
- name: Deploy Application
  run: |
    rsync --delete ./ /var/www/...
    pm2 restart ...

After Fix:

# ✅ GOOD - Stops processes first
- name: Deploy Application
  run: |
    pm2 stop flyer-crawler-api flyer-crawler-worker
    rsync --delete ./ /var/www/...
    pm2 startOrReload ...

2. Stock-Alert Deployments

Check if stock-alert follows the same pattern. If it deploys without stopping PM2, it could crash the shared PM2 daemon.

3. Cross-Project Interference

If multiple projects share PM2:

  • One project's deployment can crash another project's processes
  • The crashed project's processes lose their CWD
  • PM2 daemon may restart, clearing all processes
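
Because a daemon restart wipes every project's process list, it is worth checking that the dump file pm2 resurrect relies on actually exists and is reasonably fresh (a quick sketch):

ls -l ~/.pm2/dump.pm2   # the file pm2 resurrect restores from
pm2 save                # refresh it whenever the full set of processes is healthy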

Solutions

Immediate Fix (Manual)

# Restore processes from dump file
pm2 resurrect

# Verify all processes are running
pm2 list

Permanent Fix

  1. Update deployment workflows to stop PM2 before file operations
  2. Isolate PM2 daemons by user or namespace (see the PM2_HOME sketch below)
  3. Monitor deployments to ensure proper sequencing
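
For item 2, one way to isolate daemons is to give each project its own PM2_HOME (a sketch; the paths are examples, and every later pm2 command for that project, including the boot-time resurrect, must use the matching PM2_HOME):

# flyer-crawler deployments
export PM2_HOME=/home/gitea-runner/.pm2-flyer-crawler
pm2 startOrReload ecosystem.config.cjs
pm2 save

# stock-alert deployments use a separate daemon and dump file
export PM2_HOME=/home/gitea-runner/.pm2-stock-alert
pm2 startOrReload ecosystem.config.cjs
pm2 save

With this layout, one project's deployment can no longer crash the other project's daemon or clear its process list.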

Deployment Workflow Template

Correct sequence:

- name: Deploy Application
  run: |
    # 1. STOP PROCESSES FIRST (|| true keeps a first-time deploy from failing)
    pm2 stop my-api my-worker || true

    # 2. THEN deploy files
    rsync -avz --delete ./ /var/www/my-app/

    # 3. Install dependencies (safe, no processes running)
    cd /var/www/my-app
    npm install --omit=dev

    # 4. Clean up errored processes
    pm2 delete my-api my-worker || true

    # 5. START processes
    pm2 startOrReload ecosystem.config.cjs
    pm2 save

Monitoring & Prevention

Enable Verbose Logging

Enhanced deployment logging (already implemented in flyer-crawler):

- name: Deploy Application
  run: |
    set -x  # Command tracing
    echo "Step 1: Stopping PM2..."
    pm2 stop ...
    pm2 list  # Verify stopped

    echo "Step 2: Deploying files..."
    rsync --delete ...

    echo "Step 3: Starting PM2..."
    pm2 start ...
    pm2 list  # Verify started

Regular Health Checks

# Add to cron or monitoring system
*/5 * * * * pm2 jlist | jq -r '.[] | select(.pm2_env.status != "online") | "ALERT: \(.name) is \(.pm2_env.status)"'
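
A slightly more complete check that could back that cron entry (a sketch; the script name and log path are examples, not part of the repository):

#!/usr/bin/env bash
# pm2-health-check.sh: flag processes that are offline or have lost their CWD
set -euo pipefail

LOG=/var/log/pm2-health-check.log

# Processes that are not online
pm2 jlist | jq -r '.[] | select(.pm2_env.status != "online")
    | "\(now | todate) ALERT: \(.name) is \(.pm2_env.status)"' >> "$LOG"

# Processes whose working directory has disappeared
pm2 jlist | jq -r '.[] | "\(.name) \(.pm2_env.pm_cwd)"' | while read -r name cwd; do
    [ -d "$cwd" ] || echo "$(date -Is) ALERT: $name CWD missing: $cwd" >> "$LOG"
done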

Troubleshooting Decision Tree

PM2 processes missing?
├─ YES → Run `pm2 resurrect`
│        └─ Check PM2 daemon log for ENOENT errors
│           ├─ ENOENT found → Working directory deleted during deployment
│           │                 └─ Fix: Add `pm2 stop` before rsync
│           └─ No ENOENT → Check other error patterns
│
└─ NO → Processes running but unstable?
         └─ Check restart counts: `pm2 jlist | jq '.[].pm2_env.restart_time'`
            └─ High restarts → Application-level issue (not PM2 crash)
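
To see which process is accumulating restarts, pair names with counts rather than the bare counts shown in the tree:

pm2 jlist | jq -r '.[] | "\(.name): \(.pm2_env.restart_time) restarts"'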

Quick Reference Commands

# Diagnose
pm2 list                                    # Current state
pm2 jlist | jq '.'                          # Detailed JSON
tail -100 ~/.pm2/pm2.log                    # Recent logs
grep ENOENT ~/.pm2/pm2.log                  # Find crashes

# Fix
pm2 resurrect                               # Restore from dump
pm2 restart all                             # Restart everything
pm2 save                                    # Save current state

# Analyze
bash scripts/analyze-pm2-crashes.sh         # Run analysis script
pm2 jlist | jq -r '.[].pm2_env.pm_cwd'     # Check working dirs