
# ADR-061: PM2 Process Isolation Safeguards

## Status

Accepted

## Context

On 2026-02-17, a critical incident occurred during the v0.15.0 production deployment: ALL PM2 processes on the production server were terminated, not just the flyer-crawler processes. This caused unplanned downtime for multiple applications, including `stock-alert.projectium.com`.

### Problem Statement

Production and test environments share the same PM2 daemon on the server. This creates a risk where deployment scripts that operate on PM2 processes can accidentally affect processes belonging to other applications or environments.

### Pre-existing Controls

Prior to the incident, PM2 process isolation controls were already in place (commit `b6a62a0`):

- Production workflows used whitelist-based filtering with explicit process names
- Test workflows filtered by `-test` suffix pattern
- CLAUDE.md documented the prohibition of `pm2 stop all`, `pm2 delete all`, and `pm2 restart all`

Despite these controls being present in the codebase and included in v0.15.0, the incident still occurred. The leading hypothesis is that the Gitea runner executed a cached/older version of the workflow file.
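For illustration, the two pre-existing filtering styles can be sketched roughly as follows. This is not the code from commit `b6a62a0`: the whitelist entries are hypothetical names chosen to match the three expected processes (API, Worker, Analytics Worker), and `selectProdTargets` and `selectTestTargets` are assumed helper names.

```javascript
// Hypothetical whitelist; the real process names live in the workflow files.
const PROD_WHITELIST = [
  'flyer-crawler-api',
  'flyer-crawler-worker',
  'flyer-crawler-analytics-worker',
];

// Production: only names that appear in the explicit whitelist are targeted.
function selectProdTargets(list) {
  return list.filter((p) => PROD_WHITELIST.includes(p.name));
}

// Test: only names ending in the -test suffix are targeted.
function selectTestTargets(list) {
  return list.filter((p) => typeof p.name === 'string' && p.name.endsWith('-test'));
}

// Sample data shaped like entries from `pm2 jlist` (names are made up).
const sample = [
  { name: 'flyer-crawler-api' },
  { name: 'flyer-crawler-api-test' },
  { name: 'stock-alert-api' },
];
console.log(selectProdTargets(sample).map((p) => p.name));
console.log(selectTestTargets(sample).map((p) => p.name));
```

In this sketch, a process belonging to another application is never selected by the production filter, because its name is not in the whitelist.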

### Requirements

1. Prevent accidental deletion of processes from other applications or environments
2. Provide audit trail for forensic analysis when incidents occur
3. Enable automatic abort when dangerous conditions are detected
4. Maintain visibility into PM2 operations during deployment
5. Work correctly even if the filtering logic itself is bypassed

## Decision

Implement a defense-in-depth strategy with layered safeguards (the six layers described below) in all deployment workflows that interact with PM2 processes.

### Safeguard Layers

#### Layer 1: Workflow Metadata Logging

Log workflow execution metadata at the start of each deployment:

```bash
echo "=== WORKFLOW METADATA ==="
echo "Workflow file: deploy-to-prod.yml"
echo "Workflow file hash: $(sha256sum .gitea/workflows/deploy-to-prod.yml | cut -d' ' -f1)"
echo "Git commit: $(git rev-parse HEAD)"
echo "Git branch: $(git rev-parse --abbrev-ref HEAD)"
echo "Timestamp: $(date -u '+%Y-%m-%d %H:%M:%S UTC')"
echo "Actor: ${{ gitea.actor }}"
echo "=== END METADATA ==="
```

**Purpose**: Enables verification of which workflow version was actually executed.
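One way to use the logged hash when testing the stale-workflow hypothesis is to hash the workflow file as it exists at the deployed commit (for example, the output of `git show <commit>:.gitea/workflows/deploy-to-prod.yml`) and compare it with the value in the run log. The sketch below assumes that approach; `sha256Hex`, `matchesLoggedHash`, and `fileMatchesLoggedHash` are hypothetical helper names, not part of the workflows.

```javascript
const crypto = require('crypto');
const fs = require('fs');

// Hex SHA-256 of a string or Buffer, the same digest sha256sum prints first.
function sha256Hex(content) {
  return crypto.createHash('sha256').update(content).digest('hex');
}

// True when the content hashes to the value recorded in the workflow log.
function matchesLoggedHash(content, loggedHash) {
  return sha256Hex(content) === loggedHash.trim().toLowerCase();
}

// Convenience wrapper that reads the file from disk before comparing.
function fileMatchesLoggedHash(filePath, loggedHash) {
  return matchesLoggedHash(fs.readFileSync(filePath), loggedHash);
}

// Example with inline content; a real check would pass the workflow file path
// and the hash copied from the "Workflow file hash" log line.
const content = 'name: example workflow\n';
console.log(matchesLoggedHash(content, sha256Hex(content)));
```

A mismatch between the logged hash and the hash at the deployed commit would support the hypothesis that the runner executed a different version of the file.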

#### Layer 2: Pre-Cleanup PM2 State Logging

Capture full PM2 process list before any modifications:

```bash
echo "=== PRE-CLEANUP PM2 STATE ==="
pm2 jlist
echo "=== END PRE-CLEANUP STATE ==="
```

**Purpose**: Provides forensic evidence of system state before cleanup.

#### Layer 3: Process Count Validation (SAFETY ABORT)

Abort deployment if the filter would delete ALL processes and there are more than 3 processes total:

```javascript
const totalProcesses = list.length;
if (targetProcesses.length === totalProcesses && totalProcesses > 3) {
  console.error('SAFETY ABORT: Filter would delete ALL processes!');
  console.error(
    'Total processes: ' + totalProcesses + ', Target processes: ' + targetProcesses.length,
  );
  process.exit(1);
}
```

**Purpose**: Catches filter bugs or unexpected conditions automatically.

**Threshold Rationale**: A threshold of 3 allows normal operation when only the expected processes exist (API, Worker, Analytics Worker) while catching anomalies when the server hosts additional applications.
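To make the threshold behavior concrete, the same condition can be wrapped in a small function and exercised against a few process counts. This is a sketch: `shouldAbort` and `makeProcesses` are hypothetical names, and the `threshold` parameter is added here only so the value of 3 is visible as a default.

```javascript
// Returns true when cleanup should be aborted: the filter matched every
// process on the daemon and the daemon hosts more than `threshold` processes.
function shouldAbort(list, targetProcesses, threshold = 3) {
  const totalProcesses = list.length;
  return targetProcesses.length === totalProcesses && totalProcesses > threshold;
}

// Builds n placeholder process entries for the examples below.
const makeProcesses = (n) => Array.from({ length: n }, (_, i) => ({ name: 'proc-' + i }));

// Only the three expected processes exist and all are targeted: allowed.
console.log(shouldAbort(makeProcesses(3), makeProcesses(3))); // false
// Five processes on the daemon and the filter matched all five: abort.
console.log(shouldAbort(makeProcesses(5), makeProcesses(5))); // true
// Five processes, three targeted: allowed.
console.log(shouldAbort(makeProcesses(5), makeProcesses(3))); // false
```

Note that the condition fires only when every process is targeted. A filter that wrongly matched four of five processes would pass this check, which is why the logging layers remain necessary.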

#### Layer 4: Explicit Name Verification

Log the exact name, status, and PM2 ID of each process that will be affected:

```javascript
console.log('Found ' + targetProcesses.length + ' PRODUCTION processes to clean:');
targetProcesses.forEach((p) => {
  console.log(
    '  - ' + p.name + ' (status: ' + p.pm2_env.status + ', pm_id: ' + p.pm2_env.pm_id + ')',
  );
});
```

**Purpose**: Provides clear visibility into cleanup operations.

#### Layer 5: Post-Cleanup Verification

After cleanup, verify environment isolation was maintained:

```bash
echo "=== POST-CLEANUP VERIFICATION ==="
pm2 jlist | node -e "
  const list = JSON.parse(require('fs').readFileSync(0, 'utf-8'));
  const prodProcesses = list.filter(p => p.name && p.name.startsWith('flyer-crawler-') && !p.name.endsWith('-test'));
  console.log('Production processes after cleanup: ' + prodProcesses.length);
"
echo "=== END POST-CLEANUP VERIFICATION ==="
```

**Purpose**: Makes cross-environment contamination visible in the workflow log immediately after cleanup.
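The verification step logs how many production processes remain. A stricter variant, sketched here as an assumption rather than the workflow's actual code, compares the pre-cleanup and post-cleanup `pm2 jlist` snapshots and reports any process that disappeared without being an intended target. The function name `findUnexpectedRemovals` and the sample process names are hypothetical.

```javascript
// Names present before cleanup, absent afterwards, and not among the intended targets.
function findUnexpectedRemovals(before, after, targetNames) {
  const remaining = new Set(after.map((p) => p.name));
  const targets = new Set(targetNames);
  return before
    .map((p) => p.name)
    .filter((name) => !remaining.has(name) && !targets.has(name));
}

// Snapshots shaped like parsed `pm2 jlist` output (names are made up).
const before = [
  { name: 'flyer-crawler-api' },
  { name: 'flyer-crawler-api-test' },
  { name: 'stock-alert-api' },
];
const after = [{ name: 'flyer-crawler-api-test' }];

// Only flyer-crawler-api was meant to be removed, so stock-alert-api is flagged.
console.log(findUnexpectedRemovals(before, after, ['flyer-crawler-api'])); // [ 'stock-alert-api' ]
```

A non-empty result could be used to fail the workflow step, turning the post-cleanup log into an active check.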

#### Layer 6: PM2 Process List Persistence

**CRITICAL**: Save the PM2 process list after every state-changing operation:

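```javascript
// Assumption: `exec` is a synchronous helper (here, Node's execSync), so each
// pm2 command finishes before the next one starts.
const { execSync: exec } = require('child_process');

// Run `pm2 save` after any pm2 start/stop/restart/delete operation.
// Example: after the cleanup loop completes
targetProcesses.forEach((p) => {
  exec('pm2 delete ' + p.pm2_env.pm_id);
});
exec('pm2 save'); // Persist all deletions
```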

**Purpose**: Ensures PM2 process state persists across daemon restarts, server reboots, and internal reconciliation events.

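**Why This Matters**: PM2 keeps its process list in memory. `pm2 save` writes that list to a dump file, which `pm2 resurrect` (or the startup hook) restores from. Without `pm2 save`, changes to the process list are ephemeral: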

- Daemon restart → All unsaved processes disappear
- Server reboot → Process list reverts to last saved state
- PM2 internal reconciliation → Unsaved processes may be lost

**Pattern**: Every `pm2 start`, `pm2 restart`, `pm2 stop`, or `pm2 delete` MUST be followed by `pm2 save`.
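One way to make this pattern hard to forget is to route every state-changing command through a helper that always saves afterwards. The sketch below assumes such a helper; `pm2WithSave` is a hypothetical name, and the `run` parameter exists only so the helper can be exercised without a PM2 daemon.

```javascript
const { execSync } = require('child_process');

// Runs one state-changing pm2 command, then persists the process list.
// `run` defaults to a real shell call and can be replaced by a stub.
function pm2WithSave(args, run = (cmd) => execSync(cmd, { stdio: 'inherit' })) {
  run('pm2 ' + args);
  run('pm2 save');
}

// Dry run with a recording stub instead of a real shell.
const issued = [];
pm2WithSave('delete 4', (cmd) => issued.push(cmd));
console.log(issued); // [ 'pm2 delete 4', 'pm2 save' ]
```

When several commands run in a loop, a single `pm2 save` after the loop has the same effect on the saved state as saving after each command; the helper trades a little extra time for not having to remember the final save.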

## Consequences

### Positive

1. **Automatic Prevention**: Layer 3 (process count validation) can prevent catastrophic process deletion automatically, without human intervention.

2. **Forensic Capability**: Layers 1 and 2 provide the data needed to determine root cause after an incident.

3. **Visibility**: Layers 4 and 5 make PM2 operations transparent in workflow logs.

4. **Fail-Safe Design**: Even if individual layers fail, other layers provide backup protection.

5. **Non-Breaking**: Safeguards are additive and do not change the existing filtering logic.

### Negative

1. **Increased Log Volume**: Additional logging increases workflow output size.

2. **Minor Performance Impact**: Extra PM2 commands add a few seconds to deployment time.

3. **Threshold Tuning**: The threshold of 3 may need adjustment if the expected process count changes.

### Neutral

1. **Root Cause Still Unknown**: These safeguards mitigate the risk but do not definitively explain why the original incident occurred.

2. **No Structural Changes**: The underlying architecture (shared PM2 daemon) remains unchanged.

## Alternatives Considered

### PM2 Namespaces

PM2 supports namespaces to isolate groups of processes. This would provide complete isolation but requires:

- Changes to ecosystem config files
- Changes to all PM2 commands in workflows
- Potential breaking changes to monitoring and log aggregation

**Decision**: Deferred for future consideration. Current safeguards provide adequate protection.

### Separate PM2 Daemons

Running a separate PM2 daemon per application would eliminate cross-application risk entirely.

**Decision**: Not implemented due to increased operational complexity and the current safeguards being sufficient.

### Deployment Locks

Implementing mutex-style locks to prevent concurrent deployments could prevent race conditions.

**Decision**: Not implemented as the current safeguards address the identified risk. May be reconsidered if concurrent deployment issues are observed.
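If this alternative is revisited, a mutex-style lock could be as simple as an exclusively created lock file. The following is a minimal sketch under that assumption, not a proposed implementation; `acquireLock`, `releaseLock`, and the lock path are hypothetical.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Tries to take the lock by creating the file exclusively ('wx' fails if it exists).
// Returns true on success, false when another deployment already holds the lock.
function acquireLock(lockPath) {
  try {
    const fd = fs.openSync(lockPath, 'wx');
    fs.writeSync(fd, String(process.pid)); // record the holder for debugging
    fs.closeSync(fd);
    return true;
  } catch (err) {
    if (err.code === 'EEXIST') return false;
    throw err;
  }
}

// Removes the lock file; a missing file is not an error.
function releaseLock(lockPath) {
  try {
    fs.unlinkSync(lockPath);
  } catch (err) {
    if (err.code !== 'ENOENT') throw err;
  }
}

// Hypothetical lock location; a real deployment would use a fixed shared path.
const lockPath = path.join(os.tmpdir(), 'pm2-deploy-' + process.pid + '.lock');
releaseLock(lockPath); // clear any leftover from a previous run of this demo
console.log(acquireLock(lockPath)); // true: first deployment gets the lock
console.log(acquireLock(lockPath)); // false: second deployment is refused
releaseLock(lockPath);
```

The sketch does not handle stale locks: a deployment that crashes before calling `releaseLock` leaves the file behind, so a production version would need a timeout or a check that the recorded PID is still alive.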

## Implementation

### Files Modified

| File                                       | Changes                |
| ------------------------------------------ | ---------------------- |
| `.gitea/workflows/deploy-to-prod.yml`      | All 5 safeguard layers |
| `.gitea/workflows/deploy-to-test.yml`      | All 5 safeguard layers |
| `.gitea/workflows/manual-deploy-major.yml` | All 5 safeguard layers |

### Validation

A standalone test file validates the safeguard logic:

- **File**: `tests/qa/test-pm2-safeguard-logic.js`
- **Coverage**: 11 scenarios covering normal operations and dangerous edge cases
- **Result**: All tests pass

## Related Documentation

- [Incident Report: 2026-02-17](../operations/INCIDENT-2026-02-17-PM2-PROCESS-KILL.md)
- [PM2 Incident Response Runbook](../operations/PM2-INCIDENT-RESPONSE.md)
- [Session Summary](../archive/sessions/PM2_SAFEGUARDS_SESSION_2026-02-17.md)
- [CLAUDE.md - PM2 Process Isolation](../../CLAUDE.md#pm2-process-isolation-productiontest-servers)
- [ADR-014: Containerization and Deployment Strategy](0014-containerization-and-deployment-strategy.md)

## References

- PM2 Documentation: https://pm2.keymetrics.io/docs/usage/application-declaration/
- Defense in Depth: https://en.wikipedia.org/wiki/Defense_in_depth_(computing)