torbo/flyer-crawler.projectium.com

Torben Sorensen c059b30201

PM2 Process Isolation

2026-02-17 20:49:01 -08:00

ADR-061: PM2 Process Isolation Safeguards

Status

Accepted

Context

On 2026-02-17, a critical incident occurred during the v0.15.0 production deployment: ALL PM2 processes on the production server were terminated, not just the flyer-crawler processes. This caused unplanned downtime for multiple applications, including stock-alert.projectium.com.

Problem Statement

Production and test environments share the same PM2 daemon on the server. As a result, a deployment script that operates on PM2 processes can accidentally affect processes belonging to other applications or environments.

Pre-existing Controls

Prior to the incident, PM2 process isolation controls were already in place (commit b6a62a0):

Production workflows used whitelist-based filtering with explicit process names
Test workflows filtered by -test suffix pattern
CLAUDE.md documented the prohibition of pm2 stop all, pm2 delete all, and pm2 restart all

Despite these controls being present in the codebase and included in v0.15.0, the incident still occurred. The leading hypothesis is that the Gitea runner executed a cached/older version of the workflow file.
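The two filtering styles listed above can be sketched as follows. This is a minimal illustration, not the workflow code itself: the process names (including `stock-alert-api`) are hypothetical, since this ADR names the roles (API, Worker, Analytics Worker) but not the exact PM2 process names.

```javascript
// Hypothetical process names; the real whitelist lives in the workflow files.
const PRODUCTION_WHITELIST = [
  'flyer-crawler-api',
  'flyer-crawler-worker',
  'flyer-crawler-analytics-worker',
];

// Production: only names on the explicit whitelist are ever targeted.
function selectProductionTargets(list) {
  return list.filter((p) => PRODUCTION_WHITELIST.includes(p.name));
}

// Test: only names ending in "-test" are targeted.
function selectTestTargets(list) {
  return list.filter((p) => typeof p.name === 'string' && p.name.endsWith('-test'));
}

// Sample data shaped like `pm2 jlist` output (only the field used here).
const sample = [
  { name: 'flyer-crawler-api' },
  { name: 'flyer-crawler-api-test' },
  { name: 'stock-alert-api' },
];

console.log(selectProductionTargets(sample).map((p) => p.name)); // [ 'flyer-crawler-api' ]
console.log(selectTestTargets(sample).map((p) => p.name)); // [ 'flyer-crawler-api-test' ]
```

In both cases the process belonging to the other application is left out of the target list, which is the behaviour the safeguards below are meant to enforce even when a filter misbehaves.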

Requirements

Prevent accidental deletion of processes from other applications or environments
Provide audit trail for forensic analysis when incidents occur
Enable automatic abort when dangerous conditions are detected
Maintain visibility into PM2 operations during deployment
Work correctly even if the filtering logic itself is bypassed

Decision

Implement a defense-in-depth strategy with 5 layers of safeguards in all deployment workflows that interact with PM2 processes.

Safeguard Layers

Layer 1: Workflow Metadata Logging

Log workflow execution metadata at the start of each deployment:

echo "=== WORKFLOW METADATA ==="
echo "Workflow file: deploy-to-prod.yml"
echo "Workflow file hash: $(sha256sum .gitea/workflows/deploy-to-prod.yml | cut -d' ' -f1)"
echo "Git commit: $(git rev-parse HEAD)"
echo "Git branch: $(git rev-parse --abbrev-ref HEAD)"
echo "Timestamp: $(date -u '+%Y-%m-%d %H:%M:%S UTC')"
echo "Actor: ${{ gitea.actor }}"
echo "=== END METADATA ==="

Purpose: Enables verification of which workflow version was actually executed.
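One way the logged hash could be used afterwards is to compare it with the hash of the workflow file as committed. The sketch below assumes the job log has been saved as text and that the expected file content is available as a string; the helper names are illustrative and not part of the repository.

```javascript
const crypto = require('crypto');

// Equivalent of `sha256sum <file> | cut -d' ' -f1` for in-memory content.
function sha256Hex(content) {
  return crypto.createHash('sha256').update(content).digest('hex');
}

// Pull the hash out of the "Workflow file hash: ..." line of a job log.
function extractLoggedHash(logText) {
  const match = logText.match(/^Workflow file hash: ([0-9a-f]{64})$/m);
  return match ? match[1] : null;
}

// True only when the log contains a hash and it matches the expected content.
function workflowMatchesLog(logText, expectedContent) {
  const logged = extractLoggedHash(logText);
  return logged !== null && logged === sha256Hex(expectedContent);
}

// Illustrative stand-ins for a real workflow file and a real job log.
const workflowContent = 'name: Deploy to Production\n';
const log = [
  '=== WORKFLOW METADATA ===',
  'Workflow file: deploy-to-prod.yml',
  'Workflow file hash: ' + sha256Hex(workflowContent),
  '=== END METADATA ===',
].join('\n');

console.log(workflowMatchesLog(log, workflowContent)); // true
console.log(workflowMatchesLog(log, workflowContent + '# edited\n')); // false
```

A mismatch would support the hypothesis that the runner executed a different version of the workflow file than the one in the deployed commit.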

Layer 2: Pre-Cleanup PM2 State Logging

Capture full PM2 process list before any modifications:

echo "=== PRE-CLEANUP PM2 STATE ==="
pm2 jlist
echo "=== END PRE-CLEANUP STATE ==="

Purpose: Provides forensic evidence of system state before cleanup.

Layer 3: Process Count Validation (SAFETY ABORT)

Abort deployment if the filter would delete ALL processes and there are more than 3 processes total:

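// `list` is the parsed `pm2 jlist` array; `targetProcesses` is the subset
// selected by the environment filter.
const totalProcesses = list.length;
if (targetProcesses.length === totalProcesses && totalProcesses > 3) {
  console.error('SAFETY ABORT: Filter would delete ALL processes!');
  console.error(
    'Total processes: ' + totalProcesses + ', Target processes: ' + targetProcesses.length,
  );
  // A non-zero exit code fails the workflow step before any process is deleted.
  process.exit(1);
}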

Purpose: Catches filter bugs or unexpected conditions automatically.

Threshold Rationale: A threshold of 3 allows normal operation when only the expected processes exist (API, Worker, Analytics Worker) while catching anomalies when the server hosts additional applications.
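To make the threshold behaviour concrete, the condition can be isolated into a pure function and exercised with a few process counts. This is a sketch only; the function name `shouldAbort` and the generated process names are illustrative.

```javascript
// Returns true when the cleanup must be aborted: every process on the daemon
// is targeted AND the daemon hosts more processes than the threshold.
function shouldAbort(list, targetProcesses, threshold = 3) {
  const totalProcesses = list.length;
  return targetProcesses.length === totalProcesses && totalProcesses > threshold;
}

// Build n placeholder process entries.
const makeProcesses = (n) => Array.from({ length: n }, (_, i) => ({ name: 'proc-' + i }));

const three = makeProcesses(3);
const five = makeProcesses(5);

// Only the 3 expected processes exist and all are targeted: allowed.
console.log(shouldAbort(three, three)); // false
// 5 processes exist and all are targeted: abort.
console.log(shouldAbort(five, five)); // true
// 5 processes exist but only 3 are targeted: allowed.
console.log(shouldAbort(five, five.slice(0, 3))); // false
```

Note that the check only fires when every process is targeted; a filter that wrongly selects most but not all processes would pass it, which is why the logging layers remain necessary.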

Layer 4: Explicit Name Verification

Log the exact name, status, and PM2 ID of each process that will be affected:

console.log('Found ' + targetProcesses.length + ' PRODUCTION processes to clean:');
targetProcesses.forEach((p) => {
  console.log(
    '  - ' + p.name + ' (status: ' + p.pm2_env.status + ', pm_id: ' + p.pm2_env.pm_id + ')',
  );
});

Purpose: Provides clear visibility into cleanup operations.

Layer 5: Post-Cleanup Verification

After cleanup, verify environment isolation was maintained:

echo "=== POST-CLEANUP VERIFICATION ==="
pm2 jlist | node -e "
  const list = JSON.parse(require('fs').readFileSync(0, 'utf-8'));
  const prodProcesses = list.filter(p => p.name && p.name.startsWith('flyer-crawler-') && !p.name.endsWith('-test'));
  console.log('Production processes after cleanup: ' + prodProcesses.length);
"
echo "=== END POST-CLEANUP VERIFICATION ==="

Purpose: Records how many production processes remain after cleanup, so that cross-environment contamination is visible in the workflow log immediately.
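A stricter variant, which is not part of the workflows as described here, would compare the pre-cleanup and post-cleanup snapshots and report any process that disappeared without being targeted. The sketch below assumes both snapshots are parsed `pm2 jlist` arrays; the function name and the process names are hypothetical.

```javascript
// Names present before cleanup, not targeted, yet missing afterwards.
// Comparison is by name, so multiple instances sharing a name count as one.
function findUnexpectedRemovals(before, after, targetNames) {
  const targeted = new Set(targetNames);
  const remaining = new Set(after.map((p) => p.name));
  return before
    .map((p) => p.name)
    .filter((name) => !targeted.has(name) && !remaining.has(name));
}

// Hypothetical snapshots: the cleanup targeted one process but removed two.
const before = [
  { name: 'flyer-crawler-api' },
  { name: 'flyer-crawler-api-test' },
  { name: 'stock-alert-api' },
];
const after = [{ name: 'flyer-crawler-api-test' }];

console.log(findUnexpectedRemovals(before, after, ['flyer-crawler-api']));
// [ 'stock-alert-api' ]
```

A non-empty result indicates that the cleanup touched a process outside its intended scope.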

Consequences

Positive

Automatic Prevention: Layer 3 (process count validation) can prevent catastrophic process deletion automatically, without human intervention.
Forensic Capability: Layers 1 and 2 provide the data needed to determine root cause after an incident.
Visibility: Layers 4 and 5 make PM2 operations transparent in workflow logs.
Fail-Safe Design: Even if individual layers fail, other layers provide backup protection.
Non-Breaking: Safeguards are additive and do not change the existing filtering logic.

Negative

Increased Log Volume: Additional logging increases workflow output size.
Minor Performance Impact: Extra PM2 commands add a few seconds to deployment time.
Threshold Tuning: The threshold of 3 may need adjustment if the expected process count changes.

Neutral

Root Cause Still Unknown: These safeguards mitigate the risk but do not definitively explain why the original incident occurred.
No Structural Changes: The underlying architecture (shared PM2 daemon) remains unchanged.

Alternatives Considered

PM2 Namespaces

PM2 supports namespaces to isolate groups of processes. This would provide complete isolation but requires:

Changes to ecosystem config files
Changes to all PM2 commands in workflows
Potential breaking changes to monitoring and log aggregation

Decision: Deferred for future consideration. Current safeguards provide adequate protection.
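For reference, the ecosystem config change could look roughly like the sketch below. It assumes PM2's `namespace` app option in the ecosystem file; the app names, script paths, and namespace values are hypothetical and do not reflect the project's actual configuration.

```javascript
// ecosystem.config.js (sketch). App names, script paths, and namespace
// values are hypothetical.
const config = {
  apps: [
    {
      name: 'flyer-crawler-api',
      script: './dist/server.js',
      namespace: 'flyer-crawler-prod',
    },
    {
      name: 'flyer-crawler-api-test',
      script: './dist/server.js',
      namespace: 'flyer-crawler-test',
    },
  ],
};

module.exports = config;
```

Workflow commands would then need to address the namespace rather than individual process names, which is the second change listed above.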

Separate PM2 Daemons

Running a separate PM2 daemon per application would eliminate cross-application risk entirely.

Decision: Not implemented due to increased operational complexity and the current safeguards being sufficient.

Deployment Locks

Implementing mutex-style locks to prevent concurrent deployments could prevent race conditions.

Decision: Not implemented as the current safeguards address the identified risk. May be reconsidered if concurrent deployment issues are observed.
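If this alternative is revisited, a simple lock could be built on exclusive file creation. The sketch below is one possible approach under stated assumptions: the lock path is hypothetical (a temporary directory is used here so the example is self-contained), and stale-lock handling after a crashed deployment is not covered.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Try to take the lock. The 'wx' flag makes the open fail with EEXIST if the
// file already exists, so only one caller can create it.
function acquireLock(lockPath) {
  try {
    const fd = fs.openSync(lockPath, 'wx');
    fs.writeSync(fd, String(process.pid)); // record the holder for debugging
    fs.closeSync(fd);
    return true;
  } catch (err) {
    if (err.code === 'EEXIST') return false;
    throw err;
  }
}

// Release the lock by removing the file.
function releaseLock(lockPath) {
  fs.unlinkSync(lockPath);
}

// Hypothetical lock location; a real deployment would use a fixed path.
const lockDir = fs.mkdtempSync(path.join(os.tmpdir(), 'deploy-lock-'));
const lockPath = path.join(lockDir, 'pm2-deploy.lock');

console.log(acquireLock(lockPath)); // true: first deployment takes the lock
console.log(acquireLock(lockPath)); // false: a concurrent deployment is refused
releaseLock(lockPath);
console.log(acquireLock(lockPath)); // true: lock is available again
releaseLock(lockPath);
```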

Implementation

Files Modified

| File | Changes |
| --- | --- |
| `.gitea/workflows/deploy-to-prod.yml` | All 5 safeguard layers |
| `.gitea/workflows/deploy-to-test.yml` | All 5 safeguard layers |
| `.gitea/workflows/manual-deploy-major.yml` | All 5 safeguard layers |

Validation

A standalone test file validates the safeguard logic:

File: tests/qa/test-pm2-safeguard-logic.js
Coverage: 11 scenarios covering normal operations and dangerous edge cases
Result: All tests pass

References

PM2 Documentation: https://pm2.keymetrics.io/docs/usage/application-declaration/
Defense in Depth: https://en.wikipedia.org/wiki/Defense_in_depth_(computing)
