ADR-061: PM2 Process Isolation Safeguards
Status
Accepted
Context
On 2026-02-17, during the v0.15.0 production deployment, a critical incident terminated ALL PM2 processes on the production server, not just the flyer-crawler processes. This caused unplanned downtime for multiple applications, including stock-alert.projectium.com.
Problem Statement
Production and test environments share the same PM2 daemon on the server. This creates a risk where deployment scripts that operate on PM2 processes can accidentally affect processes belonging to other applications or environments.
Pre-existing Controls
Prior to the incident, PM2 process isolation controls were already in place (commit b6a62a0):
- Production workflows used whitelist-based filtering with explicit process names
- Test workflows filtered by the -test suffix pattern
- CLAUDE.md documented the prohibition of pm2 stop all, pm2 delete all, and pm2 restart all
Despite these controls being present in the codebase and included in v0.15.0, the incident still occurred. The leading hypothesis is that the Gitea runner executed a cached/older version of the workflow file.
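The two pre-existing filtering strategies can be sketched as follows. This is a minimal illustration, not the code from the workflows: the whitelist entries, the function names, and the sample process names are assumptions made for the example.

```javascript
// Hypothetical whitelist; the real process names live in the workflow files.
const PROD_WHITELIST = [
  'flyer-crawler-api',
  'flyer-crawler-worker',
  'flyer-crawler-analytics-worker',
];

// Production: only exact whitelist matches are ever selected.
function selectProdTargets(list) {
  return list.filter((p) => PROD_WHITELIST.includes(p.name));
}

// Test: only names that carry the -test suffix are selected.
function selectTestTargets(list) {
  return list.filter((p) => typeof p.name === 'string' && p.name.endsWith('-test'));
}

// Sample data shaped like trimmed `pm2 jlist` entries (names are made up).
const sample = [
  { name: 'flyer-crawler-api' },
  { name: 'flyer-crawler-api-test' },
  { name: 'stock-alert' },
];

console.log(selectProdTargets(sample).map((p) => p.name)); // [ 'flyer-crawler-api' ]
console.log(selectTestTargets(sample).map((p) => p.name)); // [ 'flyer-crawler-api-test' ]
```

In both cases a process belonging to another application (here the made-up stock-alert entry) is never selected, which is why a filter bug alone does not readily explain the incident.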
Requirements
- Prevent accidental deletion of processes from other applications or environments
- Provide audit trail for forensic analysis when incidents occur
- Enable automatic abort when dangerous conditions are detected
- Maintain visibility into PM2 operations during deployment
- Work correctly even if the filtering logic itself is bypassed
Decision
Implement a defense-in-depth strategy with 5 layers of safeguards in all deployment workflows that interact with PM2 processes.
Safeguard Layers
Layer 1: Workflow Metadata Logging
Log workflow execution metadata at the start of each deployment:
echo "=== WORKFLOW METADATA ==="
echo "Workflow file: deploy-to-prod.yml"
echo "Workflow file hash: $(sha256sum .gitea/workflows/deploy-to-prod.yml | cut -d' ' -f1)"
echo "Git commit: $(git rev-parse HEAD)"
echo "Git branch: $(git rev-parse --abbrev-ref HEAD)"
echo "Timestamp: $(date -u '+%Y-%m-%d %H:%M:%S UTC')"
echo "Actor: ${{ gitea.actor }}"
echo "=== END METADATA ==="
Purpose: Enables verification of which workflow version was actually executed.
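During forensic analysis, the logged hash can be compared against the workflow file at the logged commit. The sketch below shows one way to do that in Node.js; the function names are hypothetical and the inline file content stands in for the real workflow file.

```javascript
const crypto = require('crypto');

// Same digest that `sha256sum <file> | cut -d' ' -f1` prints: lowercase hex SHA-256.
function sha256Hex(content) {
  return crypto.createHash('sha256').update(content).digest('hex');
}

// Returns true when the hash found in the deployment log matches the
// workflow file content checked out at the logged commit.
function workflowMatchesLog(fileContent, loggedHash) {
  return sha256Hex(fileContent) === loggedHash.trim().toLowerCase();
}

// Example with inline content; in practice fileContent would come from
// fs.readFileSync('.gitea/workflows/deploy-to-prod.yml') at the logged commit.
const content = 'name: example\n';
const logged = sha256Hex(content);

console.log(workflowMatchesLog(content, logged)); // true
console.log(workflowMatchesLog(content + '# edited\n', logged)); // false
```

Note that the logged hash describes the file in the checked-out workspace. Whether the runner actually executed that same definition is a separate assumption that the hash alone cannot confirm.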
Layer 2: Pre-Cleanup PM2 State Logging
Capture full PM2 process list before any modifications:
echo "=== PRE-CLEANUP PM2 STATE ==="
pm2 jlist
echo "=== END PRE-CLEANUP STATE ==="
Purpose: Provides forensic evidence of system state before cleanup.
Layer 3: Process Count Validation (SAFETY ABORT)
Abort deployment if the filter would delete ALL processes and there are more than 3 processes total:
const totalProcesses = list.length;
if (targetProcesses.length === totalProcesses && totalProcesses > 3) {
  console.error('SAFETY ABORT: Filter would delete ALL processes!');
  console.error(
    'Total processes: ' + totalProcesses + ', Target processes: ' + targetProcesses.length,
  );
  process.exit(1);
}
Purpose: Catches filter bugs or unexpected conditions automatically.
Threshold Rationale: A threshold of 3 allows normal operation when only the expected processes exist (API, Worker, Analytics Worker) while catching anomalies when the server hosts additional applications.
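To make the threshold behaviour concrete, the abort condition can be isolated as a pure function and exercised with a few inputs. This is a sketch under assumptions: shouldAbort, SAFETY_THRESHOLD, and the generated process names are illustrative and are not taken from the workflows.

```javascript
// Threshold from this ADR: three expected processes (API, Worker, Analytics Worker).
const SAFETY_THRESHOLD = 3;

// Returns true when the cleanup must be aborted: the filter matched every
// process AND the server hosts more processes than the expected set.
function shouldAbort(list, targetProcesses, threshold = SAFETY_THRESHOLD) {
  const totalProcesses = list.length;
  return targetProcesses.length === totalProcesses && totalProcesses > threshold;
}

// Builds n placeholder process entries (names are made up).
const makeProcesses = (n) => Array.from({ length: n }, (_, i) => ({ name: 'proc-' + i }));

const three = makeProcesses(3);
const five = makeProcesses(5);

console.log(shouldAbort(three, three)); // false: all 3 targeted, but not above the threshold
console.log(shouldAbort(five, five)); // true: all 5 targeted, above the threshold
console.log(shouldAbort(five, five.slice(0, 3))); // false: only a subset is targeted
```

The first case is the one the threshold rationale describes: a server that hosts only the three expected processes is allowed to have all of them cleaned up.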
Layer 4: Explicit Name Verification
Log the exact name, status, and PM2 ID of each process that will be affected:
console.log('Found ' + targetProcesses.length + ' PRODUCTION processes to clean:');
targetProcesses.forEach((p) => {
  console.log(
    '  - ' + p.name + ' (status: ' + p.pm2_env.status + ', pm_id: ' + p.pm2_env.pm_id + ')',
  );
});
Purpose: Provides clear visibility into cleanup operations.
Layer 5: Post-Cleanup Verification
After cleanup, verify environment isolation was maintained:
echo "=== POST-CLEANUP VERIFICATION ==="
pm2 jlist | node -e "
const list = JSON.parse(require('fs').readFileSync(0, 'utf-8'));
const prodProcesses = list.filter(p => p.name && p.name.startsWith('flyer-crawler-') && !p.name.endsWith('-test'));
console.log('Production processes after cleanup: ' + prodProcesses.length);
"
echo "=== END POST-CLEANUP VERIFICATION ==="
Purpose: Surfaces cross-environment contamination immediately: an unexpected production process count shows up in the workflow log.
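A stricter variant of this check could compare the pre-cleanup and post-cleanup process lists and report anything that disappeared without being a cleanup target. The sketch below is one possible approach, not part of the implemented workflows; findCollateralLosses and the sample names are hypothetical.

```javascript
// Returns the names of processes that were NOT cleanup targets but are
// missing from the post-cleanup list. An empty array means isolation held.
function findCollateralLosses(before, after, targetNames) {
  const targets = new Set(targetNames);
  const survivors = new Set(after.map((p) => p.name));
  return before
    .map((p) => p.name)
    .filter((name) => !targets.has(name) && !survivors.has(name));
}

// Sample data shaped like trimmed `pm2 jlist` entries (names are made up).
const before = [
  { name: 'flyer-crawler-api' },
  { name: 'flyer-crawler-api-test' },
  { name: 'stock-alert' },
];
const targets = ['flyer-crawler-api'];

// Healthy cleanup: only the target is gone.
const healthyAfter = [{ name: 'flyer-crawler-api-test' }, { name: 'stock-alert' }];
console.log(findCollateralLosses(before, healthyAfter, targets)); // []

// Incident-like cleanup: everything is gone.
console.log(findCollateralLosses(before, [], targets));
// [ 'flyer-crawler-api-test', 'stock-alert' ]
```

Feeding the Layer 2 snapshot in as the before list would let the same log data serve both forensic and verification purposes.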
Consequences
Positive
- Automatic Prevention: Layer 3 (process count validation) can prevent catastrophic process deletion automatically, without human intervention.
- Forensic Capability: Layers 1 and 2 provide the data needed to determine root cause after an incident.
- Visibility: Layers 4 and 5 make PM2 operations transparent in workflow logs.
- Fail-Safe Design: Even if individual layers fail, other layers provide backup protection.
- Non-Breaking: Safeguards are additive and do not change the existing filtering logic.
Negative
- Increased Log Volume: Additional logging increases workflow output size.
- Minor Performance Impact: Extra PM2 commands add a few seconds to deployment time.
- Threshold Tuning: The threshold of 3 may need adjustment if the expected process count changes.
Neutral
- Root Cause Still Unknown: These safeguards mitigate the risk but do not definitively explain why the original incident occurred.
- No Structural Changes: The underlying architecture (shared PM2 daemon) remains unchanged.
Alternatives Considered
PM2 Namespaces
PM2 supports namespaces to isolate groups of processes. This would provide complete isolation but requires:
- Changes to ecosystem config files
- Changes to all PM2 commands in workflows
- Potential breaking changes to monitoring and log aggregation
Decision: Deferred for future consideration. Current safeguards provide adequate protection.
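For reference, the ecosystem config change this alternative would involve might look roughly like the fragment below. It is only a sketch: it assumes the namespace app option available in recent PM2 releases, which should be checked against the installed PM2 version, and the app names, script path, and namespace names are hypothetical.

```javascript
// ecosystem.config.js (hypothetical app names, script path, and namespaces)
module.exports = {
  apps: [
    {
      name: 'flyer-crawler-api',
      script: './dist/server.js',
      // Assumed PM2 option: groups this app under a production namespace.
      namespace: 'flyer-crawler-prod',
    },
    {
      name: 'flyer-crawler-api-test',
      script: './dist/server.js',
      // Assumed PM2 option: keeps test processes in their own namespace.
      namespace: 'flyer-crawler-test',
    },
  ],
};
```

Workflow commands would then need to address a namespace rather than individual process names, which is the second change listed above.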
Separate PM2 Daemons
Running a separate PM2 daemon per application would eliminate cross-application risk entirely.
Decision: Not implemented due to increased operational complexity and the current safeguards being sufficient.
Deployment Locks
Implementing mutex-style locks to prevent concurrent deployments could prevent race conditions.
Decision: Not implemented as the current safeguards address the identified risk. May be reconsidered if concurrent deployment issues are observed.
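If deployment locks are reconsidered, a mutex-style lock could be built on exclusive file creation. The following is a minimal sketch, not a proposed implementation: the helper names and the lock path are hypothetical, and it does not handle stale locks left behind by a crashed deployment.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Tries to take the lock by creating the file exclusively: the 'wx' flag
// fails with EEXIST if the file already exists. Returns true on success.
function acquireLock(lockPath) {
  try {
    const fd = fs.openSync(lockPath, 'wx');
    fs.writeSync(fd, String(process.pid)); // record the owner for debugging
    fs.closeSync(fd);
    return true;
  } catch (err) {
    if (err.code === 'EEXIST') return false; // another deployment holds the lock
    throw err;
  }
}

// Releases the lock; does nothing if the file is already gone.
function releaseLock(lockPath) {
  fs.rmSync(lockPath, { force: true });
}

// Hypothetical lock location in a fresh temp directory, for demonstration only.
// A real deployment would use a stable path shared by all workflows.
const lockDir = fs.mkdtempSync(path.join(os.tmpdir(), 'deploy-lock-'));
const lockPath = path.join(lockDir, 'deploy.lock');

console.log(acquireLock(lockPath)); // true: the first deployment gets the lock
console.log(acquireLock(lockPath)); // false: a concurrent deployment is refused
releaseLock(lockPath);
console.log(acquireLock(lockPath)); // true: the lock is available again
releaseLock(lockPath);
```

Exclusive creation is atomic on a local filesystem, so two workflows racing for the same path cannot both succeed.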
Implementation
Files Modified
| File | Changes |
|---|---|
| .gitea/workflows/deploy-to-prod.yml | All 5 safeguard layers |
| .gitea/workflows/deploy-to-test.yml | All 5 safeguard layers |
| .gitea/workflows/manual-deploy-major.yml | All 5 safeguard layers |
Validation
A standalone test file validates the safeguard logic:
- File: tests/qa/test-pm2-safeguard-logic.js
- Coverage: 11 scenarios covering normal operations and dangerous edge cases
- Result: All tests pass
Related Documentation
- Incident Report: 2026-02-17
- PM2 Incident Response Runbook
- Session Summary
- CLAUDE.md - PM2 Process Isolation
- ADR-014: Containerization and Deployment Strategy
References
- PM2 Documentation: https://pm2.keymetrics.io/docs/usage/application-declaration/
- Defense in Depth: https://en.wikipedia.org/wiki/Defense_in_depth_(computing)