PM2 Process Isolation

2026-02-17 20:46:28 -08:00
parent 93ad624658
commit c059b30201
11 changed files with 2228 additions and 7 deletions


@@ -47,6 +47,14 @@ Production operations and deployment:
- [Logstash Troubleshooting](operations/LOGSTASH-TROUBLESHOOTING.md) - Debugging logs
- [Monitoring](operations/MONITORING.md) - Bugsink, health checks, observability
**Incident Response**:
- [PM2 Incident Response Runbook](operations/PM2-INCIDENT-RESPONSE.md) - Step-by-step procedures for PM2 incidents
**Incident Reports**:
- [2026-02-17 PM2 Process Kill](operations/INCIDENT-2026-02-17-PM2-PROCESS-KILL.md) - ALL PM2 processes killed during v0.15.0 deployment (Mitigated)
**NGINX Reference Configs** (in repository root):
- `etc-nginx-sites-available-flyer-crawler.projectium.com` - Production server config


@@ -0,0 +1,199 @@
# ADR-061: PM2 Process Isolation Safeguards
## Status
Accepted
## Context
On 2026-02-17, a critical incident occurred during v0.15.0 production deployment where ALL PM2 processes on the production server were terminated, not just flyer-crawler processes. This caused unplanned downtime for multiple applications including `stock-alert.projectium.com`.
### Problem Statement
Production and test environments share the same PM2 daemon on the server. This creates a risk where deployment scripts that operate on PM2 processes can accidentally affect processes belonging to other applications or environments.
### Pre-existing Controls
Prior to the incident, PM2 process isolation controls were already in place (commit `b6a62a0`):
- Production workflows used whitelist-based filtering with explicit process names
- Test workflows filtered by `-test` suffix pattern
- CLAUDE.md documented the prohibition of `pm2 stop all`, `pm2 delete all`, and `pm2 restart all`
Despite these controls being present in the codebase and included in v0.15.0, the incident still occurred. The leading hypothesis is that the Gitea runner executed a cached/older version of the workflow file.
### Requirements
1. Prevent accidental deletion of processes from other applications or environments
2. Provide audit trail for forensic analysis when incidents occur
3. Enable automatic abort when dangerous conditions are detected
4. Maintain visibility into PM2 operations during deployment
5. Work correctly even if the filtering logic itself is bypassed
## Decision
Implement a defense-in-depth strategy with 5 layers of safeguards in all deployment workflows that interact with PM2 processes.
### Safeguard Layers
#### Layer 1: Workflow Metadata Logging
Log workflow execution metadata at the start of each deployment:
```bash
echo "=== WORKFLOW METADATA ==="
echo "Workflow file: deploy-to-prod.yml"
echo "Workflow file hash: $(sha256sum .gitea/workflows/deploy-to-prod.yml | cut -d' ' -f1)"
echo "Git commit: $(git rev-parse HEAD)"
echo "Git branch: $(git rev-parse --abbrev-ref HEAD)"
echo "Timestamp: $(date -u '+%Y-%m-%d %H:%M:%S UTC')"
echo "Actor: ${{ gitea.actor }}"
echo "=== END METADATA ==="
```
**Purpose**: Enables verification of which workflow version was actually executed.
#### Layer 2: Pre-Cleanup PM2 State Logging
Capture full PM2 process list before any modifications:
```bash
echo "=== PRE-CLEANUP PM2 STATE ==="
pm2 jlist
echo "=== END PRE-CLEANUP STATE ==="
```
**Purpose**: Provides forensic evidence of system state before cleanup.
#### Layer 3: Process Count Validation (SAFETY ABORT)
Abort deployment if the filter would delete ALL processes and there are more than 3 processes total:
```javascript
const totalProcesses = list.length;
if (targetProcesses.length === totalProcesses && totalProcesses > 3) {
console.error('SAFETY ABORT: Filter would delete ALL processes!');
console.error(
'Total processes: ' + totalProcesses + ', Target processes: ' + targetProcesses.length,
);
process.exit(1);
}
```
**Purpose**: Catches filter bugs or unexpected conditions automatically.
**Threshold Rationale**: A threshold of 3 allows normal operation when only the expected processes exist (API, Worker, Analytics Worker) while catching anomalies when the server hosts additional applications.
#### Layer 4: Explicit Name Verification
Log the exact name, status, and PM2 ID of each process that will be affected:
```javascript
console.log('Found ' + targetProcesses.length + ' PRODUCTION processes to clean:');
targetProcesses.forEach((p) => {
console.log(
' - ' + p.name + ' (status: ' + p.pm2_env.status + ', pm_id: ' + p.pm2_env.pm_id + ')',
);
});
```
**Purpose**: Provides clear visibility into cleanup operations.
#### Layer 5: Post-Cleanup Verification
After cleanup, verify environment isolation was maintained:
```bash
echo "=== POST-CLEANUP VERIFICATION ==="
pm2 jlist | node -e "
const list = JSON.parse(require('fs').readFileSync(0, 'utf-8'));
const prodProcesses = list.filter(p => p.name && p.name.startsWith('flyer-crawler-') && !p.name.endsWith('-test'));
console.log('Production processes after cleanup: ' + prodProcesses.length);
"
echo "=== END POST-CLEANUP VERIFICATION ==="
```
**Purpose**: Immediately identifies cross-environment contamination.
## Consequences
### Positive
1. **Automatic Prevention**: Layer 3 (process count validation) can prevent catastrophic process deletion automatically, without human intervention.
2. **Forensic Capability**: Layers 1 and 2 provide the data needed to determine root cause after an incident.
3. **Visibility**: Layers 4 and 5 make PM2 operations transparent in workflow logs.
4. **Fail-Safe Design**: Even if individual layers fail, other layers provide backup protection.
5. **Non-Breaking**: Safeguards are additive and do not change the existing filtering logic.
### Negative
1. **Increased Log Volume**: Additional logging increases workflow output size.
2. **Minor Performance Impact**: Extra PM2 commands add a few seconds to deployment time.
3. **Threshold Tuning**: The threshold of 3 may need adjustment if the expected process count changes.
### Neutral
1. **Root Cause Still Unknown**: These safeguards mitigate the risk but do not definitively explain why the original incident occurred.
2. **No Structural Changes**: The underlying architecture (shared PM2 daemon) remains unchanged.
## Alternatives Considered
### PM2 Namespaces
PM2 supports namespaces to isolate groups of processes. This would provide complete isolation but requires:
- Changes to ecosystem config files
- Changes to all PM2 commands in workflows
- Potential breaking changes to monitoring and log aggregation
**Decision**: Deferred for future consideration. Current safeguards provide adequate protection.
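For reference, a minimal sketch of what namespace-based isolation could look like (the `--namespace` flag is PM2's documented mechanism, but the entry points and namespace names below are illustrative, not this repository's actual config):

```bash
# Sketch only: start each environment under its own PM2 namespace.
pm2 start server/index.js --name flyer-crawler-api --namespace flyer-crawler-prod
pm2 start server/index.js --name flyer-crawler-api-test --namespace flyer-crawler-test

# Group commands can then target one namespace, leaving others untouched:
pm2 restart flyer-crawler-prod
pm2 delete flyer-crawler-test
```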
### Separate PM2 Daemons
Running a separate PM2 daemon per application would eliminate cross-application risk entirely.
**Decision**: Not implemented due to increased operational complexity and the current safeguards being sufficient.
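For illustration, separate daemons can be achieved with `PM2_HOME` (a sketch; the paths are hypothetical). Each distinct `PM2_HOME` runs its own daemon, so even a blanket command in one home cannot touch another application's processes:

```bash
# Flyer Crawler gets its own daemon ...
PM2_HOME=/var/www/flyer-crawler.projectium.com/.pm2 pm2 start ecosystem.config.cjs

# ... and Stock Alert gets another; this listing cannot see flyer-crawler processes.
PM2_HOME=/var/www/stock-alert.projectium.com/.pm2 pm2 list
```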
### Deployment Locks
Implementing mutex-style locks to prevent concurrent deployments could prevent race conditions.
**Decision**: Not implemented as the current safeguards address the identified risk. May be reconsidered if concurrent deployment issues are observed.
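A mutex of this kind could be as simple as `flock` around the deployment steps (a sketch; the lock path and timeout are illustrative):

```bash
# Acquire an exclusive lock before touching PM2; wait up to 10 minutes.
exec 9>/var/lock/flyer-crawler-deploy.lock
if ! flock --wait 600 9; then
  echo "Another deployment holds the lock; aborting." >&2
  exit 1
fi
# ... deployment and PM2 cleanup steps run here while fd 9 holds the lock ...
```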
## Implementation
### Files Modified
| File | Changes |
| ------------------------------------------ | ---------------------- |
| `.gitea/workflows/deploy-to-prod.yml` | All 5 safeguard layers |
| `.gitea/workflows/deploy-to-test.yml` | All 5 safeguard layers |
| `.gitea/workflows/manual-deploy-major.yml` | All 5 safeguard layers |
### Validation
A standalone test file validates the safeguard logic:
- **File**: `tests/qa/test-pm2-safeguard-logic.js`
- **Coverage**: 11 scenarios covering normal operations and dangerous edge cases
- **Result**: All tests pass
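As a quick illustration of what the test file exercises (a synthetic worst case, not the actual test code), the abort condition can be checked against a fake process list:

```bash
# Four fake processes, and a simulated filter bug that matches all of them:
# since 4 === 4 and 4 > 3, the safeguard must exit non-zero.
echo '[{"name":"a"},{"name":"b"},{"name":"c"},{"name":"d"}]' | node -e "
const list = JSON.parse(require('fs').readFileSync(0, 'utf-8'));
const targetProcesses = list; // simulate a filter bug matching everything
if (targetProcesses.length === list.length && list.length > 3) {
  console.error('SAFETY ABORT: Filter would delete ALL processes!');
  process.exit(1);
}
" || echo "Safeguard aborted as expected"
```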
## Related Documentation
- [Incident Report: 2026-02-17](../operations/INCIDENT-2026-02-17-PM2-PROCESS-KILL.md)
- [PM2 Incident Response Runbook](../operations/PM2-INCIDENT-RESPONSE.md)
- [Session Summary](../archive/sessions/PM2_SAFEGUARDS_SESSION_2026-02-17.md)
- [CLAUDE.md - PM2 Process Isolation](../../CLAUDE.md#pm2-process-isolation-productiontest-servers)
- [ADR-014: Containerization and Deployment Strategy](0014-containerization-and-deployment-strategy.md)
## References
- PM2 Documentation: https://pm2.keymetrics.io/docs/usage/application-declaration/
- Defense in Depth: https://en.wikipedia.org/wiki/Defense_in_depth_(computing)


@@ -56,6 +56,7 @@ This directory contains a log of the architectural decisions made for the Flyer
**[ADR-038](./0038-graceful-shutdown-pattern.md)**: Graceful Shutdown Pattern (Accepted)
**[ADR-053](./0053-worker-health-checks.md)**: Worker Health Checks and Stalled Job Monitoring (Accepted)
**[ADR-054](./0054-bugsink-gitea-issue-sync.md)**: Bugsink to Gitea Issue Synchronization (Proposed)
**[ADR-061](./0061-pm2-process-isolation-safeguards.md)**: PM2 Process Isolation Safeguards (Accepted)
## 7. Frontend / User Interface


@@ -0,0 +1,377 @@
# PM2 Process Isolation Safeguards Project
**Session Date**: 2026-02-17
**Status**: Completed
**Triggered By**: Critical production incident during v0.15.0 deployment
---
## Executive Summary
On 2026-02-17, a critical incident occurred during v0.15.0 production deployment where ALL PM2 processes on the production server were killed, not just the flyer-crawler processes. This caused unplanned downtime for multiple applications including `stock-alert.projectium.com`.
Despite PM2 process isolation fixes already being in place (commit `b6a62a0`), the incident still occurred. Investigation suggests the Gitea runner may have executed a cached/older version of the workflow files. In response, we implemented a comprehensive defense-in-depth strategy with 5 layers of safeguards across all deployment workflows.
---
## Incident Background
### What Happened
| Aspect | Detail |
| --------------------- | ------------------------------------------------------- |
| **Date/Time** | 2026-02-17 ~07:40 UTC |
| **Trigger** | v0.15.0 production deployment via `deploy-to-prod.yml` |
| **Impact** | ALL PM2 processes killed (all environments) |
| **Collateral Damage** | `stock-alert.projectium.com` and other PM2-managed apps |
| **Severity** | P1 - Critical |
### Key Mystery
The PM2 process isolation fix was already implemented in commit `b6a62a0` (2026-02-13) and was included in v0.15.0. The fix correctly used whitelist-based filtering:
```javascript
const prodProcesses = [
'flyer-crawler-api',
'flyer-crawler-worker',
'flyer-crawler-analytics-worker',
];
list.forEach((p) => {
if (
(p.pm2_env.status === 'errored' || p.pm2_env.status === 'stopped') &&
prodProcesses.includes(p.name)
) {
exec('pm2 delete ' + p.pm2_env.pm_id);
}
});
```
**Hypothesis**: Gitea runner executed a cached older version of the workflow file that did not contain the fix.
---
## Solution: Defense-in-Depth Safeguards
Rather than relying solely on the filter logic (which may be correct but not executed), we implemented 5 layers of safeguards that provide visibility, validation, and automatic abort capabilities.
### Safeguard Layers
| Layer | Name | Purpose |
| ----- | --------------------------------- | ------------------------------------------------------- |
| 1 | **Workflow Metadata Logging** | Audit trail of which workflow version actually executed |
| 2 | **Pre-Cleanup PM2 State Logging** | Capture full process list before any modifications |
| 3 | **Process Count Validation** | SAFETY ABORT if filter would delete ALL processes |
| 4 | **Explicit Name Verification** | Log exactly which processes will be affected |
| 5 | **Post-Cleanup Verification** | Verify environment isolation after cleanup |
### Layer Details
#### Layer 1: Workflow Metadata Logging
Logs at the start of deployment:
- Workflow file name
- SHA-256 hash of the workflow file
- Git commit being deployed
- Git branch
- Timestamp (UTC)
- Actor (who triggered the deployment)
**Purpose**: If an incident occurs, we can verify whether the executed workflow matches the repository version.
```bash
echo "=== WORKFLOW METADATA ==="
echo "Workflow file: deploy-to-prod.yml"
echo "Workflow file hash: $(sha256sum .gitea/workflows/deploy-to-prod.yml | cut -d' ' -f1)"
echo "Git commit: $(git rev-parse HEAD)"
echo "Timestamp: $(date -u '+%Y-%m-%d %H:%M:%S UTC')"
echo "Actor: ${{ gitea.actor }}"
echo "=== END METADATA ==="
```
#### Layer 2: Pre-Cleanup PM2 State Logging
Captures full PM2 process list in JSON format before any modifications.
**Purpose**: Provides forensic evidence of what processes existed before cleanup began.
```bash
echo "=== PRE-CLEANUP PM2 STATE ==="
pm2 jlist
echo "=== END PRE-CLEANUP STATE ==="
```
#### Layer 3: Process Count Validation (SAFETY ABORT)
This is the most critical safeguard: it aborts the entire deployment if the filter would delete ALL processes and there are more than 3 processes total.
**Purpose**: Catches filter bugs or unexpected conditions that would result in catastrophic process deletion.
```javascript
// SAFEGUARD 1: Process count validation
const totalProcesses = list.length;
if (targetProcesses.length === totalProcesses && totalProcesses > 3) {
console.error('SAFETY ABORT: Filter would delete ALL processes!');
console.error(
'Total processes: ' + totalProcesses + ', Target processes: ' + targetProcesses.length,
);
console.error('This indicates a potential filter bug. Aborting cleanup.');
process.exit(1);
}
```
**Threshold Rationale**: The threshold of 3 allows normal operation when only the 3 expected processes exist (API, Worker, Analytics Worker) while catching anomalies when the server hosts more applications.
#### Layer 4: Explicit Name Verification
Logs the exact name, status, and PM2 ID of each process that will be deleted.
**Purpose**: Provides clear visibility into what the cleanup operation will actually do.
```javascript
console.log('Found ' + targetProcesses.length + ' PRODUCTION processes to clean:');
targetProcesses.forEach((p) => {
console.log(
' - ' + p.name + ' (status: ' + p.pm2_env.status + ', pm_id: ' + p.pm2_env.pm_id + ')',
);
});
```
#### Layer 5: Post-Cleanup Verification
After cleanup, logs the state of processes by environment to verify isolation was maintained.
**Purpose**: Immediately identifies if the cleanup affected the wrong environment.
```bash
echo "=== POST-CLEANUP VERIFICATION ==="
pm2 jlist | node -e "
const list = JSON.parse(require('fs').readFileSync(0, 'utf-8'));
const prodProcesses = list.filter(p => p.name && p.name.startsWith('flyer-crawler-') && !p.name.endsWith('-test'));
const testProcesses = list.filter(p => p.name && p.name.endsWith('-test'));
console.log('Production processes after cleanup: ' + prodProcesses.length);
console.log('Test processes (should be untouched): ' + testProcesses.length);
"
echo "=== END POST-CLEANUP VERIFICATION ==="
```
---
## Implementation Details
### Files Modified
| File | Changes |
| ------------------------------------------ | --------------------------------------------- |
| `.gitea/workflows/deploy-to-prod.yml` | Added all 5 safeguard layers |
| `.gitea/workflows/deploy-to-test.yml` | Added all 5 safeguard layers |
| `.gitea/workflows/manual-deploy-major.yml` | Added all 5 safeguard layers |
| `CLAUDE.md` | Added PM2 Process Isolation Incidents section |
### Files Created
| File | Purpose |
| --------------------------------------------------------- | --------------------------------------- |
| `docs/operations/INCIDENT-2026-02-17-PM2-PROCESS-KILL.md` | Detailed incident report |
| `docs/operations/PM2-INCIDENT-RESPONSE.md` | Comprehensive incident response runbook |
| `tests/qa/test-pm2-safeguard-logic.js` | Validation tests for safeguard logic |
---
## Testing and Validation
### Test Artifact
A standalone JavaScript test file was created to validate the safeguard logic:
**File**: `tests/qa/test-pm2-safeguard-logic.js`
**Test Categories**:
1. **Normal Operations (should NOT abort)**
- 3 errored out of 15 processes
- 1 errored out of 10 processes
- 0 processes to clean
- Fresh server with 3 processes (threshold boundary)
2. **Dangerous Operations (SHOULD abort)**
- All 10 processes targeted
- All 15 processes targeted
- All 4 processes targeted (just above threshold)
3. **Workflow-Specific Filter Tests**
- Production filter only matches production processes
- Test filter only matches `-test` suffix processes
- Filters don't cross-contaminate environments
### Test Results
All 11 scenarios passed:
| Scenario | Total | Target | Expected | Result |
| -------------------------- | ----- | ------ | -------- | ------ |
| Normal prod cleanup | 15 | 3 | No abort | PASS |
| Normal test cleanup | 15 | 3 | No abort | PASS |
| Single process | 10 | 1 | No abort | PASS |
| No cleanup needed | 10 | 0 | No abort | PASS |
| Fresh server (threshold) | 3 | 3 | No abort | PASS |
| Minimal server | 2 | 2 | No abort | PASS |
| Empty PM2 | 0 | 0 | No abort | PASS |
| Filter bug - 10 processes | 10 | 10 | ABORT | PASS |
| Filter bug - 15 processes | 15 | 15 | ABORT | PASS |
| Filter bug - 4 processes | 4 | 4 | ABORT | PASS |
| Filter bug - 100 processes | 100 | 100 | ABORT | PASS |
### YAML Validation
All workflow files passed YAML syntax validation using `python -c "import yaml; yaml.safe_load(open(...))"`.
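Expanded to cover every workflow file, that check might look like the following (a sketch; assumes PyYAML is available on the runner):

```bash
# Validate each workflow's YAML syntax; report pass/fail per file.
for f in .gitea/workflows/*.yml; do
  python -c "import sys, yaml; yaml.safe_load(open(sys.argv[1]))" "$f" \
    && echo "OK: $f" || echo "INVALID: $f"
done
```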
---
## Documentation Updates
### CLAUDE.md Updates
Added new section at line 293: **PM2 Process Isolation Incidents**
Contains:
- Reference to the 2026-02-17 incident
- Impact summary
- Prevention measures list
- Response instructions
- Links to related documentation
### docs/README.md
Added incident report reference under **Operations > Incident Reports**.
### Cross-References Verified
| Document | Reference | Status |
| --------------- | --------------------------------------- | ------ |
| CLAUDE.md | PM2-INCIDENT-RESPONSE.md | Valid |
| CLAUDE.md | INCIDENT-2026-02-17-PM2-PROCESS-KILL.md | Valid |
| Incident Report | CLAUDE.md PM2 section | Valid |
| Incident Report | PM2-INCIDENT-RESPONSE.md | Valid |
| docs/README.md | INCIDENT-2026-02-17-PM2-PROCESS-KILL.md | Valid |
---
## Lessons Learned
### Technical Lessons
1. **Filter logic alone is not sufficient** - Even correct filters can be bypassed if an older version of the script is executed.
2. **Workflow caching is a real risk** - CI/CD runners may cache workflow files, leading to stale versions being executed.
3. **Defense-in-depth is essential for destructive operations** - Multiple layers of validation catch failures that single-point checks miss.
4. **Visibility enables diagnosis** - Pre/post state logging makes root cause analysis possible.
5. **Automatic abort prevents cascading failures** - The process count validation could have prevented the incident entirely.
### Process Lessons
1. **Shared PM2 daemons are risky** - Multiple applications sharing a PM2 daemon create cross-application dependencies.
2. **Documentation should include failure modes** - CLAUDE.md now explicitly documents what can go wrong and how to respond.
3. **Runbooks save time during incidents** - The incident response runbook provides step-by-step guidance when time is critical.
---
## Future Considerations
### Not Implemented (Potential Future Work)
1. **PM2 Namespacing** - Use PM2's native namespace feature to completely isolate environments.
2. **Separate PM2 Daemons** - Run one PM2 daemon per application to eliminate cross-application risk.
3. **Deployment Locks** - Implement mutex-style locks to prevent concurrent deployments.
4. **Workflow Version Verification** - Add a pre-flight check that compares workflow hash against expected value (see the sketch after this list).
5. **Automated Rollback** - Implement automatic process restoration if safeguards detect a problem.
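For item 4, a pre-flight check might look like the following (a sketch; the pinned hash would have to be stored out-of-band, e.g. in a protected CI variable, for the comparison to be meaningful):

```bash
# Abort the deployment if the workflow file on disk does not match the pinned hash.
EXPECTED_HASH="<pinned sha256 of deploy-to-prod.yml>"  # hypothetical, set out-of-band
ACTUAL_HASH=$(sha256sum .gitea/workflows/deploy-to-prod.yml | cut -d' ' -f1)
if [ "$ACTUAL_HASH" != "$EXPECTED_HASH" ]; then
  echo "ABORT: workflow hash mismatch - possible stale/cached workflow" >&2
  exit 1
fi
```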
---
## Related Documentation
- **ADR-061**: [PM2 Process Isolation Safeguards](../../adr/0061-pm2-process-isolation-safeguards.md)
- **Incident Report**: [INCIDENT-2026-02-17-PM2-PROCESS-KILL.md](../../operations/INCIDENT-2026-02-17-PM2-PROCESS-KILL.md)
- **Response Runbook**: [PM2-INCIDENT-RESPONSE.md](../../operations/PM2-INCIDENT-RESPONSE.md)
- **CLAUDE.md Section**: [PM2 Process Isolation Incidents](../../../CLAUDE.md#pm2-process-isolation-incidents)
- **Test Artifact**: [test-pm2-safeguard-logic.js](../../../tests/qa/test-pm2-safeguard-logic.js)
- **ADR-014**: [Containerization and Deployment Strategy](../../adr/0014-containerization-and-deployment-strategy.md)
---
## Appendix: Workflow Changes Summary
### deploy-to-prod.yml
```diff
+ - name: Log Workflow Metadata
+ run: |
+ echo "=== WORKFLOW METADATA ==="
+ echo "Workflow file: deploy-to-prod.yml"
+ echo "Workflow file hash: $(sha256sum .gitea/workflows/deploy-to-prod.yml | cut -d' ' -f1)"
+ ...
- name: Install Backend Dependencies and Restart Production Server
run: |
+ # === PRE-CLEANUP PM2 STATE LOGGING ===
+ echo "=== PRE-CLEANUP PM2 STATE ==="
+ pm2 jlist
+ echo "=== END PRE-CLEANUP STATE ==="
+
# --- Cleanup Errored Processes with Defense-in-Depth Safeguards ---
node -e "
...
+ // SAFEGUARD 1: Process count validation
+ if (targetProcesses.length === totalProcesses && totalProcesses > 3) {
+ console.error('SAFETY ABORT: Filter would delete ALL processes!');
+ process.exit(1);
+ }
+
+ // SAFEGUARD 2: Explicit name verification
+ console.log('Found ' + targetProcesses.length + ' PRODUCTION processes to clean:');
+ targetProcesses.forEach(p => {
+ console.log(' - ' + p.name + ' (status: ' + p.pm2_env.status + ')');
+ });
...
"
+
+ # === POST-CLEANUP VERIFICATION ===
+ echo "=== POST-CLEANUP VERIFICATION ==="
+ pm2 jlist | node -e "..."
+ echo "=== END POST-CLEANUP VERIFICATION ==="
```
Similar changes were applied to `deploy-to-test.yml` and `manual-deploy-major.yml`.
---
## Session Participants
| Role | Agent Type | Responsibility |
| ------------ | ------------------------- | ------------------------------------- |
| Orchestrator | Main Claude | Session coordination and delegation |
| Planner | planner subagent | Incident analysis and solution design |
| Documenter | describer-for-ai subagent | Incident report creation |
| Coder #1 | coder subagent | Workflow safeguard implementation |
| Coder #2 | coder subagent | Incident response runbook creation |
| Coder #3 | coder subagent | CLAUDE.md updates |
| Tester | tester subagent | Comprehensive validation |
| Archivist | Lead Technical Archivist | Final documentation |
---
## Revision History
| Date | Author | Change |
| ---------- | ------------------------ | ----------------------- |
| 2026-02-17 | Lead Technical Archivist | Initial session summary |


@@ -0,0 +1,269 @@
# Incident Report: PM2 Process Kill During v0.15.0 Deployment
**Date**: 2026-02-17
**Severity**: Critical
**Status**: Mitigated - Safeguards Implemented
**Affected Systems**: All PM2-managed applications on projectium.com server
---
## Resolution Summary
**Safeguards implemented on 2026-02-17** to prevent recurrence:
1. Workflow metadata logging (audit trail)
2. Pre-cleanup PM2 state logging (forensics)
3. Process count validation with SAFETY ABORT (automatic prevention)
4. Explicit name verification (visibility)
5. Post-cleanup verification (environment isolation check)
**Documentation created**:
- [PM2 Incident Response Runbook](PM2-INCIDENT-RESPONSE.md)
- [PM2 Safeguards Session Summary](../archive/sessions/PM2_SAFEGUARDS_SESSION_2026-02-17.md)
- CLAUDE.md updated with [PM2 Process Isolation Incidents section](../../CLAUDE.md#pm2-process-isolation-incidents)
---
## Summary
During v0.15.0 production deployment, ALL PM2 processes on the server were terminated, not just flyer-crawler processes. This caused unplanned downtime for other applications including stock-alert.
## Timeline
| Time (Approx) | Event |
| --------------------- | ---------------------------------------------------------------- |
| 2026-02-17 ~07:40 UTC | v0.15.0 production deployment triggered via `deploy-to-prod.yml` |
| Unknown | All PM2 processes killed (flyer-crawler AND other apps) |
| Unknown | Incident discovered - stock-alert down |
| 2026-02-17 | Investigation initiated |
| 2026-02-17 | Defense-in-depth safeguards implemented in all workflows |
| 2026-02-17 | Incident response runbook created |
| 2026-02-17 | Status changed to Mitigated |
## Impact
- **Affected Applications**: All PM2-managed processes on projectium.com
- flyer-crawler-api, flyer-crawler-worker, flyer-crawler-analytics-worker (expected)
- stock-alert (NOT expected - collateral damage)
- Potentially other unidentified applications
- **Downtime Duration**: TBD
- **User Impact**: Service unavailability for all affected applications
---
## Investigation Findings
### Deployment Workflow Analysis
All deployment workflows were reviewed for PM2 process isolation:
| Workflow | PM2 Isolation | Implementation |
| ------------------------- | -------------- | ------------------------------------------------------------------------------------------------- |
| `deploy-to-prod.yml` | Whitelist | `prodProcesses = ['flyer-crawler-api', 'flyer-crawler-worker', 'flyer-crawler-analytics-worker']` |
| `deploy-to-test.yml` | Pattern | `p.name.endsWith('-test')` |
| `manual-deploy-major.yml` | Whitelist | Same as deploy-to-prod |
| `manual-db-restore.yml` | Explicit names | `pm2 stop flyer-crawler-api flyer-crawler-worker flyer-crawler-analytics-worker` |
### Fix Commit Already In Place
The PM2 process isolation fix was implemented in commit `b6a62a0` (2026-02-13):
```
commit b6a62a036f39ac895271402a61e5cc4227369de7
Author: Torben Sorensen <torben.sorensen@gmail.com>
Date: Fri Feb 13 10:19:28 2026 -0800
be specific about pm2 processes
Files modified:
.gitea/workflows/deploy-to-prod.yml
.gitea/workflows/deploy-to-test.yml
.gitea/workflows/manual-db-restore.yml
.gitea/workflows/manual-deploy-major.yml
CLAUDE.md
```
### v0.15.0 Release Contains Fix
Confirmed: v0.15.0 (commit `93ad624`, 2026-02-18) includes the fix commit:
```
93ad624 ci: Bump version to 0.15.0 for production release [skip ci]
...
b6a62a0 be specific about pm2 processes <-- Fix commit included
```
### Current Workflow PM2 Commands
**Production Deploy (`deploy-to-prod.yml` line 170)**:
```javascript
const prodProcesses = [
'flyer-crawler-api',
'flyer-crawler-worker',
'flyer-crawler-analytics-worker',
];
list.forEach((p) => {
if (
(p.pm2_env.status === 'errored' || p.pm2_env.status === 'stopped') &&
prodProcesses.includes(p.name)
) {
exec('pm2 delete ' + p.pm2_env.pm_id);
}
});
```
**Test Deploy (`deploy-to-test.yml` line 100)**:
```javascript
list.forEach((p) => {
if (p.name && p.name.endsWith('-test')) {
exec('pm2 delete ' + p.pm2_env.pm_id);
}
});
```
Both implementations have proper name filtering and should NOT affect non-flyer-crawler processes.
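A quick local sanity check supports this (a sketch using a mock `pm2 jlist` payload; the `errored`/`stopped` status conditions are omitted for brevity):

```bash
# Feed a mixed mock process list through both filters; neither should match
# the other environment's or application's processes.
echo '[{"name":"flyer-crawler-api"},{"name":"flyer-crawler-api-test"},{"name":"stock-alert-api"}]' | node -e "
const list = JSON.parse(require('fs').readFileSync(0, 'utf-8'));
const prodProcesses = ['flyer-crawler-api', 'flyer-crawler-worker', 'flyer-crawler-analytics-worker'];
console.log('prod filter matches:', list.filter(p => prodProcesses.includes(p.name)).map(p => p.name));
console.log('test filter matches:', list.filter(p => p.name && p.name.endsWith('-test')).map(p => p.name));
"
# Expected output:
#   prod filter matches: [ 'flyer-crawler-api' ]
#   test filter matches: [ 'flyer-crawler-api-test' ]
```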
---
## Discrepancy Analysis
### Key Mystery
**If the fixes are in place, why did ALL processes get killed?**
### Possible Explanations
#### 1. Workflow Version Mismatch (HIGH PROBABILITY)
**Hypothesis**: Gitea runner cached an older version of the workflow file.
- Gitea Actions may cache workflow definitions
- The runner might have executed an older version without the fix
- Need to verify: What version of `deploy-to-prod.yml` actually executed?
**Investigation Required**:
- Check Gitea workflow execution logs for actual script content
- Verify runner workflow caching behavior
- Compare executed workflow vs repository version
#### 2. Concurrent Workflow Execution (MEDIUM PROBABILITY)
**Hypothesis**: Another workflow ran simultaneously with destructive PM2 commands.
Workflows with potential issues:
- `manual-db-reset-prod.yml` - Does NOT restart PM2 (schema reset only)
- `manual-redis-flush-prod.yml` - Does NOT touch PM2
- Test deployment concurrent with prod deployment
**Investigation Required**:
- Check Gitea Actions history for concurrent workflow runs
- Review timestamps of all workflow executions on 2026-02-17
#### 3. Manual SSH Command (MEDIUM PROBABILITY)
**Hypothesis**: Someone SSH'd to the server and ran `pm2 stop all` or `pm2 delete all` manually.
**Investigation Required**:
- Check server shell history (if available)
- Review any maintenance windows or manual interventions
- Ask team members about manual actions
#### 4. PM2 Internal Issue (LOW PROBABILITY)
**Hypothesis**: PM2 daemon crash or corruption caused all processes to stop.
**Investigation Required**:
- Check PM2 daemon logs on server
- Look for OOM killer events in system logs
- Check disk space issues during deployment
#### 5. Script Execution Error (LOW PROBABILITY)
**Hypothesis**: JavaScript parsing error caused the filtering logic to be bypassed.
**Investigation Required**:
- Review workflow execution logs for JavaScript errors
- Test the inline Node.js scripts locally
- Check for shell escaping issues
---
## Documentation/Code Gaps Identified
### CLAUDE.md Documentation
The PM2 isolation rules are documented in `CLAUDE.md`, but:
- Documentation uses `pm2 restart all` in the Quick Reference table (for dev container - acceptable)
- Multiple docs still reference `pm2 restart all` without environment context
- No incident response runbook for PM2 issues
### Workflow Gaps
1. **No Workflow Audit Trail**: No logging of which exact workflow version executed
2. **No Pre-deployment Verification**: Workflows don't log PM2 state before modifications
3. **No Cross-Application Impact Assessment**: No mechanism to detect/warn about other apps
---
## Next Steps for Root Cause Analysis
### Immediate (Priority 1)
1. [ ] Retrieve Gitea Actions execution logs for v0.15.0 deployment
2. [ ] Extract actual executed workflow content from logs
3. [ ] Check for concurrent workflow executions on 2026-02-17
4. [ ] Review server PM2 daemon logs around incident time
### Short-term (Priority 2)
5. [ ] Implement pre-deployment PM2 state logging in workflows
6. [ ] Add workflow version hash logging for audit trail
7. [ ] Create incident response runbook for PM2/deployment issues
### Long-term (Priority 3)
8. [ ] Evaluate PM2 namespacing for complete process isolation
9. [ ] Consider separate PM2 daemon per application
10. [ ] Implement deployment monitoring/alerting
---
## Related Documentation
- [CLAUDE.md - PM2 Process Isolation](../../../CLAUDE.md) (Critical Rules section)
- [ADR-014: Containerization and Deployment Strategy](../adr/0014-containerization-and-deployment-strategy.md)
- [Deployment Guide](./DEPLOYMENT.md)
- Workflow files in `.gitea/workflows/`
---
## Appendix: Commit Timeline
```
93ad624 ci: Bump version to 0.15.0 for production release [skip ci] <-- v0.15.0 release
7dd4f21 ci: Bump version to 0.14.4 [skip ci]
174b637 even more typescript fixes
4f80baf ci: Bump version to 0.14.3 [skip ci]
8450b5e Generate TSOA Spec and Routes
e4d830a ci: Bump version to 0.14.2 [skip ci]
b6a62a0 be specific about pm2 processes <-- PM2 fix commit
2d2cd52 Massive Dependency Modernization Project
```
---
## Revision History
| Date | Author | Change |
| ---------- | ------------------ | ----------------------- |
| 2026-02-17 | Investigation Team | Initial incident report |


@@ -0,0 +1,818 @@
# PM2 Incident Response Runbook
**Purpose**: Step-by-step procedures for responding to PM2 process isolation incidents on the projectium.com server.
**Audience**: On-call responders, system administrators, developers with server access.
**Last updated**: 2026-02-17
**Related documentation**:
- [CLAUDE.md - PM2 Process Isolation Rules](../../CLAUDE.md)
- [Incident Report: 2026-02-17](INCIDENT-2026-02-17-PM2-PROCESS-KILL.md)
- [Monitoring Guide](MONITORING.md)
- [Deployment Guide](DEPLOYMENT.md)
---
## Table of Contents
1. [Quick Reference](#quick-reference)
2. [Detection](#detection)
3. [Initial Assessment](#initial-assessment)
4. [Immediate Response](#immediate-response)
5. [Process Restoration](#process-restoration)
6. [Root Cause Investigation](#root-cause-investigation)
7. [Communication Templates](#communication-templates)
8. [Prevention Measures](#prevention-measures)
9. [Contact Information](#contact-information)
10. [Post-Incident Review](#post-incident-review)
---
## Quick Reference
### PM2 Process Inventory
| Application | Environment | Process Names | Config File | Directory |
| ------------- | ----------- | -------------------------------------------------------------------------------------------- | --------------------------- | -------------------------------------------- |
| Flyer Crawler | Production | `flyer-crawler-api`, `flyer-crawler-worker`, `flyer-crawler-analytics-worker` | `ecosystem.config.cjs` | `/var/www/flyer-crawler.projectium.com` |
| Flyer Crawler | Test | `flyer-crawler-api-test`, `flyer-crawler-worker-test`, `flyer-crawler-analytics-worker-test` | `ecosystem-test.config.cjs` | `/var/www/flyer-crawler-test.projectium.com` |
| Stock Alert | Production | `stock-alert-*` | (varies) | `/var/www/stock-alert.projectium.com` |
### Critical Commands
```bash
# Check PM2 status
pm2 list
# Check specific process
pm2 show flyer-crawler-api
# View recent logs
pm2 logs --lines 50
# Restart specific processes (SAFE)
pm2 restart flyer-crawler-api flyer-crawler-worker flyer-crawler-analytics-worker
# DO NOT USE (affects ALL apps)
# pm2 restart all <-- DANGEROUS
# pm2 stop all <-- DANGEROUS
# pm2 delete all <-- DANGEROUS
```
### Severity Classification
| Severity | Criteria | Response Time | Example |
| ----------------- | --------------------------------------------- | ------------------- | ----------------------------------------------- |
| **P1 - Critical** | Multiple applications down, production impact | Immediate (< 5 min) | All PM2 processes killed |
| **P2 - High** | Single application down, production impact | < 15 min | Flyer Crawler prod down, Stock Alert unaffected |
| **P3 - Medium** | Test environment only, no production impact | < 1 hour | Test processes killed, production unaffected |
---
## Detection
### How to Identify a PM2 Incident
**Automated Indicators**:
- Health check failures on `/api/health/ready`
- Monitoring alerts (UptimeRobot, etc.)
- Bugsink showing connection errors
- NGINX returning 502 Bad Gateway
**User-Reported Symptoms**:
- "The site is down"
- "I can't log in"
- "Pages are loading slowly then timing out"
- "I see a 502 error"
**Manual Discovery**:
```bash
# SSH to server
ssh gitea-runner@projectium.com
# Check if PM2 is running
pm2 list
# Expected output shows processes
# If empty or all errored = incident
```
### Incident Signature: Process Isolation Violation
When a PM2 incident is caused by process isolation failure, you will see:
```text
# Expected state (normal):
+-----------------------------------+----+-----+---------+-------+
| App name | id |mode | status | cpu |
+-----------------------------------+----+-----+---------+-------+
| flyer-crawler-api | 0 |clust| online | 0% |
| flyer-crawler-worker | 1 |fork | online | 0% |
| flyer-crawler-analytics-worker | 2 |fork | online | 0% |
| flyer-crawler-api-test | 3 |fork | online | 0% |
| flyer-crawler-worker-test | 4 |fork | online | 0% |
| flyer-crawler-analytics-worker-test| 5 |fork | online | 0% |
| stock-alert-api | 6 |fork | online | 0% |
+-----------------------------------+----+-----+---------+-------+
# Incident state (isolation violation):
# All processes missing or errored - not just one app
+-----------------------------------+----+-----+---------+-------+
| App name | id |mode | status | cpu |
+-----------------------------------+----+-----+---------+-------+
# (empty or all processes errored/stopped)
+-----------------------------------+----+-----+---------+-------+
```
---
## Initial Assessment
### Step 1: Gather Information (2 minutes)
Run these commands and capture output:
```bash
# 1. Check PM2 status
pm2 list
# 2. Check PM2 daemon status
pm2 ping
# 3. Check recent PM2 logs
pm2 logs --lines 20 --nostream
# 4. Check system status
systemctl status pm2-gitea-runner --no-pager
# 5. Check disk space
df -h /
# 6. Check memory
free -h
# 7. Check recent deployments (in app directory)
cd /var/www/flyer-crawler.projectium.com
git log --oneline -5
```
### Step 2: Determine Scope
| Question | Command | Impact Level |
| ------------------------ | ---------------------------------------------------------------- | ------------------------------- |
| How many apps affected? | `pm2 list` | Count missing/errored processes |
| Is production down? | `curl https://flyer-crawler.projectium.com/api/health/ping` | Yes/No |
| Is test down? | `curl https://flyer-crawler-test.projectium.com/api/health/ping` | Yes/No |
| Are other apps affected? | `pm2 list \| grep stock-alert` | Yes/No |
### Step 3: Classify Severity
```text
Decision Tree:
Production app(s) down?
|
+-- YES: Multiple apps affected?
| |
| +-- YES --> P1 CRITICAL (all apps down)
| |
| +-- NO --> P2 HIGH (single app down)
|
+-- NO: Test environment only?
|
+-- YES --> P3 MEDIUM
|
+-- NO --> Investigate further
```
### Step 4: Document Initial State
Capture this information before making any changes:
```bash
# Save PM2 state to file
pm2 jlist > /tmp/pm2-incident-$(date +%Y%m%d-%H%M%S).json
# Save system state
{
echo "=== PM2 List ==="
pm2 list
echo ""
echo "=== Disk Space ==="
df -h
echo ""
echo "=== Memory ==="
free -h
echo ""
echo "=== Recent Git Commits ==="
(cd /var/www/flyer-crawler.projectium.com && git log --oneline -5)
} > /tmp/incident-state-$(date +%Y%m%d-%H%M%S).txt
```
---
## Immediate Response
### Priority 1: Stop Ongoing Deployments
If a deployment is currently running:
1. Check Gitea Actions for running workflows
2. Cancel any in-progress deployment workflows
3. Do NOT start new deployments until incident resolved
### Priority 2: Assess Which Processes Are Down
```bash
# Get list of processes and their status
pm2 list
# Check which processes exist but are errored/stopped
pm2 jlist | jq '.[] | {name, status: .pm2_env.status}'
```
### Priority 3: Establish Order of Restoration
Restore in this order (production first, critical path first):
| Priority | Process | Rationale |
| -------- | ------------------------------------- | ------------------------------------ |
| 1 | `flyer-crawler-api` | Production API - highest user impact |
| 2 | `flyer-crawler-worker` | Production background jobs |
| 3 | `flyer-crawler-analytics-worker` | Production analytics |
| 4 | `stock-alert-*` | Other production apps |
| 5 | `flyer-crawler-api-test` | Test environment |
| 6 | `flyer-crawler-worker-test` | Test background jobs |
| 7 | `flyer-crawler-analytics-worker-test` | Test analytics |
---
## Process Restoration
### Scenario A: Flyer Crawler Production Processes Missing
```bash
# Navigate to production directory
cd /var/www/flyer-crawler.projectium.com
# Start production processes
pm2 start ecosystem.config.cjs
# Verify processes started
pm2 list
# Check health endpoint
curl -s http://localhost:3001/api/health/ready | jq .
```
### Scenario B: Flyer Crawler Test Processes Missing
```bash
# Navigate to test directory
cd /var/www/flyer-crawler-test.projectium.com
# Start test processes
pm2 start ecosystem-test.config.cjs
# Verify processes started
pm2 list
# Check health endpoint
curl -s http://localhost:3002/api/health/ready | jq .
```
### Scenario C: Stock Alert Processes Missing
```bash
# Navigate to stock-alert directory
cd /var/www/stock-alert.projectium.com
# Start processes (adjust config file name as needed)
pm2 start ecosystem.config.cjs
# Verify processes started
pm2 list
```
### Scenario D: All Processes Missing
Execute restoration in priority order:
```bash
# 1. Flyer Crawler Production (highest priority)
cd /var/www/flyer-crawler.projectium.com
pm2 start ecosystem.config.cjs
# Verify production is healthy before continuing
curl -s http://localhost:3001/api/health/ready | jq '.data.status'
# Should return "healthy"
# 2. Stock Alert Production
cd /var/www/stock-alert.projectium.com
pm2 start ecosystem.config.cjs
# 3. Flyer Crawler Test (lower priority)
cd /var/www/flyer-crawler-test.projectium.com
pm2 start ecosystem-test.config.cjs
# 4. Save PM2 process list
pm2 save
# 5. Final verification
pm2 list
```
### Health Check Verification
After restoration, verify each application:
**Flyer Crawler Production**:
```bash
# API health
curl -s https://flyer-crawler.projectium.com/api/health/ready | jq '.data.status'
# Expected: "healthy"
# Check all services
curl -s https://flyer-crawler.projectium.com/api/health/ready | jq '.data.services'
```
**Flyer Crawler Test**:
```bash
curl -s https://flyer-crawler-test.projectium.com/api/health/ready | jq '.data.status'
```
**Stock Alert**:
```bash
# Adjust URL as appropriate for stock-alert
curl -s https://stock-alert.projectium.com/api/health/ready | jq '.data.status'
```
### Verification Checklist
After restoration, confirm:
- [ ] `pm2 list` shows all expected processes as `online`
- [ ] Production health check returns `healthy`
- [ ] Test health check returns `healthy` (if applicable)
- [ ] No processes showing high restart count
- [ ] No processes showing `errored` or `stopped` status
- [ ] PM2 process list saved: `pm2 save`
---
## Root Cause Investigation
### Step 1: Check Workflow Execution Logs
```bash
# Find recent Gitea Actions runs
# (Access via Gitea web UI: Repository > Actions > Recent Runs)
# Look for these workflows:
# - deploy-to-prod.yml
# - deploy-to-test.yml
# - manual-deploy-major.yml
# - manual-db-restore.yml
```
### Step 2: Check PM2 Daemon Logs
```bash
# PM2 daemon logs
tail -100 ~/.pm2/pm2.log
# PM2 process-specific logs
ls -la ~/.pm2/logs/
# Recent API logs
tail -100 ~/.pm2/logs/flyer-crawler-api-out.log
tail -100 ~/.pm2/logs/flyer-crawler-api-error.log
```
### Step 3: Check System Logs
```bash
# System journal for PM2 service
journalctl -u pm2-gitea-runner -n 100 --no-pager
# Kernel messages (OOM killer, etc.)
journalctl -k -n 50 --no-pager | grep -i "killed\|oom\|memory"
# Authentication logs (unauthorized access)
tail -50 /var/log/auth.log
```
### Step 4: Git History Analysis
```bash
# Recent commits to deployment workflows
cd /var/www/flyer-crawler.projectium.com
git log --oneline -20 -- .gitea/workflows/
# Check what changed in PM2 configs
git log --oneline -10 -- ecosystem.config.cjs ecosystem-test.config.cjs
# Diff against last known good state
git diff <last-good-commit> -- .gitea/workflows/ ecosystem*.cjs
```
### Step 5: Timing Correlation
Create a timeline:
```text
| Time (UTC) | Event | Source |
|------------|-------|--------|
| XX:XX | Last successful health check | Monitoring |
| XX:XX | Deployment workflow started | Gitea Actions |
| XX:XX | First failed health check | Monitoring |
| XX:XX | Incident detected | User report / Alert |
| XX:XX | Investigation started | On-call |
```
### Common Root Causes
| Root Cause | Evidence | Prevention |
| ---------------------------- | -------------------------------------- | ---------------------------- |
| `pm2 stop all` in workflow | Workflow logs show "all" command | Use explicit process names |
| `pm2 delete all` in workflow | Empty PM2 list after deploy | Use whitelist-based deletion |
| OOM killer | `journalctl -k` shows "Killed process" | Increase memory limits |
| Disk space exhaustion | `df -h` shows 100% | Log rotation, cleanup |
| Manual intervention | Shell history shows pm2 commands | Document all manual actions |
| Concurrent deployments | Multiple workflows at same time | Implement deployment locks |
| Workflow caching issue | Old workflow version executed | Force workflow refresh |
---
## Communication Templates
### Incident Notification (Internal)
```text
Subject: [P1 INCIDENT] PM2 Process Isolation Failure - Multiple Apps Down
Status: INVESTIGATING
Time Detected: YYYY-MM-DD HH:MM UTC
Affected Systems: [flyer-crawler-prod, stock-alert-prod, ...]
Summary:
All PM2 processes on projectium.com server were terminated unexpectedly.
Multiple production applications are currently down.
Impact:
- flyer-crawler.projectium.com: DOWN
- stock-alert.projectium.com: DOWN
- [other affected apps]
Current Actions:
- Restoring critical production processes
- Investigating root cause
Next Update: In 15 minutes or upon status change
Incident Commander: [Name]
```
### Status Update Template
```text
Subject: [P1 INCIDENT] PM2 Process Isolation Failure - UPDATE #N
Status: [INVESTIGATING | IDENTIFIED | RESTORING | RESOLVED]
Time: YYYY-MM-DD HH:MM UTC
Progress Since Last Update:
- [Action taken]
- [Discovery made]
- [Process restored]
Current State:
- flyer-crawler.projectium.com: [UP|DOWN]
- stock-alert.projectium.com: [UP|DOWN]
Root Cause: [If identified]
Next Steps:
- [Planned action]
ETA to Resolution: [If known]
Next Update: In [X] minutes
```
### Resolution Notification
```text
Subject: [RESOLVED] PM2 Process Isolation Failure
Status: RESOLVED
Time Resolved: YYYY-MM-DD HH:MM UTC
Total Downtime: X minutes
Summary:
All PM2 processes have been restored. Services are operating normally.
Root Cause:
[Brief description of what caused the incident]
Impact Summary:
- flyer-crawler.projectium.com: Down for X minutes
- stock-alert.projectium.com: Down for X minutes
- Estimated user impact: [description]
Immediate Actions Taken:
1. [Action]
2. [Action]
Follow-up Actions:
1. [ ] [Preventive measure] - Owner: [Name] - Due: [Date]
2. [ ] Post-incident review scheduled for [Date]
Post-Incident Review: [Link or scheduled time]
```
---
## Prevention Measures
### Pre-Deployment Checklist
Before triggering any deployment:
- [ ] Review workflow file for PM2 commands
- [ ] Confirm no `pm2 stop all`, `pm2 delete all`, or `pm2 restart all` (see the scan sketch after this list)
- [ ] Verify process names are explicitly listed
- [ ] Check for concurrent deployment risks
- [ ] Confirm recent workflow changes were reviewed
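The first two items can be partially automated with a simple scan from the repository root (a sketch; it will also flag commented-out occurrences, which still deserve a look):

```bash
# Fail if any workflow contains a blanket PM2 command.
if grep -rnE "pm2 (stop|restart|delete) all" .gitea/workflows/; then
  echo "FAIL: blanket PM2 command found in a workflow" >&2
  exit 1
fi
echo "OK: no blanket PM2 commands in workflows"
```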
### Workflow Review Checklist
When reviewing deployment workflow changes:
- [ ] All PM2 `stop` commands use explicit process names
- [ ] All PM2 `delete` commands filter by process name pattern
- [ ] All PM2 `restart` commands use explicit process names
- [ ] Test deployments filter by `-test` suffix
- [ ] Production deployments use whitelist array
**Safe Patterns**:
```javascript
// SAFE: Explicit process names (production)
const prodProcesses = [
'flyer-crawler-api',
'flyer-crawler-worker',
'flyer-crawler-analytics-worker',
];
list.forEach((p) => {
if (
(p.pm2_env.status === 'errored' || p.pm2_env.status === 'stopped') &&
prodProcesses.includes(p.name)
) {
exec('pm2 delete ' + p.pm2_env.pm_id);
}
});
// SAFE: Pattern-based filtering (test)
list.forEach((p) => {
if (p.name && p.name.endsWith('-test')) {
exec('pm2 delete ' + p.pm2_env.pm_id);
}
});
```
**Dangerous Patterns** (NEVER USE):
```bash
# DANGEROUS - affects ALL applications
pm2 stop all
pm2 delete all
pm2 restart all
# DANGEROUS - no name filtering
pm2 delete $(pm2 jlist | jq -r '.[] | select(.pm2_env.status == "errored") | .pm_id')
```
### PM2 Configuration Validation
Before deploying PM2 config changes:
```bash
# Test configuration locally
cd /var/www/flyer-crawler.projectium.com
node -e "console.log(JSON.stringify(require('./ecosystem.config.cjs'), null, 2))"
# Verify process names
node -e "require('./ecosystem.config.cjs').apps.forEach(a => console.log(a.name))"
# Expected output should match documented process names
```
### Deployment Monitoring
After every deployment:
```bash
# Immediate verification
pm2 list
# Check no unexpected processes were affected
pm2 list | grep -v flyer-crawler
# Should still show other apps (e.g., stock-alert)
# Health check
curl -s https://flyer-crawler.projectium.com/api/health/ready | jq '.data.status'
```
---
## Contact Information
### On-Call Escalation
| Role | Contact | When to Escalate |
| ----------------- | -------------- | ----------------------------------- |
| Primary On-Call | [Name/Channel] | First responder |
| Secondary On-Call | [Name/Channel] | If primary unavailable after 10 min |
| Engineering Lead | [Name/Channel] | P1 incidents > 30 min |
| Product Owner | [Name/Channel] | User communication needed |
### External Dependencies
| Service | Support Channel | When to Contact |
| --------------- | --------------- | ----------------------- |
| Server Provider | [Contact info] | Hardware/network issues |
| DNS Provider | [Contact info] | DNS resolution failures |
| SSL Certificate | [Contact info] | Certificate issues |
### Communication Channels
| Channel | Purpose |
| -------------- | -------------------------- |
| `#incidents` | Real-time incident updates |
| `#deployments` | Deployment announcements |
| `#engineering` | Technical discussion |
| Email list | Formal notifications |
---
## Post-Incident Review
### Incident Report Template
```markdown
# Incident Report: [Title]
## Overview
| Field | Value |
| ------------------ | ----------------- |
| Date | YYYY-MM-DD |
| Duration | X hours Y minutes |
| Severity | P1/P2/P3 |
| Incident Commander | [Name] |
| Status | Resolved |
## Timeline
| Time (UTC) | Event |
| ---------- | ------------------- |
| HH:MM | [Event description] |
| HH:MM | [Event description] |
## Impact
- **Users affected**: [Number/description]
- **Revenue impact**: [If applicable]
- **SLA impact**: [If applicable]
## Root Cause
[Detailed technical explanation]
## Resolution
[What was done to resolve the incident]
## Contributing Factors
1. [Factor]
2. [Factor]
## Action Items
| Action | Owner | Due Date | Status |
| -------- | ------ | -------- | ------ |
| [Action] | [Name] | [Date] | [ ] |
## Lessons Learned
### What Went Well
- [Item]
### What Could Be Improved
- [Item]
## Appendix
- Link to monitoring data
- Link to relevant logs
- Link to workflow runs
```
### Lessons Learned Format
Use "5 Whys" technique:
```text
Problem: All PM2 processes were killed during deployment
Why 1: The deployment workflow ran `pm2 delete all`
Why 2: The workflow used an outdated version of the script
Why 3: Gitea runner cached the old workflow file
Why 4: No mechanism to verify workflow version before execution
Why 5: Workflow versioning and audit trail not implemented
Root Cause: Lack of workflow versioning and execution verification
Preventive Measure: Implement workflow hash logging and pre-execution verification
```
### Action Items Tracking
Create Gitea issues for each action item:
```bash
# Example shown with GitHub's gh CLI syntax for illustration; for Gitea,
# the tea CLI or the issues API would be the actual mechanism
gh issue create --title "Implement PM2 state logging in deployment workflows" \
--body "Related to incident YYYY-MM-DD. Add pre-deployment PM2 state capture." \
--label "incident-follow-up,priority:high"
```
Track action items in a central location:
| Issue # | Action | Owner | Due | Status |
| ------- | -------------------------------- | ------ | ------ | ------ |
| #123 | Add PM2 state logging | [Name] | [Date] | Open |
| #124 | Implement workflow version hash | [Name] | [Date] | Open |
| #125 | Create deployment lock mechanism | [Name] | [Date] | Open |
---
## Appendix: PM2 Command Reference
### Safe Commands
```bash
# Status and monitoring
pm2 list
pm2 show <process-name>
pm2 monit
pm2 logs <process-name>
# Restart specific processes
pm2 restart flyer-crawler-api
pm2 restart flyer-crawler-api flyer-crawler-worker flyer-crawler-analytics-worker
# Reload (zero-downtime, cluster mode only)
pm2 reload flyer-crawler-api
# Start from config
pm2 start ecosystem.config.cjs
pm2 start ecosystem.config.cjs --only flyer-crawler-api
```
### Dangerous Commands (Use With Caution)
```bash
# CAUTION: These affect ALL processes
pm2 stop all # Stops every PM2 process
pm2 restart all # Restarts every PM2 process
pm2 delete all # Removes every PM2 process
# CAUTION: Modifies saved process list
pm2 save # Overwrites saved process list
pm2 resurrect # Restores from saved list
# CAUTION: Affects PM2 daemon
pm2 kill # Kills PM2 daemon and all processes
pm2 update # Updates PM2 in place (may cause brief outage)
```
---
## Revision History
| Date | Author | Change |
| ---------- | ---------------------- | ------------------------ |
| 2026-02-17 | Incident Response Team | Initial runbook creation |