diff --git a/CLAUDE.md b/CLAUDE.md
index 44bd46d2..7db07250 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -255,11 +255,13 @@ The dev container now matches production by using PM2 for process management.
 
 | Component  | Production             | Dev Container             |
 | ---------- | ---------------------- | ------------------------- |
-| API Server | PM2 cluster mode       | PM2 fork mode + tsx watch |
+| API Server | PM2 fork mode          | PM2 fork mode + tsx watch |
 | Worker     | PM2 process            | PM2 process + tsx watch   |
 | Frontend   | Static files via NGINX | PM2 + Vite dev server     |
 | Logs       | PM2 logs -> Logstash   | PM2 logs -> Logstash      |
 
+**Note:** PM2 cluster mode is incompatible with tsx as script path. See [PM2-CLUSTER-MODE-INCOMPATIBILITY.md](docs/operations/PM2-CLUSTER-MODE-INCOMPATIBILITY.md).
+
 **PM2 Processes in Dev Container**:
 
 - `flyer-crawler-api-dev` - API server (port 3001)
diff --git a/docs/adr/0014-containerization-and-deployment-strategy.md b/docs/adr/0014-containerization-and-deployment-strategy.md
index 94851fce..8b09240b 100644
--- a/docs/adr/0014-containerization-and-deployment-strategy.md
+++ b/docs/adr/0014-containerization-and-deployment-strategy.md
@@ -49,7 +49,7 @@ Tests that pass on Windows but fail on Linux are considered **broken tests**. Te
 
 We will standardize the deployment process using a hybrid approach:
 
-1. **PM2 for Production**: Use PM2 cluster mode for process management, load balancing, and zero-downtime reloads.
+1. **PM2 for Production**: Use PM2 for process management. ~~Cluster mode~~ Fork mode used due to tsx incompatibility (see [PM2-CLUSTER-MODE-INCOMPATIBILITY.md](../operations/PM2-CLUSTER-MODE-INCOMPATIBILITY.md)).
 2. **Docker/Podman for Development**: Provide a complete containerized development environment with automatic initialization.
 3. **VS Code Dev Containers**: Enable one-click development environment setup.
 4. **Gitea Actions for CI/CD**: Automated deployment pipelines handle builds and deployments.
@@ -187,13 +187,13 @@ Located in `ecosystem.config.cjs`:
 module.exports = {
   apps: [
     {
-      // API Server - Cluster mode for load balancing
+      // API Server - Fork mode (tsx incompatible with cluster)
       name: 'flyer-crawler-api',
       script: './node_modules/.bin/tsx',
       args: 'server.ts',
       max_memory_restart: '500M',
-      instances: 'max', // Use all CPU cores
-      exec_mode: 'cluster', // Enable cluster mode
+      instances: 1, // Fork mode - single instance
+      exec_mode: 'fork', // tsx requires fork mode
       kill_timeout: 5000, // Graceful shutdown timeout
 
       // Restart configuration
@@ -358,6 +358,42 @@ podman-compose -f compose.dev.yml build app
 
 **Rationale**: Developers and CI systems should never need to run manual setup commands to execute tests. If the container is running, tests should work. Any deviation from this principle indicates an incomplete container setup.
 
+## Updates
+
+### 2026-02-19: Cluster Mode Disabled
+
+**Decision:** Disabled PM2 cluster mode in favor of fork mode (single instance).
+
+**Reason:** PM2 cluster mode is fundamentally incompatible with using `tsx` as the script path. The configuration pattern:
+
+```javascript
+{
+  script: './node_modules/.bin/tsx',
+  args: 'server.ts',
+  exec_mode: 'cluster',
+}
+```
+
+causes 75-87% of cluster instances to fail on startup with no clear error messages. Only 1-2 out of 8 instances successfully start, with the rest showing constant restart attempts.
+
+**Technical Root Cause:** PM2 requires the `node` binary as the interpreter to properly fork cluster workers using Node.js's native cluster module. When `tsx` is the script path, PM2 cannot create cluster workers correctly.
+
+**Alternative:** To use cluster mode with TypeScript in the future, the correct configuration is:
+
+```javascript
+{
+  script: 'server.ts',
+  interpreter: 'node',
+  interpreter_args: '--import tsx', // Node 18.19+
+  exec_mode: 'cluster',
+  instances: 'max',
+}
+```
+
+**Impact:** Current traffic does not require cluster mode load balancing. Fork mode with a single instance provides reliable process management without the cluster overhead.
+
+**Documentation:** See [PM2-CLUSTER-MODE-INCOMPATIBILITY.md](../operations/PM2-CLUSTER-MODE-INCOMPATIBILITY.md) for full details.
+
 ## Related ADRs
 
 - [ADR-017](./0017-ci-cd-and-branching-strategy.md) - CI/CD Strategy
diff --git a/docs/operations/PM2-CLUSTER-MODE-INCOMPATIBILITY.md b/docs/operations/PM2-CLUSTER-MODE-INCOMPATIBILITY.md
new file mode 100644
index 00000000..0b5feb46
--- /dev/null
+++ b/docs/operations/PM2-CLUSTER-MODE-INCOMPATIBILITY.md
@@ -0,0 +1,128 @@
+# PM2 Cluster Mode Incompatibility with tsx
+
+**Date Documented:** 2026-02-19
+**Affected Version:** v0.21.0
+**Status:** Resolved by switching to fork mode
+
+## Issue Summary
+
+PM2 cluster mode is fundamentally incompatible with using `tsx` as the script path in the ecosystem configuration. This manifests as 6-7 out of 8 cluster instances remaining in `errored` or `stopped` state with constant restart attempts (16-17 restarts observed), while only 1-2 instances successfully start.
+
+## Root Cause
+
+The configuration pattern used in [ecosystem.config.cjs:68-73](../../ecosystem.config.cjs#L68-L73):
+
+```javascript
+{
+  script: './node_modules/.bin/tsx',
+  args: 'server.ts',
+  exec_mode: 'cluster',
+  instances: 'max',
+}
+```
+
+**This is incompatible with PM2's cluster mode** because PM2 requires the `node` binary as the interpreter to properly fork worker processes using Node.js's native cluster module. When `tsx` is used as the script, PM2 cannot create cluster workers correctly.
+
+## Technical Explanation
+
+PM2's cluster mode uses Node.js's built-in `cluster` module to spawn multiple worker processes that share the same server port. This architecture requires:
+
+1. **The interpreter must be `node`** - PM2 forks the main process using `node` as the binary
+2. **TypeScript transpilation must happen via Node.js loaders** - tsx or ts-node must be loaded as a Node.js module loader, not as the executable
+
+When `tsx` is the script path (not the interpreter), PM2 attempts to treat it as the application entry point, which breaks the cluster fork mechanism. This causes worker processes to fail on startup with no clear error messages in logs (as observed: empty log files despite 16-17 restart attempts).
+
+## The Correct Way to Use tsx with Cluster Mode
+
+If cluster mode is required in the future, the proper configuration is:
+
+```javascript
+{
+  name: 'flyer-crawler-api',
+  script: 'server.ts', // Direct TypeScript file
+  interpreter: 'node', // Use node as interpreter
+  interpreter_args: '--import tsx', // tsx as Node.js loader (Node 18.19+)
+  exec_mode: 'cluster',
+  instances: 'max',
+}
+```
+
+**Important:** For Node.js 18.18 and below, use `--loader tsx` instead of `--import tsx`.
+
+This configuration:
+- Uses `node` as the interpreter (required for cluster mode)
+- Loads `tsx` as a Node.js module loader via `--import`
+- Allows PM2 to properly fork cluster workers
+
+## Current Resolution
+
+Since the application does not have high enough traffic to require cluster mode load balancing, we switched to **fork mode** with a single instance:
+
+```javascript
+{
+  instances: 1,
+  exec_mode: 'fork',
+}
+```
+
+This provides:
+- ✅ Reliable process startup
+- ✅ Proper TypeScript execution via tsx
+- ✅ PM2 process management (restarts, logs, monitoring)
+- ❌ No load balancing across CPU cores (acceptable for current traffic)
+
+## When to Reconsider Cluster Mode
+
+Cluster mode should be reconsidered when:
+1. **Traffic increases significantly** - Multiple CPU cores needed for request handling
+2. **Zero-downtime deploys are required** - PM2 reload works only in cluster mode
+3. **Configuration is updated** - Use the correct `interpreter` + `interpreter_args` pattern above
+
+## Diagnostic Commands Used
+
+```bash
+# Check PM2 process status
+pm2 ps
+
+# View logs for failing instances
+pm2 logs flyer-crawler-api --namespace flyer-crawler-prod --lines 100 --nostream
+
+# Get detailed process information
+pm2 describe 10 --namespace flyer-crawler-prod
+
+# Check Node.js and PM2 versions
+node --version  # v22.22.0
+pm2 --version   # v6.0.13
+```
+
+## Evidence from Production
+
+```
+┌────┬────────────────────────────────────────┬─────────────┬─────────┬─────────┬──────────┬────────┬──────┬───────────┐
+│ id │ name                                   │ namespace   │ version │ mode    │ pid      │ uptime │ ↺    │ status    │
+├────┼────────────────────────────────────────┼─────────────┼─────────┼─────────┼──────────┼────────┼──────┼───────────┤
+│ 10 │ flyer-crawler-api                      │ flyer-craw… │ 0.21.0  │ cluster │ 2032773  │ 6m     │ 0    │ online    │
+│ 13 │ flyer-crawler-api                      │ flyer-craw… │ 0.21.0  │ cluster │ 2044125  │ 0      │ 16   │ errored   │
+│ 14 │ flyer-crawler-api                      │ flyer-craw… │ 0.21.0  │ cluster │ 2044198  │ 0      │ 16   │ errored   │
+│ 15 │ flyer-crawler-api                      │ flyer-craw… │ 0.21.0  │ cluster │ 2044221  │ 0      │ 16   │ errored   │
+│ 16 │ flyer-crawler-api                      │ flyer-craw… │ 0.21.0  │ cluster │ 2044179  │ 0      │ 16   │ errored   │
+│ 17 │ flyer-crawler-api                      │ flyer-craw… │ 0.21.0  │ cluster │ 2044358  │ 12s    │ 17   │ online    │
+│ 18 │ flyer-crawler-api                      │ flyer-craw… │ 0.21.0  │ cluster │ 2044100  │ 0      │ 16   │ errored   │
+│ 19 │ flyer-crawler-api                      │ flyer-craw… │ 0.21.0  │ cluster │ 2044243  │ 0      │ 17   │ errored   │
+└────┴────────────────────────────────────────┴─────────────┴─────────┴─────────┴──────────┴────────┴──────┴───────────┘
+```
+
+Only 2 out of 8 cluster instances were online; the rest showed 16-17 restart attempts with empty log files.
+
+## References
+
+- [PM2 — Use TSX to Start Your App](https://futurestud.io/tutorials/pm2-use-tsx-to-start-your-app)
+- [Running typescript app with pm2 and tsx](https://blog.vramana.com/posts/2023-02-05-pm2-tsx/)
+- [PM2 and cluster mode in Node.js/TypeScript · Issue #5790](https://github.com/Unitech/pm2/issues/5790)
+- [PM2 - Cluster Mode Documentation](https://pm2.keymetrics.io/docs/usage/cluster-mode/)
+- [PM2 - Transpilers Integration](https://pm2.io/docs/runtime/integration/transpilers/)
+
+## Related ADRs
+
+- [ADR-014: Containerization and Deployment Strategy](../adr/0014-containerization-and-deployment-strategy.md) - Original cluster mode decision
+- [ADR-063: PM2 Namespace Implementation](../adr/0063-pm2-namespace-implementation.md) - PM2 namespace isolation
diff --git a/ecosystem.config.cjs b/ecosystem.config.cjs
index b9203984..49ddc495 100644
--- a/ecosystem.config.cjs
+++ b/ecosystem.config.cjs
@@ -69,8 +69,8 @@ module.exports = {
       args: 'server.ts',
       cwd: '/var/www/flyer-crawler.projectium.com',
       max_memory_restart: '500M',
-      instances: 'max',
-      exec_mode: 'cluster',
+      instances: 1,
+      exec_mode: 'fork',
       kill_timeout: 5000,
       log_date_format: 'YYYY-MM-DD HH:mm:ss Z',
       max_restarts: 40,
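
Review note: the incompatibility this change documents (cluster mode combined with `tsx` as the script path) could be guarded against with a small config check. The sketch below is not part of this change; the function name `findClusterTsxConflicts` and the two sample objects are hypothetical, and the check only looks at the `script` and `exec_mode` fields shown in the diff, so it is a rough illustration and not a complete validator.

```javascript
// Hypothetical guard, not part of this change: reports PM2 app entries that
// pair `exec_mode: 'cluster'` with tsx as the script path.
const path = require('node:path');

function findClusterTsxConflicts(config) {
  const apps = Array.isArray(config?.apps) ? config.apps : [];
  return apps
    .filter(
      (app) =>
        app.exec_mode === 'cluster' &&
        // './node_modules/.bin/tsx' has the basename 'tsx'
        path.basename(String(app.script ?? '')) === 'tsx',
    )
    .map((app) => app.name ?? '(unnamed)');
}

// Sample objects mirroring the removed and added lines in ecosystem.config.cjs.
const before = {
  apps: [
    {
      name: 'flyer-crawler-api',
      script: './node_modules/.bin/tsx',
      args: 'server.ts',
      instances: 'max',
      exec_mode: 'cluster',
    },
  ],
};
const after = {
  apps: [
    {
      name: 'flyer-crawler-api',
      script: './node_modules/.bin/tsx',
      args: 'server.ts',
      instances: 1,
      exec_mode: 'fork',
    },
  ],
};

console.log(findClusterTsxConflicts(before)); // [ 'flyer-crawler-api' ]
console.log(findClusterTsxConflicts(after)); // []
```

Under the same assumptions, a CI step could load the real file with `require('./ecosystem.config.cjs')` and fail when the returned list is non-empty. The `interpreter: 'node'` plus `interpreter_args: '--import tsx'` pattern from the new operations doc would pass this check, because its `script` is `server.ts`.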