no longer use cluster mode - fix was found but not enough traffic anyway
All checks were successful
Deploy to Test Environment / deploy-to-test (push) Successful in 41m58s

2026-02-19 16:44:19 -08:00
parent 209d7ceba1
commit dfeb2f1c3d
4 changed files with 173 additions and 7 deletions


@@ -255,11 +255,13 @@ The dev container now matches production by using PM2 for process management.
| Component | Production | Dev Container |
| ---------- | ---------------------- | ------------------------- |
-| API Server | PM2 cluster mode | PM2 fork mode + tsx watch |
+| API Server | PM2 fork mode | PM2 fork mode + tsx watch |
| Worker | PM2 process | PM2 process + tsx watch |
| Frontend | Static files via NGINX | PM2 + Vite dev server |
| Logs | PM2 logs -> Logstash | PM2 logs -> Logstash |
+**Note:** PM2 cluster mode is incompatible with `tsx` as the script path. See [PM2-CLUSTER-MODE-INCOMPATIBILITY.md](docs/operations/PM2-CLUSTER-MODE-INCOMPATIBILITY.md).
**PM2 Processes in Dev Container**:
- `flyer-crawler-api-dev` - API server (port 3001)


@@ -49,7 +49,7 @@ Tests that pass on Windows but fail on Linux are considered **broken tests**. Te
We will standardize the deployment process using a hybrid approach:
-1. **PM2 for Production**: Use PM2 cluster mode for process management, load balancing, and zero-downtime reloads.
+1. **PM2 for Production**: Use PM2 for process management. ~~Cluster mode~~ Fork mode is used because of the tsx incompatibility (see [PM2-CLUSTER-MODE-INCOMPATIBILITY.md](../operations/PM2-CLUSTER-MODE-INCOMPATIBILITY.md)).
2. **Docker/Podman for Development**: Provide a complete containerized development environment with automatic initialization.
3. **VS Code Dev Containers**: Enable one-click development environment setup.
4. **Gitea Actions for CI/CD**: Automated deployment pipelines handle builds and deployments.
@@ -187,13 +187,13 @@ Located in `ecosystem.config.cjs`:
module.exports = {
apps: [
{
-// API Server - Cluster mode for load balancing
+// API Server - Fork mode (tsx incompatible with cluster)
name: 'flyer-crawler-api',
script: './node_modules/.bin/tsx',
args: 'server.ts',
max_memory_restart: '500M',
-instances: 'max', // Use all CPU cores
-exec_mode: 'cluster', // Enable cluster mode
+instances: 1, // Fork mode - single instance
+exec_mode: 'fork', // tsx as the script path requires fork mode
kill_timeout: 5000, // Graceful shutdown timeout
// Restart configuration
@@ -358,6 +358,42 @@ podman-compose -f compose.dev.yml build app
**Rationale**: Developers and CI systems should never need to run manual setup commands to execute tests. If the container is running, tests should work. Any deviation from this principle indicates an incomplete container setup.
## Updates
### 2026-02-19: Cluster Mode Disabled
**Decision:** Disabled PM2 cluster mode in favor of fork mode (single instance).
**Reason:** PM2 cluster mode is fundamentally incompatible with using `tsx` as the script path. The configuration pattern:
```javascript
{
script: './node_modules/.bin/tsx',
args: 'server.ts',
exec_mode: 'cluster',
}
```
causes 75-87% of cluster instances to fail on startup with no clear error messages. Only 1-2 out of 8 instances successfully start, with the rest showing constant restart attempts.
**Technical Root Cause:** PM2 requires the `node` binary as the interpreter to properly fork cluster workers using Node.js's native cluster module. When `tsx` is the script path, PM2 cannot create cluster workers correctly.
**Alternative:** To use cluster mode with TypeScript in the future, the correct configuration is:
```javascript
{
script: 'server.ts',
interpreter: 'node',
interpreter_args: '--import tsx', // Node 18.19+
exec_mode: 'cluster',
instances: 'max',
}
```
**Impact:** Current traffic does not require cluster mode load balancing. Fork mode with a single instance provides reliable process management without the cluster overhead.
**Documentation:** See [PM2-CLUSTER-MODE-INCOMPATIBILITY.md](../operations/PM2-CLUSTER-MODE-INCOMPATIBILITY.md) for full details.
## Related ADRs
- [ADR-017](./0017-ci-cd-and-branching-strategy.md) - CI/CD Strategy


@@ -0,0 +1,128 @@
# PM2 Cluster Mode Incompatibility with tsx
**Date Documented:** 2026-02-19
**Affected Version:** v0.21.0
**Status:** Resolved by switching to fork mode
## Issue Summary
PM2 cluster mode is fundamentally incompatible with using `tsx` as the script path in the ecosystem configuration. This manifests as 6-7 out of 8 cluster instances remaining in `errored` or `stopped` state with constant restart attempts (16-17 restarts observed), while only 1-2 instances successfully start.
## Root Cause
The configuration pattern used in [ecosystem.config.cjs:68-73](../../ecosystem.config.cjs#L68-L73):
```javascript
{
script: './node_modules/.bin/tsx',
args: 'server.ts',
exec_mode: 'cluster',
instances: 'max',
}
```
**This is incompatible with PM2's cluster mode** because PM2 requires the `node` binary as the interpreter to properly fork worker processes using Node.js's native cluster module. When `tsx` is used as the script, PM2 cannot create cluster workers correctly.
## Technical Explanation
PM2's cluster mode uses Node.js's built-in `cluster` module to spawn multiple worker processes that share the same server port. This architecture requires:
1. **The interpreter must be `node`** - PM2 forks the main process using `node` as the binary
2. **TypeScript transpilation must happen via Node.js loaders** - tsx or ts-node must be loaded as a Node.js module loader, not as the executable
When `tsx` is the script path (not the interpreter), PM2 attempts to treat it as the application entry point, which breaks the cluster fork mechanism. This causes worker processes to fail on startup with no clear error messages in logs (as observed: empty log files despite 16-17 restart attempts).
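A pre-deploy check could catch this pattern before it reaches production by inspecting each app entry in `ecosystem.config.cjs`. The sketch below is hypothetical: `findClusterModeProblems` and its messages are our own names, not a PM2 feature. Its rules encode only what this document describes:

- no `tsx` or `ts-node` wrapper as the cluster-mode script;
- `node` as the interpreter;
- a tsx loader in `interpreter_args` when the entry point is a `.ts` file.

```javascript
// Hypothetical pre-deploy check (not a PM2 feature): returns a list of
// problems for one ecosystem app entry that uses cluster mode.
function findClusterModeProblems(app) {
  const problems = [];
  if (app.exec_mode !== 'cluster') {
    return problems; // fork mode is outside the scope of this check
  }
  const script = String(app.script ?? '');
  const scriptName = script.split('/').pop();
  // A TypeScript CLI wrapper as the script is the v0.21.0 failure pattern
  if (scriptName === 'tsx' || scriptName === 'ts-node') {
    problems.push(`${app.name}: '${script}' is a CLI wrapper; use the .ts entry file as the script`);
  }
  // Cluster workers need node as the interpreter (a full path to node is accepted)
  if (app.interpreter !== undefined && String(app.interpreter).split('/').pop() !== 'node') {
    problems.push(`${app.name}: cluster mode needs interpreter 'node', got '${app.interpreter}'`);
  }
  // A .ts entry file needs tsx registered as a loader through interpreter_args
  const loaderArgs = String(app.interpreter_args ?? '');
  if (/\.tsx?$/.test(script) && !/--(import|loader)[ =]tsx\b/.test(loaderArgs)) {
    problems.push(`${app.name}: '${script}' needs interpreter_args '--import tsx' or '--loader tsx'`);
  }
  return problems;
}
```

A deploy step could run this over `require('./ecosystem.config.cjs').apps` and fail whenever any returned array is non-empty. That would have flagged the v0.21.0 configuration, while the fork-mode entry now in use passes unchanged.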
## The Correct Way to Use tsx with Cluster Mode
If cluster mode is required in the future, the proper configuration is:
```javascript
{
name: 'flyer-crawler-api',
script: 'server.ts', // Direct TypeScript file
interpreter: 'node', // Use node as interpreter
interpreter_args: '--import tsx', // tsx as Node.js loader (Node 18.19+)
exec_mode: 'cluster',
instances: 'max',
}
```
**Important:** `--import tsx` requires Node.js 18.19+ or 20.6+. On older releases (18.18 and below, 19.x, or 20.0 to 20.5), use `--loader tsx` instead.
This configuration:
- Uses `node` as the interpreter (required for cluster mode)
- Loads `tsx` as a Node.js module loader via `--import`
- Allows PM2 to properly fork cluster workers
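The `--import`/`--loader` choice depends on the Node.js version that runs the workers. The sketch below computes the flag instead of hard-coding it. It assumes the cutoffs given in the tsx documentation:

- `--import tsx` works from Node.js 20.6 and 18.19;
- `--loader tsx` is needed on older releases, including all of Node.js 19.

`tsxLoaderFlag` is a hypothetical name, not a tsx or PM2 API.

```javascript
// Hypothetical helper: pick the tsx flag for PM2's interpreter_args.
// Assumed cutoffs (from the tsx docs as we read them): `--import tsx` needs
// Node.js 20.6+ or 18.19+; everything older, including Node.js 19, needs `--loader tsx`.
function tsxLoaderFlag(nodeVersion = process.versions.node) {
  // Accept both "22.22.0" (process.versions.node) and "v22.22.0" (node --version)
  const [major, minor] = nodeVersion.replace(/^v/, '').split('.').map(Number);
  const supportsImport =
    major >= 21 ||
    (major === 20 && minor >= 6) ||
    (major === 18 && minor >= 19);
  return supportsImport ? '--import tsx' : '--loader tsx';
}
```

In `ecosystem.config.cjs`, `interpreter_args: tsxLoaderFlag()` would evaluate to `'--import tsx'` on the production Node.js v22.22.0. One caveat: the config file is evaluated by the Node.js process running the PM2 CLI. `process.versions.node` therefore reflects that binary, which is normally, but not necessarily, the same `node` the `interpreter` field resolves to.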
## Current Resolution
Since the application does not have high enough traffic to require cluster mode load balancing, we switched to **fork mode** with a single instance:
```javascript
{
instances: 1,
exec_mode: 'fork',
}
```
This provides:
- ✅ Reliable process startup
- ✅ Proper TypeScript execution via tsx
- ✅ PM2 process management (restarts, logs, monitoring)
- ❌ No load balancing across CPU cores (acceptable for current traffic)
## When to Reconsider Cluster Mode
Cluster mode should be reconsidered when:
1. **Traffic increases significantly** - Multiple CPU cores needed for request handling
2. **Zero-downtime deploys are required** - `pm2 reload` only avoids downtime in cluster mode; in fork mode it performs a regular restart
3. **Configuration is updated** - Use the correct `interpreter` + `interpreter_args` pattern above
## Diagnostic Commands Used
```bash
# Check PM2 process status
pm2 ps
# View logs for failing instances
pm2 logs flyer-crawler-api --namespace flyer-crawler-prod --lines 100 --nostream
# Get detailed process information
pm2 describe 10 --namespace flyer-crawler-prod
# Check Node.js and PM2 versions
node --version # v22.22.0
pm2 --version # v6.0.13
```
## Evidence from Production
```
┌────┬────────────────────────────────────────┬─────────────┬─────────┬─────────┬──────────┬────────┬──────┬───────────┐
│ id │ name │ namespace │ version │ mode │ pid │ uptime │ ↺ │ status │
├────┼────────────────────────────────────────┼─────────────┼─────────┼─────────┼──────────┼────────┼──────┼───────────┤
│ 10 │ flyer-crawler-api │ flyer-craw… │ 0.21.0 │ cluster │ 2032773 │ 6m │ 0 │ online │
│ 13 │ flyer-crawler-api │ flyer-craw… │ 0.21.0 │ cluster │ 2044125 │ 0 │ 16 │ errored │
│ 14 │ flyer-crawler-api │ flyer-craw… │ 0.21.0 │ cluster │ 2044198 │ 0 │ 16 │ errored │
│ 15 │ flyer-crawler-api │ flyer-craw… │ 0.21.0 │ cluster │ 2044221 │ 0 │ 16 │ errored │
│ 16 │ flyer-crawler-api │ flyer-craw… │ 0.21.0 │ cluster │ 2044179 │ 0 │ 16 │ errored │
│ 17 │ flyer-crawler-api │ flyer-craw… │ 0.21.0 │ cluster │ 2044358 │ 12s │ 17 │ online │
│ 18 │ flyer-crawler-api │ flyer-craw… │ 0.21.0 │ cluster │ 2044100 │ 0 │ 16 │ errored │
│ 19 │ flyer-crawler-api │ flyer-craw… │ 0.21.0 │ cluster │ 2044243 │ 0 │ 17 │ errored │
└────┴────────────────────────────────────────┴─────────────┴─────────┴─────────┴──────────┴────────┴──────┴───────────┘
```
Only 2 out of 8 cluster instances were online; the rest showed 16-17 restart attempts with empty log files.
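With eight rows sharing one name, the `pm2 ps` table is easy to misread. A small tally over the JSON printed by `pm2 jlist` could make the same check scriptable. `summarizeInstances` is a hypothetical helper, not part of PM2. The fields it reads are our understanding of PM2's process description and should be verified against PM2 v6.0.13:

- `name` and `pm_id`;
- `pm2_env.status` and `pm2_env.restart_time`;
- `pm2_env.namespace`.

```javascript
// Hypothetical helper (not part of PM2): tally instances of one app from the
// parsed JSON array that `pm2 jlist` prints. Field names are assumptions.
function summarizeInstances(processes, appName, namespace) {
  const summary = { total: 0, byStatus: {}, restartsById: {} };
  for (const proc of processes) {
    if (proc.name !== appName) continue;
    // Optional namespace filter (assumes pm2_env.namespace is populated)
    if (namespace !== undefined && proc.pm2_env?.namespace !== namespace) continue;
    const status = proc.pm2_env?.status ?? 'unknown';
    summary.total += 1;
    summary.byStatus[status] = (summary.byStatus[status] ?? 0) + 1;
    // restart_time is believed to back the restart column of `pm2 ps`
    summary.restartsById[proc.pm_id] = proc.pm2_env?.restart_time ?? 0;
  }
  return summary;
}
```

Assuming those fields, take the parsed `pm2 jlist` output for the state in the evidence table above and pass it to the helper with `'flyer-crawler-api'` and `'flyer-crawler-prod'`. It would report 8 instances: 2 `online` and 6 `errored`. The namespace filter matters because ADR-063 runs same-named processes in separate namespaces.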
## References
- [PM2 — Use TSX to Start Your App](https://futurestud.io/tutorials/pm2-use-tsx-to-start-your-app)
- [Running typescript app with pm2 and tsx](https://blog.vramana.com/posts/2023-02-05-pm2-tsx/)
- [PM2 and cluster mode in Node.js/TypeScript · Issue #5790](https://github.com/Unitech/pm2/issues/5790)
- [PM2 - Cluster Mode Documentation](https://pm2.keymetrics.io/docs/usage/cluster-mode/)
- [PM2 - Transpilers Integration](https://pm2.io/docs/runtime/integration/transpilers/)
## Related ADRs
- [ADR-014: Containerization and Deployment Strategy](../adr/0014-containerization-and-deployment-strategy.md) - Original cluster mode decision
- [ADR-063: PM2 Namespace Implementation](../adr/0063-pm2-namespace-implementation.md) - PM2 namespace isolation


@@ -69,8 +69,8 @@ module.exports = {
args: 'server.ts',
cwd: '/var/www/flyer-crawler.projectium.com',
max_memory_restart: '500M',
-instances: 'max',
-exec_mode: 'cluster',
+instances: 1,
+exec_mode: 'fork',
kill_timeout: 5000,
log_date_format: 'YYYY-MM-DD HH:mm:ss Z',
max_restarts: 40,