# DevOps Subagent Guide

This guide covers DevOps-related subagents for deployment, infrastructure, and operations:

- **devops**: Containers, services, CI/CD pipelines, deployments
- **infra-architect**: Resource optimization, capacity planning
- **bg-worker**: Background jobs, PM2 workers, BullMQ queues

---

## CRITICAL: Server Access Model

**Claude Code has READ-ONLY access to production/test servers.** The `claude-win10` user cannot execute write operations (PM2 restart, systemctl, file modifications) directly on servers.

The devops subagent must **provide commands for the user to execute**, not attempt to run them via SSH.

### Command Delegation Workflow

When troubleshooting or making changes to production/test servers:

| Phase    | Actor  | Action                                                      |
| -------- | ------ | ----------------------------------------------------------- |
| Diagnose | Claude | Provide read-only diagnostic commands                       |
| Report   | User   | Execute commands, share output with Claude                  |
| Analyze  | Claude | Interpret results, identify root cause                      |
| Fix      | Claude | Provide 1-3 fix commands (never more, errors may cascade)   |
| Execute  | User   | Run fix commands, report results                            |
| Verify   | Claude | Provide verification commands to confirm success            |
| Document | Claude | Update relevant documentation with findings and resolutions |

### Example: PM2 Process Issue

Step 1 - Diagnostic Commands (Claude provides, user runs):

```bash
# Check PM2 process status
pm2 list

# View recent error logs
pm2 logs flyer-crawler-api --err --lines 50

# Check system resources
free -h
df -h /var/www
```

Step 2 - User reports output to Claude

Step 3 - Fix Commands (Claude provides 1-3 at a time):

```bash
# Restart the failing process
pm2 restart flyer-crawler-api
```

Step 4 - User executes and reports result

Step 5 - Verification Commands:

```bash
# Confirm process is running
pm2 list

# Test API health
curl -s https://flyer-crawler.projectium.com/api/health/ready | jq .
```

### What NOT to Do

```bash
# WRONG - Claude cannot execute this directly
ssh root@projectium.com "pm2 restart all"

# WRONG - Providing too many commands at once
pm2 stop all && rm -rf node_modules && npm install && pm2 start all

# WRONG - Assuming commands succeeded without user confirmation
```

---

## The devops Subagent

### When to Use

Use the **devops** subagent when you need to:

- Debug container issues in development
- Modify CI/CD pipelines
- Configure PM2 for production
- Update deployment workflows
- Troubleshoot service startup issues
- Configure NGINX or reverse proxy
- Set up SSL/TLS certificates

### What devops Knows

The devops subagent understands:

- Podman/Docker container management
- Dev container configuration (`.devcontainer/`)
- Compose files (`compose.dev.yml`)
- PM2 ecosystem configuration
- Gitea Actions CI/CD workflows
- NGINX configuration
- Systemd service management

### Development Environment

**Container Architecture:**

```
┌─────────────────────────────────────────────────────────────┐
│                   Development Environment                   │
├─────────────────────────────────────────────────────────────┤
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐      │
│  │     app     │    │  postgres   │    │    redis    │      │
│  │  (Node.js)  │───►│  (PostGIS)  │    │   (Cache)   │      │
│  │             │───►│             │    │             │      │
│  └─────────────┘    └─────────────┘    └─────────────┘      │
│    :3000/:3001           :5432              :6379           │
└─────────────────────────────────────────────────────────────┘
```

**Container Services:**

| Service    | Image                   | Purpose                | Port       |
| ---------- | ----------------------- | ---------------------- | ---------- |
| `app`      | Custom (Dockerfile.dev) | Node.js application    | 3000, 3001 |
| `postgres` | postgis/postgis:15-3.4  | Database with PostGIS  | 5432       |
| `redis`    | redis:alpine            | Caching and job queues | 6379       |

### Example Requests

**Container debugging:**

```
"Use devops to debug why the dev container fails to start.
The postgres service shows as unhealthy and the app can't connect."
```

**CI/CD pipeline update:**

```
"Use devops to add a step to the deploy-to-test.yml workflow that runs database migrations before restarting the app."
```

**PM2 configuration:**

```
"Use devops to update the PM2 ecosystem config to use cluster mode with 4 instances instead of max for the API server."
```

### Container Commands Reference

```bash
# Start development environment
podman-compose -f compose.dev.yml up -d

# View container logs
podman-compose -f compose.dev.yml logs -f app

# Restart specific service
podman-compose -f compose.dev.yml restart app

# Rebuild container (after Dockerfile changes)
podman-compose -f compose.dev.yml build app

# Reset everything
podman-compose -f compose.dev.yml down -v
podman-compose -f compose.dev.yml up -d --build

# Enter container shell
podman exec -it flyer-crawler-dev bash

# Run tests in container (from Windows)
podman exec -it flyer-crawler-dev npm run test:unit
```

### Git Bash Path Conversion (Windows)

When running commands from Git Bash on Windows, paths may be incorrectly converted:

| Solution                   | Example                                                  |
| -------------------------- | -------------------------------------------------------- |
| `sh -c` with single quotes | `podman exec container sh -c '/usr/local/bin/script.sh'` |
| Double slashes             | `podman exec container //usr//local//bin//script.sh`     |
| MSYS_NO_PATHCONV=1         | `MSYS_NO_PATHCONV=1 podman exec ...`                     |

### PM2 Production Configuration

**ecosystem.config.cjs Structure:**

```javascript
module.exports = {
  apps: [
    {
      name: 'flyer-crawler-api',
      script: './node_modules/.bin/tsx',
      args: 'server.ts',
      instances: 'max', // Use all CPU cores
      exec_mode: 'cluster', // Enable cluster mode
      max_memory_restart: '500M',
      kill_timeout: 5000, // Graceful shutdown
      env_production: {
        NODE_ENV: 'production',
        cwd: '/var/www/flyer-crawler.projectium.com',
      },
    },
    {
      name: 'flyer-crawler-worker',
      script: './node_modules/.bin/tsx',
      args: 'src/services/worker.ts',
      instances: 1, // Single instance for workers
      max_memory_restart: '1G',
      kill_timeout: 10000, // Workers need more time
    },
  ],
};
```

**PM2 Commands:**

```bash
# Start/reload with environment
pm2 startOrReload ecosystem.config.cjs --env production --update-env

# Save process list
pm2 save

# View logs
pm2 logs flyer-crawler-api --lines 50

# Monitor processes
pm2 monit

# Describe process
pm2 describe flyer-crawler-api
```

### CI/CD Workflow Files

| File                                  | Purpose                     |
| ------------------------------------- | --------------------------- |
| `.gitea/workflows/deploy-to-prod.yml` | Production deployment       |
| `.gitea/workflows/deploy-to-test.yml` | Test environment deployment |

**Deployment Flow:**

1. Push to `main` branch
2. Gitea Actions triggered
3. SSH to production server
4. Pull latest code
5. Install dependencies
6. Run build
7. Run migrations
8. Restart PM2 processes

### Directory Structure (Production)

```
/var/www/
├── flyer-crawler.projectium.com/       # Production
│   ├── server.ts
│   ├── ecosystem.config.cjs
│   ├── package.json
│   ├── flyer-images/
│   │   ├── icons/
│   │   └── archive/
│   └── logs/
│       └── app.log
└── flyer-crawler-test.projectium.com/  # Test environment
    └── ... (same structure)
```

## The infra-architect Subagent

### When to Use

Use the **infra-architect** subagent when you need to:

- Analyze resource usage and optimize
- Plan for scaling
- Reduce infrastructure costs
- Configure memory limits
- Analyze disk usage
- Plan capacity for growth

### What infra-architect Knows

The infra-architect subagent understands:

- Node.js memory management
- PostgreSQL resource tuning
- Redis memory configuration
- Container resource limits
- PM2 process monitoring
- Disk and storage management

### Example Requests

**Memory optimization:**

```
"Use infra-architect to analyze memory usage of the worker processes.
They're frequently hitting the 1GB limit and restarting."
```

**Capacity planning:**

```
"Use infra-architect to estimate resource requirements for handling 10x current traffic.
Include database, Redis, and application server recommendations."
```

**Cost optimization:**

```
"Use infra-architect to identify opportunities to reduce infrastructure costs without impacting performance."
```

### Resource Limits Reference

| Process          | Memory Limit | Notes                 |
| ---------------- | ------------ | --------------------- |
| API Server       | 500MB        | Per cluster instance  |
| Worker           | 1GB          | Single instance       |
| Analytics Worker | 1GB          | Single instance       |
| PostgreSQL       | System RAM   | Tune `shared_buffers` |
| Redis            | 256MB        | `maxmemory` setting   |

## The bg-worker Subagent

### When to Use

Use the **bg-worker** subagent when you need to:

- Debug BullMQ queue issues
- Add new background job types
- Configure job retry logic
- Analyze job processing failures
- Optimize worker performance
- Handle job timeouts

### What bg-worker Knows

The bg-worker subagent understands:

- BullMQ queue patterns
- PM2 worker configuration
- Job retry and backoff strategies
- Queue monitoring and debugging
- Redis connection for queues
- Worker health checks (ADR-053)

### Queue Architecture

```
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   API Server    │───►│  Redis (BullMQ) │◄───│     Worker      │
│                 │    │                 │    │                 │
│   queue.add()   │    │   flyerQueue    │    │  process jobs   │
│                 │    │  cleanupQueue   │    │                 │
└─────────────────┘    │ analyticsQueue  │    └─────────────────┘
                       └─────────────────┘
```

### Example Requests

**Debugging stuck jobs:**

```
"Use bg-worker to debug why jobs are stuck in the flyer processing queue.
Check for failed jobs, worker status, and Redis connectivity."
```

**Adding retry logic:**

```
"Use bg-worker to add exponential backoff retry logic to the AI extraction job.
It should retry up to 3 times with increasing delays for rate limit errors."
```

**Queue monitoring:**

```
"Use bg-worker to add health check endpoints for monitoring queue depth and worker status."
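```

### Queue Configuration

```typescript
// src/services/queues.server.ts
import { Queue } from 'bullmq';
// redisConnection is the project's shared Redis connection, defined elsewhere.

export const flyerQueue = new Queue('flyer-processing', {
  connection: redisConnection,
  defaultJobOptions: {
    attempts: 3, // Total tries per job, including the first
    backoff: {
      type: 'exponential',
      delay: 1000, // Base delay in milliseconds
    },
    removeOnComplete: { count: 100 }, // Keep the last 100 completed jobs
    removeOnFail: { count: 1000 }, // Keep the last 1000 failed jobs
  },
});
```

### Worker Configuration

```typescript
// src/services/workers.server.ts
import { Worker } from 'bullmq';
// redisConnection is the project's shared Redis connection, defined elsewhere.

export const flyerWorker = new Worker(
  'flyer-processing',
  async (job) => {
    // Process job
  },
  {
    connection: redisConnection,
    concurrency: 5, // Jobs processed in parallel by this worker
    limiter: {
      max: 10, // At most 10 jobs...
      duration: 1000, // ...per 1000 ms
    },
  },
);
```

### Monitoring Queues

```bash
# Check queue status via Redis (one-shot commands, no interactive prompt)

# List BullMQ keys
redis-cli -a "$REDIS_PASSWORD" KEYS 'bull:*'

# Count waiting jobs
redis-cli -a "$REDIS_PASSWORD" LLEN bull:flyer-processing:wait

# List failed job IDs
redis-cli -a "$REDIS_PASSWORD" ZRANGE bull:flyer-processing:failed 0 -1
```

## Service Management Commands

> **Note**: These commands are for the **user to execute on the server**. Claude Code provides these commands but cannot run them directly due to read-only server access. See [Server Access Model](#critical-server-access-model) above.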
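The delegation rule behind this note (Claude provides at most three fix commands per round, then waits for the user's report) can be sketched as a small batching helper. This is an illustration only: `batchFixCommands` and `MAX_COMMANDS_PER_ROUND` are hypothetical names that do not exist in the project, and the command list is just an example.

```typescript
// Hypothetical helper, not part of the project. It illustrates the
// "1-3 fix commands per round" rule from the Command Delegation Workflow.
const MAX_COMMANDS_PER_ROUND = 3;

// Split a list of fix commands into rounds of at most `size` commands.
// Each round is handed to the user, who runs it and reports the result
// before the next round is provided.
function batchFixCommands(
  commands: string[],
  size: number = MAX_COMMANDS_PER_ROUND,
): string[][] {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError('size must be a positive integer');
  }
  const rounds: string[][] = [];
  for (let i = 0; i < commands.length; i += size) {
    rounds.push(commands.slice(i, i + size));
  }
  return rounds;
}

// Example: four commands become two rounds (three commands, then one).
const rounds = batchFixCommands([
  'pm2 restart flyer-crawler-api',
  'pm2 restart flyer-crawler-worker',
  'pm2 save',
  'pm2 list',
]);

rounds.forEach((round, index) => {
  console.log(`Round ${index + 1}:`);
  round.forEach((command) => console.log(`  ${command}`));
});
```

Keeping each round small means a failure surfaces before later commands run against a server in an unknown state, which is the reason the workflow gives for the limit.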
### PM2 Commands

```bash
# Start/reload
pm2 startOrReload ecosystem.config.cjs --env production --update-env && pm2 save

# View status
pm2 list
pm2 status

# View logs
pm2 logs
pm2 logs flyer-crawler-api --lines 100

# Restart specific process
pm2 restart flyer-crawler-api
pm2 restart flyer-crawler-worker

# Stop all
pm2 stop all

# Delete all
pm2 delete all
```

### Systemd Services (Production)

| Service    | Command                                               |
| ---------- | ----------------------------------------------------- |
| PostgreSQL | `sudo systemctl {start\|stop\|status} postgresql`       |
| Redis      | `sudo systemctl {start\|stop\|status} redis-server`     |
| NGINX      | `sudo systemctl {start\|stop\|status} nginx`            |
| Bugsink    | `sudo systemctl {start\|stop\|status} gunicorn-bugsink` |
| Logstash   | `sudo systemctl {start\|stop\|status} logstash`         |

### Health Checks

```bash
# API health check
curl http://localhost:3001/api/health

# PM2 health
pm2 list

# PostgreSQL health
pg_isready -h localhost -p 5432

# Redis health
redis-cli -a "$REDIS_PASSWORD" ping
```

## Troubleshooting Guide

### Container Won't Start

1. Check container logs: `podman-compose logs app`
2. Verify services are healthy: `podman-compose ps`
3. Check environment variables in `compose.dev.yml`
4. Try rebuilding: `podman-compose build --no-cache app`

### Tests Fail in Container but Pass Locally

Tests must run in the Linux container environment:

```bash
# Wrong (Windows)
npm test

# Correct (in container)
podman exec -it flyer-crawler-dev npm test
```

### PM2 Process Keeps Restarting

1. Check logs: `pm2 logs <process-name>`
2. Check memory usage: `pm2 monit`
3. Verify environment variables: `pm2 env <process-id>`
4. Check for unhandled errors in application code

### Database Connection Refused

1. Verify PostgreSQL is running
2. Check connection string in environment
3. Verify database user has permissions
4. Check `pg_hba.conf` for allowed connections

### Redis Connection Issues

1. Verify Redis is running: `redis-cli ping`
2. Check password in environment variables
3. Verify Redis is listening on expected port
4. Check `maxmemory` setting if queue operations fail

## Related Documentation

- [OVERVIEW.md](./OVERVIEW.md) - Subagent system overview
- [../BARE-METAL-SETUP.md](../BARE-METAL-SETUP.md) - Production setup guide
- [../adr/0014-containerization-and-deployment-strategy.md](../adr/0014-containerization-and-deployment-strategy.md) - Containerization ADR
- [../adr/0006-background-job-processing-and-task-queues.md](../adr/0006-background-job-processing-and-task-queues.md) - Background jobs ADR
- [../adr/0017-ci-cd-and-branching-strategy.md](../adr/0017-ci-cd-and-branching-strategy.md) - CI/CD strategy
- [../adr/0053-worker-health-checks.md](../adr/0053-worker-health-checks.md) - Worker health checks