DevOps Subagent Guide
This guide covers DevOps-related subagents for deployment, infrastructure, and operations:
- devops: Containers, services, CI/CD pipelines, deployments
- infra-architect: Resource optimization, capacity planning
- bg-worker: Background jobs, PM2 workers, BullMQ queues
CRITICAL: Server Access Model
Claude Code has READ-ONLY access to production/test servers.
The claude-win10 user cannot execute write operations (PM2 restart, systemctl, file modifications) directly on servers. The devops subagent must provide commands for the user to execute, not attempt to run them via SSH.
Command Delegation Workflow
When troubleshooting or making changes to production/test servers:
| Phase | Actor | Action |
|---|---|---|
| Diagnose | Claude | Provide read-only diagnostic commands |
| Report | User | Execute commands, share output with Claude |
| Analyze | Claude | Interpret results, identify root cause |
| Fix | Claude | Provide 1-3 fix commands (never more, because an error in one command can cascade into the next) |
| Execute | User | Run fix commands, report results |
| Verify | Claude | Provide verification commands to confirm success |
| Document | Claude | Update relevant documentation with findings and resolutions |
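The limits of the Fix phase can be made concrete with a small check. The TypeScript sketch below is illustrative only: `checkFixBatch`, `countSteps`, and `MAX_FIX_COMMANDS` are hypothetical names that do not exist in the project, and the step splitting is deliberately naive (it ignores shell quoting).

```typescript
// Hypothetical helper, not part of the project: checks a proposed batch of
// fix commands against the delegation rules described in the table above.
const MAX_FIX_COMMANDS = 3;

interface BatchCheck {
  ok: boolean;
  problems: string[];
}

// Count the individual steps hidden behind shell chaining (&&, ||, ;).
// Naive on purpose: it does not understand quoting.
function countSteps(command: string): number {
  return command
    .split(/&&|\|\||;/)
    .filter((part) => part.trim().length > 0).length;
}

function checkFixBatch(commands: string[]): BatchCheck {
  const problems: string[] = [];
  const steps = commands.reduce((sum, command) => sum + countSteps(command), 0);

  if (steps === 0) {
    problems.push('batch contains no commands');
  }
  if (steps > MAX_FIX_COMMANDS) {
    problems.push(`batch runs ${steps} steps; the limit is ${MAX_FIX_COMMANDS}`);
  }
  for (const command of commands) {
    // Server access is read-only, so fixes are handed to the user, not run over SSH.
    if (/^\s*ssh\s/.test(command)) {
      problems.push(`runs over SSH instead of being handed to the user: ${command}`);
    }
  }
  return { ok: problems.length === 0, problems };
}

console.log(checkFixBatch(['pm2 restart flyer-crawler-api']));
console.log(
  checkFixBatch(['pm2 stop all && rm -rf node_modules && npm install && pm2 start all']),
);
```

The first call passes; the second is rejected because the chained line contains four steps.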
Example: PM2 Process Issue
Step 1 - Diagnostic Commands (Claude provides, user runs):
# Check PM2 process status
pm2 list
# View recent error logs
pm2 logs flyer-crawler-api --err --lines 50
# Check system resources
free -h
df -h /var/www
Step 2 - User reports output to Claude
Step 3 - Fix Commands (Claude provides 1-3 at a time):
# Restart the failing process
pm2 restart flyer-crawler-api
Step 4 - User executes and reports result
Step 5 - Verification Commands:
# Confirm process is running
pm2 list
# Test API health
curl -s https://flyer-crawler.projectium.com/api/health/ready | jq .
What NOT to Do
# WRONG - Claude cannot execute this directly
ssh root@projectium.com "pm2 restart all"
# WRONG - Providing too many commands at once
pm2 stop all && rm -rf node_modules && npm install && pm2 start all
# WRONG - Assuming commands succeeded without user confirmation
The devops Subagent
When to Use
Use the devops subagent when you need to:
- Debug container issues in development
- Modify CI/CD pipelines
- Configure PM2 for production
- Update deployment workflows
- Troubleshoot service startup issues
- Configure NGINX or reverse proxy
- Set up SSL/TLS certificates
What devops Knows
The devops subagent understands:
- Podman/Docker container management
- Dev container configuration (.devcontainer/)
- Compose files (compose.dev.yml)
- PM2 ecosystem configuration
- Gitea Actions CI/CD workflows
- NGINX configuration
- Systemd service management
Development Environment
Container Architecture:
┌─────────────────────────────────────────────────────────────┐
│                   Development Environment                   │
├─────────────────────────────────────────────────────────────┤
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐      │
│  │     app     │    │  postgres   │    │    redis    │      │
│  │  (Node.js)  │───►│  (PostGIS)  │    │   (Cache)   │      │
│  │             │───►│             │    │             │      │
│  └─────────────┘    └─────────────┘    └─────────────┘      │
│    :3000/:3001           :5432              :6379           │
└─────────────────────────────────────────────────────────────┘
Container Services:
| Service | Image | Purpose | Port |
|---|---|---|---|
| app | Custom (Dockerfile.dev) | Node.js application | 3000, 3001 |
| postgres | postgis/postgis:15-3.4 | Database with PostGIS | 5432 |
| redis | redis:alpine | Caching and job queues | 6379 |
Example Requests
Container debugging:
"Use devops to debug why the dev container fails to start.
The postgres service shows as unhealthy and the app can't connect."
CI/CD pipeline update:
"Use devops to add a step to the deploy-to-test.yml workflow
that runs database migrations before restarting the app."
PM2 configuration:
"Use devops to update the PM2 ecosystem config to use cluster
mode with 4 instances instead of max for the API server."
Container Commands Reference
# Start development environment
podman-compose -f compose.dev.yml up -d
# View container logs
podman-compose -f compose.dev.yml logs -f app
# Restart specific service
podman-compose -f compose.dev.yml restart app
# Rebuild container (after Dockerfile changes)
podman-compose -f compose.dev.yml build app
# Reset everything
podman-compose -f compose.dev.yml down -v
podman-compose -f compose.dev.yml up -d --build
# Enter container shell
podman exec -it flyer-crawler-dev bash
# Run tests in container (from Windows)
podman exec -it flyer-crawler-dev npm run test:unit
Git Bash Path Conversion (Windows)
When running commands from Git Bash on Windows, MSYS may rewrite arguments that look like Unix absolute paths (for example /usr/local/bin/script.sh) into Windows paths before podman receives them. Use one of these workarounds:
| Solution | Example |
|---|---|
| sh -c with single quotes | podman exec container sh -c '/usr/local/bin/script.sh' |
| Double slashes | podman exec container //usr//local//bin//script.sh |
| MSYS_NO_PATHCONV=1 | MSYS_NO_PATHCONV=1 podman exec ... |
PM2 Production Configuration
ecosystem.config.cjs Structure:
module.exports = {
  apps: [
    {
      name: 'flyer-crawler-api',
      script: './node_modules/.bin/tsx',
      args: 'server.ts',
      instances: 'max', // Use all CPU cores
      exec_mode: 'cluster', // Enable cluster mode
      max_memory_restart: '500M',
      kill_timeout: 5000, // Graceful shutdown
      env_production: {
        NODE_ENV: 'production',
        cwd: '/var/www/flyer-crawler.projectium.com',
      },
    },
    {
      name: 'flyer-crawler-worker',
      script: './node_modules/.bin/tsx',
      args: 'src/services/worker.ts',
      instances: 1, // Single instance for workers
      max_memory_restart: '1G',
      kill_timeout: 10000, // Workers need more time
    },
  ],
};
PM2 Commands:
# Start/reload with environment
pm2 startOrReload ecosystem.config.cjs --env production --update-env
# Save process list
pm2 save
# View logs
pm2 logs flyer-crawler-api --lines 50
# Monitor processes
pm2 monit
# Describe process
pm2 describe flyer-crawler-api
CI/CD Workflow Files
| File | Purpose |
|---|---|
| .gitea/workflows/deploy-to-prod.yml | Production deployment |
| .gitea/workflows/deploy-to-test.yml | Test environment deployment |
Deployment Flow:
- Push to main branch
- Gitea Actions triggered
- SSH to production server
- Pull latest code
- Install dependencies
- Run build
- Run migrations
- Restart PM2 processes
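The ordering above matters because each step depends on the previous one succeeding. The following is a minimal TypeScript sketch of that fail-fast sequencing, for illustration only: the actual pipeline lives in the Gitea workflow files, `DeployStep` and `runDeployment` are hypothetical names, and the step bodies are stubs.

```typescript
// Hypothetical model of the deployment flow; step names mirror the list above.
interface DeployStep {
  name: string;
  run: () => boolean; // stub: returns true on success
}

interface StepResult {
  name: string;
  ok: boolean;
}

// Run steps in order and stop at the first failure, so PM2 is never
// restarted on top of a failed build or migration.
function runDeployment(steps: DeployStep[]): StepResult[] {
  const results: StepResult[] = [];
  for (const step of steps) {
    const ok = step.run();
    results.push({ name: step.name, ok });
    if (!ok) break;
  }
  return results;
}

const exampleSteps: DeployStep[] = [
  { name: 'Pull latest code', run: () => true },
  { name: 'Install dependencies', run: () => true },
  { name: 'Run build', run: () => true },
  { name: 'Run migrations', run: () => false }, // simulated failure
  { name: 'Restart PM2 processes', run: () => true },
];

console.log(
  runDeployment(exampleSteps).map((r) => `${r.name}: ${r.ok ? 'ok' : 'FAILED'}`),
);
```

With the simulated migration failure, the restart step never runs and the previous processes keep serving traffic.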
Directory Structure (Production)
/var/www/
├── flyer-crawler.projectium.com/ # Production
│ ├── server.ts
│ ├── ecosystem.config.cjs
│ ├── package.json
│ ├── flyer-images/
│ │ ├── icons/
│ │ └── archive/
│ └── logs/
│ └── app.log
└── flyer-crawler-test.projectium.com/ # Test environment
└── ... (same structure)
The infra-architect Subagent
When to Use
Use the infra-architect subagent when you need to:
- Analyze resource usage and optimize
- Plan for scaling
- Reduce infrastructure costs
- Configure memory limits
- Analyze disk usage
- Plan capacity for growth
What infra-architect Knows
The infra-architect subagent understands:
- Node.js memory management
- PostgreSQL resource tuning
- Redis memory configuration
- Container resource limits
- PM2 process monitoring
- Disk and storage management
Example Requests
Memory optimization:
"Use infra-architect to analyze memory usage of the worker
processes. They're frequently hitting the 1GB limit and restarting."
Capacity planning:
"Use infra-architect to estimate resource requirements for
handling 10x current traffic. Include database, Redis, and
application server recommendations."
Cost optimization:
"Use infra-architect to identify opportunities to reduce
infrastructure costs without impacting performance."
Resource Limits Reference
| Process | Memory Limit | Notes |
|---|---|---|
| API Server | 500MB | Per cluster instance |
| Worker | 1GB | Single instance |
| Analytics Worker | 1GB | Single instance |
| PostgreSQL | System RAM | Tune shared_buffers |
| Redis | 256MB | maxmemory setting |
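For capacity planning, the limits above can be summed into a rough worst-case memory budget. This TypeScript sketch assumes a 4-core host (so `instances: 'max'` yields four API instances) and treats 1GB as 1024 MB; `ProcessLimit` and `totalMemoryBudgetMb` are hypothetical names. PostgreSQL is left out because the table bounds it only by system RAM. PM2's max_memory_restart is a restart threshold rather than a hard cap, so real usage can briefly exceed these figures.

```typescript
// Hypothetical helper for a back-of-the-envelope memory budget.
interface ProcessLimit {
  name: string;
  limitMb: number;
  instances: number;
}

function totalMemoryBudgetMb(limits: ProcessLimit[]): number {
  return limits.reduce((sum, p) => sum + p.limitMb * p.instances, 0);
}

// Assumption: a 4-core host, so the clustered API runs 4 instances.
const assumedCpuCores = 4;

const processLimits: ProcessLimit[] = [
  { name: 'API Server', limitMb: 500, instances: assumedCpuCores },
  { name: 'Worker', limitMb: 1024, instances: 1 },
  { name: 'Analytics Worker', limitMb: 1024, instances: 1 },
  { name: 'Redis', limitMb: 256, instances: 1 },
];

console.log(`${totalMemoryBudgetMb(processLimits)} MB`); // 4304 MB
```

Whatever RAM remains after this budget is what is available for PostgreSQL and the operating system.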
The bg-worker Subagent
When to Use
Use the bg-worker subagent when you need to:
- Debug BullMQ queue issues
- Add new background job types
- Configure job retry logic
- Analyze job processing failures
- Optimize worker performance
- Handle job timeouts
What bg-worker Knows
The bg-worker subagent understands:
- BullMQ queue patterns
- PM2 worker configuration
- Job retry and backoff strategies
- Queue monitoring and debugging
- Redis connection for queues
- Worker health checks (ADR-053)
Queue Architecture
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   API Server    │───►│ Redis (BullMQ)  │◄───│     Worker      │
│                 │    │                 │    │                 │
│   queue.add()   │    │   flyerQueue    │    │  process jobs   │
│                 │    │  cleanupQueue   │    │                 │
└─────────────────┘    │ analyticsQueue  │    └─────────────────┘
                       └─────────────────┘
Example Requests
Debugging stuck jobs:
"Use bg-worker to debug why jobs are stuck in the flyer processing
queue. Check for failed jobs, worker status, and Redis connectivity."
Adding retry logic:
"Use bg-worker to add exponential backoff retry logic to the
AI extraction job. It should retry up to 3 times with increasing
delays for rate limit errors."
Queue monitoring:
"Use bg-worker to add health check endpoints for monitoring
queue depth and worker status."
Queue Configuration
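// src/services/queues.server.ts
import { Queue } from 'bullmq';
// redisConnection is assumed to be defined or imported elsewhere in this module.
export const flyerQueue = new Queue('flyer-processing', {
  connection: redisConnection,
  defaultJobOptions: {
    attempts: 3, // Total tries, including the first
    backoff: {
      type: 'exponential',
      delay: 1000, // Base delay in milliseconds
    },
    removeOnComplete: { count: 100 }, // Keep the last 100 completed jobs
    removeOnFail: { count: 1000 }, // Keep the last 1000 failed jobs
  },
});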
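To see what these retry settings mean in practice, the sketch below computes the wait before each retry. It assumes BullMQ's built-in exponential strategy waits 2^(attemptsMade - 1) × delay milliseconds, which is how the BullMQ documentation describes it; `exponentialBackoffDelays` is a hypothetical helper and does not call BullMQ.

```typescript
// Hypothetical helper: lists the waits between tries for an exponential backoff,
// assuming the formula 2^(attemptsMade - 1) * baseDelayMs.
function exponentialBackoffDelays(attempts: number, baseDelayMs: number): number[] {
  const delays: number[] = [];
  // The first try runs immediately; only tries 2..attempts are preceded by a wait.
  for (let attemptsMade = 1; attemptsMade < attempts; attemptsMade++) {
    delays.push(Math.round(Math.pow(2, attemptsMade - 1) * baseDelayMs));
  }
  return delays;
}

console.log(exponentialBackoffDelays(3, 1000)); // [1000, 2000]
```

Under that assumption, `attempts: 3` with `delay: 1000` gives two retries, after roughly 1 s and 2 s, before the job is moved to the failed set.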
Worker Configuration
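// src/services/workers.server.ts
import { Worker } from 'bullmq';
// redisConnection is assumed to be defined or imported elsewhere in this module.
export const flyerWorker = new Worker(
  'flyer-processing',
  async (job) => {
    // Process job
  },
  {
    connection: redisConnection,
    concurrency: 5, // Up to 5 jobs processed in parallel
    limiter: {
      max: 10, // At most 10 jobs...
      duration: 1000, // ...per 1000 ms
    },
  },
);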
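The `concurrency` and `limiter` settings each cap throughput in a different way, and the lower cap wins. The sketch below is a rough estimate, not a BullMQ API: `maxJobsPerSecond` is a hypothetical helper, and the average job duration is an assumed input that would have to be measured.

```typescript
// Hypothetical estimate of the upper bound on jobs processed per second.
function maxJobsPerSecond(
  concurrency: number,
  limiterMax: number,
  limiterDurationMs: number,
  avgJobMs: number, // assumption: measured average job duration
): number {
  // Cap imposed by the rate limiter: max jobs per duration window.
  const limiterRate = limiterMax / (limiterDurationMs / 1000);
  // Cap imposed by concurrency: each slot finishes one job every avgJobMs.
  const concurrencyRate = concurrency / (avgJobMs / 1000);
  return Math.min(limiterRate, concurrencyRate);
}

// Slow jobs (2 s each): concurrency is the bottleneck.
console.log(maxJobsPerSecond(5, 10, 1000, 2000)); // 2.5
// Fast jobs (100 ms each): the limiter is the bottleneck.
console.log(maxJobsPerSecond(5, 10, 1000, 100)); // 10
```

If jobs are slow, raising `limiter.max` changes nothing; if jobs are fast, raising `concurrency` changes nothing.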
Monitoring Queues
# Check queue status via Redis
redis-cli -a $REDIS_PASSWORD
> KEYS bull:*
> LLEN bull:flyer-processing:wait
> ZRANGE bull:flyer-processing:failed 0 -1
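The key names above follow a `prefix:queue:state` pattern, and the Redis command to use depends on the data type behind each state. The sketch below assumes BullMQ's default `bull` prefix and that wait and active are lists while delayed, completed, and failed are sorted sets; `countCommand` and `COUNT_COMMAND` are hypothetical names, and the function only builds the command text without connecting to Redis.

```typescript
type QueueState = 'wait' | 'active' | 'delayed' | 'completed' | 'failed';

// Assumption: lists are counted with LLEN, sorted sets with ZCARD.
const COUNT_COMMAND: Record<QueueState, 'LLEN' | 'ZCARD'> = {
  wait: 'LLEN',
  active: 'LLEN',
  delayed: 'ZCARD',
  completed: 'ZCARD',
  failed: 'ZCARD',
};

// Build the redis-cli command that counts the jobs in one state of a queue.
function countCommand(queueName: string, state: QueueState, prefix = 'bull'): string {
  return `${COUNT_COMMAND[state]} ${prefix}:${queueName}:${state}`;
}

console.log(countCommand('flyer-processing', 'wait')); // LLEN bull:flyer-processing:wait
console.log(countCommand('flyer-processing', 'failed')); // ZCARD bull:flyer-processing:failed
```

Counting with ZCARD is cheaper than listing every failed job with ZRANGE when only the queue depth is needed.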
Service Management Commands
Note: These commands are for the user to execute on the server. Claude Code provides these commands but cannot run them directly due to read-only server access. See Server Access Model above.
PM2 Commands
# Start/reload
pm2 startOrReload ecosystem.config.cjs --env production --update-env && pm2 save
# View status
pm2 list
pm2 status
# View logs
pm2 logs
pm2 logs flyer-crawler-api --lines 100
# Restart specific process
pm2 restart flyer-crawler-api
pm2 restart flyer-crawler-worker
# Stop all
pm2 stop all
# Delete all
pm2 delete all
Systemd Services (Production)
| Service | Command |
|---|---|
| PostgreSQL | `sudo systemctl {start\|stop\|status} postgresql` |
| Redis | `sudo systemctl {start\|stop\|status} redis-server` |
| NGINX | `sudo systemctl {start\|stop\|status} nginx` |
| Bugsink | `sudo systemctl {start\|stop\|status} gunicorn-bugsink` |
| Logstash | `sudo systemctl {start\|stop\|status} logstash` |
Health Checks
# API health check
curl http://localhost:3001/api/health
# PM2 health
pm2 list
# PostgreSQL health
pg_isready -h localhost -p 5432
# Redis health
redis-cli -a $REDIS_PASSWORD ping
Troubleshooting Guide
Container Won't Start
- Check container logs: podman-compose -f compose.dev.yml logs app
- Verify services are healthy: podman-compose -f compose.dev.yml ps
- Check environment variables in compose.dev.yml
- Try rebuilding: podman-compose -f compose.dev.yml build --no-cache app
Tests Fail in Container but Pass Locally
Tests must run in the Linux container environment:
# Wrong (Windows)
npm test
# Correct (in container)
podman exec -it flyer-crawler-dev npm test
PM2 Process Keeps Restarting
- Check logs: pm2 logs <process-name>
- Check memory usage: pm2 monit
- Verify environment variables: pm2 env <process-id>
- Check for unhandled errors in application code
Database Connection Refused
- Verify PostgreSQL is running
- Check connection string in environment
- Verify database user has permissions
- Check pg_hba.conf for allowed connections
Redis Connection Issues
- Verify Redis is running: redis-cli ping
- Check password in environment variables
- Verify Redis is listening on expected port
- Check maxmemory setting if queue operations fail
Related Documentation
- OVERVIEW.md - Subagent system overview
- ../BARE-METAL-SETUP.md - Production setup guide
- ../adr/0014-containerization-and-deployment-strategy.md - Containerization ADR
- ../adr/0006-background-job-processing-and-task-queues.md - Background jobs ADR
- ../adr/0017-ci-cd-and-branching-strategy.md - CI/CD strategy
- ../adr/0053-worker-health-checks.md - Worker health checks