flyer-crawler.projectium.com/docs/subagents/DEVOPS-GUIDE.md
Torben Sorensen 4f06698dfd
test fixes and doc work
2026-01-28 15:33:48 -08:00

DevOps Subagent Guide

This guide covers DevOps-related subagents for deployment, infrastructure, and operations:

  • devops: Containers, services, CI/CD pipelines, deployments
  • infra-architect: Resource optimization, capacity planning
  • bg-worker: Background jobs, PM2 workers, BullMQ queues

CRITICAL: Server Access Model

Claude Code has READ-ONLY access to production/test servers.

The claude-win10 user cannot execute write operations (PM2 restart, systemctl, file modifications) directly on servers. The devops subagent must provide commands for the user to execute, not attempt to run them via SSH.

Command Delegation Workflow

When troubleshooting or making changes to production/test servers:

| Phase    | Actor  | Action                                                        |
|----------|--------|---------------------------------------------------------------|
| Diagnose | Claude | Provide read-only diagnostic commands                         |
| Report   | User   | Execute commands, share output with Claude                    |
| Analyze  | Claude | Interpret results, identify root cause                        |
| Fix      | Claude | Provide 1-3 fix commands (never more, errors may cascade)     |
| Execute  | User   | Run fix commands, report results                              |
| Verify   | Claude | Provide verification commands to confirm success              |
| Document | Claude | Update relevant documentation with findings and resolutions   |

Example: PM2 Process Issue

Step 1 - Diagnostic Commands (Claude provides, user runs):

# Check PM2 process status
pm2 list

# View recent error logs
pm2 logs flyer-crawler-api --err --lines 50

# Check system resources
free -h
df -h /var/www

Step 2 - User reports output to Claude

Step 3 - Fix Commands (Claude provides 1-3 at a time):

# Restart the failing process
pm2 restart flyer-crawler-api

Step 4 - User executes and reports result

Step 5 - Verification Commands:

# Confirm process is running
pm2 list

# Test API health
curl -s https://flyer-crawler.projectium.com/api/health/ready | jq .

What NOT to Do

# WRONG - Claude cannot execute this directly
ssh root@projectium.com "pm2 restart all"

# WRONG - Providing too many commands at once
pm2 stop all && rm -rf node_modules && npm install && pm2 start all

# WRONG - Assuming commands succeeded without user confirmation

The devops Subagent

When to Use

Use the devops subagent when you need to:

  • Debug container issues in development
  • Modify CI/CD pipelines
  • Configure PM2 for production
  • Update deployment workflows
  • Troubleshoot service startup issues
  • Configure NGINX or reverse proxy
  • Set up SSL/TLS certificates

What devops Knows

The devops subagent understands:

  • Podman/Docker container management
  • Dev container configuration (.devcontainer/)
  • Compose files (compose.dev.yml)
  • PM2 ecosystem configuration
  • Gitea Actions CI/CD workflows
  • NGINX configuration
  • Systemd service management

Development Environment

Container Architecture:

┌─────────────────────────────────────────────────────────────┐
│                    Development Environment                   │
├─────────────────────────────────────────────────────────────┤
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐     │
│  │     app     │    │   postgres  │    │    redis    │     │
│  │  (Node.js)  │───►│  (PostGIS)  │    │   (Cache)   │     │
│  │             │───►│             │    │             │     │
│  └─────────────┘    └─────────────┘    └─────────────┘     │
│     :3000/:3001         :5432              :6379           │
└─────────────────────────────────────────────────────────────┘

Container Services:

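| Service  | Image                   | Purpose                | Port       |
|----------|-------------------------|------------------------|------------|
| app      | Custom (Dockerfile.dev) | Node.js application    | 3000, 3001 |
| postgres | postgis/postgis:15-3.4  | Database with PostGIS  | 5432       |
| redis    | redis:alpine            | Caching and job queues | 6379       |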

Example Requests

Container debugging:

"Use devops to debug why the dev container fails to start.
The postgres service shows as unhealthy and the app can't connect."

CI/CD pipeline update:

"Use devops to add a step to the deploy-to-test.yml workflow
that runs database migrations before restarting the app."

PM2 configuration:

"Use devops to update the PM2 ecosystem config to use cluster
mode with 4 instances instead of max for the API server."

Container Commands Reference

# Start development environment
podman-compose -f compose.dev.yml up -d

# View container logs
podman-compose -f compose.dev.yml logs -f app

# Restart specific service
podman-compose -f compose.dev.yml restart app

# Rebuild container (after Dockerfile changes)
podman-compose -f compose.dev.yml build app

# Reset everything
podman-compose -f compose.dev.yml down -v
podman-compose -f compose.dev.yml up -d --build

# Enter container shell
podman exec -it flyer-crawler-dev bash

# Run tests in container (from Windows)
podman exec -it flyer-crawler-dev npm run test:unit

Git Bash Path Conversion (Windows)

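When running commands from Git Bash on Windows, the MSYS layer may rewrite POSIX-style paths such as /usr/local/bin into Windows paths before the command reaches the container. Any of the following workarounds prevents the conversion: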

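| Solution                 | Example                                                |
|--------------------------|--------------------------------------------------------|
| sh -c with single quotes | podman exec container sh -c '/usr/local/bin/script.sh' |
| Double slashes           | podman exec container //usr//local//bin//script.sh     |
| MSYS_NO_PATHCONV=1       | MSYS_NO_PATHCONV=1 podman exec ...                     |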
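
For scripts that assemble `podman exec` commands, the double-slash workaround can be applied mechanically. This is a minimal sketch; `toDoubleSlashPath` is a hypothetical helper, not something the project provides.

```typescript
// Hypothetical helper: rewrites an absolute POSIX path into the double-slash
// form shown in the table, e.g. /usr/local/bin -> //usr//local//bin.
function toDoubleSlashPath(posixPath: string): string {
  if (posixPath.charAt(0) !== "/") {
    return posixPath; // this sketch leaves non-absolute paths untouched
  }
  // Collapse each run of slashes into exactly two, so the call is idempotent.
  return posixPath.replace(/\/+/g, "//");
}
```

Because runs of slashes are collapsed before doubling, applying the helper to an already converted path returns it unchanged.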

PM2 Production Configuration

ecosystem.config.cjs Structure:

module.exports = {
  apps: [
    {
      name: 'flyer-crawler-api',
      script: './node_modules/.bin/tsx',
      args: 'server.ts',
      instances: 'max', // Use all CPU cores
      exec_mode: 'cluster', // Enable cluster mode
      max_memory_restart: '500M',
      kill_timeout: 5000, // Graceful shutdown

      env_production: {
        NODE_ENV: 'production',
        // Note: PM2 documents cwd as an app-level option; verify it takes effect when nested here
        cwd: '/var/www/flyer-crawler.projectium.com',
      },
    },
    {
      name: 'flyer-crawler-worker',
      script: './node_modules/.bin/tsx',
      args: 'src/services/worker.ts',
      instances: 1, // Single instance for workers
      max_memory_restart: '1G',
      kill_timeout: 10000, // Workers need more time
    },
  ],
};

PM2 Commands:

# Start/reload with environment
pm2 startOrReload ecosystem.config.cjs --env production --update-env

# Save process list
pm2 save

# View logs
pm2 logs flyer-crawler-api --lines 50

# Monitor processes
pm2 monit

# Describe process
pm2 describe flyer-crawler-api

CI/CD Workflow Files

| File                                | Purpose                     |
|-------------------------------------|-----------------------------|
| .gitea/workflows/deploy-to-prod.yml | Production deployment       |
| .gitea/workflows/deploy-to-test.yml | Test environment deployment |

Deployment Flow:

  1. Push to main branch
  2. Gitea Actions triggered
  3. SSH to production server
  4. Pull latest code
  5. Install dependencies
  6. Run build
  7. Run migrations
  8. Restart PM2 processes
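
The order of the server-side steps matters: a failed build or migration should stop the deployment before PM2 is restarted. The sketch below illustrates that fail-fast ordering under the assumption that the workflow should halt at the first failing step. `runPipeline`, `StepRunner`, and `DEPLOY_STEPS` are hypothetical names, and the runner is injected so that no real command from the workflow files is assumed.

```typescript
// Hypothetical model of steps 4-8 of the deployment flow. The actual commands
// live in the Gitea workflow files and are deliberately not reproduced here.
type StepRunner = (step: string) => boolean; // returns true when the step succeeds

interface PipelineResult {
  completed: string[];
  failedAt: string | null;
}

const DEPLOY_STEPS: string[] = [
  "pull latest code",
  "install dependencies",
  "run build",
  "run migrations",
  "restart PM2 processes",
];

// Run steps in order and stop at the first failure, so later steps
// (such as the PM2 restart) never run against a broken build.
function runPipeline(steps: string[], run: StepRunner): PipelineResult {
  const completed: string[] = [];
  for (const step of steps) {
    if (!run(step)) {
      return { completed, failedAt: step };
    }
    completed.push(step);
  }
  return { completed, failedAt: null };
}
```

With a runner that fails on "run migrations", the result lists the first three steps as completed and the restart step is never invoked.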

Directory Structure (Production)

/var/www/
├── flyer-crawler.projectium.com/          # Production
│   ├── server.ts
│   ├── ecosystem.config.cjs
│   ├── package.json
│   ├── flyer-images/
│   │   ├── icons/
│   │   └── archive/
│   └── logs/
│       └── app.log
└── flyer-crawler-test.projectium.com/     # Test environment
    └── ... (same structure)

The infra-architect Subagent

When to Use

Use the infra-architect subagent when you need to:

  • Analyze resource usage and optimize
  • Plan for scaling
  • Reduce infrastructure costs
  • Configure memory limits
  • Analyze disk usage
  • Plan capacity for growth

What infra-architect Knows

The infra-architect subagent understands:

  • Node.js memory management
  • PostgreSQL resource tuning
  • Redis memory configuration
  • Container resource limits
  • PM2 process monitoring
  • Disk and storage management

Example Requests

Memory optimization:

"Use infra-architect to analyze memory usage of the worker
processes. They're frequently hitting the 1GB limit and restarting."

Capacity planning:

"Use infra-architect to estimate resource requirements for
handling 10x current traffic. Include database, Redis, and
application server recommendations."

Cost optimization:

"Use infra-architect to identify opportunities to reduce
infrastructure costs without impacting performance."

Resource Limits Reference

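| Process          | Memory Limit | Notes                |
|------------------|--------------|----------------------|
| API Server       | 500MB        | Per cluster instance |
| Worker           | 1GB          | Single instance      |
| Analytics Worker | 1GB          | Single instance      |
| PostgreSQL       | System RAM   | Tune shared_buffers  |
| Redis            | 256MB        | maxmemory setting    |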
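
For capacity planning, the fixed limits in the table can be summed into a rough memory budget. The sketch below is illustrative only: `parseMemoryMb` and `memoryBudgetMb` are hypothetical helpers, the count of four API instances is an assumption (`instances: 'max'` depends on the host's CPU cores), and PostgreSQL is left out because the table gives it no fixed limit. PM2's `max_memory_restart` is a restart threshold rather than a hard cap, so real usage can briefly exceed the total.

```typescript
// Hypothetical helpers for a rough memory budget based on the limits table.
interface ProcessLimit {
  name: string;
  limit: string; // e.g. "500MB", "1GB", or PM2-style "500M", "1G"
  instances: number;
}

const UNIT_TO_MB: { [unit: string]: number } = { K: 1 / 1024, M: 1, G: 1024 };

// Parse "500MB", "500M", "1G" and similar strings into megabytes.
function parseMemoryMb(limit: string): number {
  const match = /^(\d+(?:\.\d+)?)\s*([KMG])B?$/i.exec(limit.trim());
  if (match === null) {
    throw new Error(`unrecognized memory limit: ${limit}`);
  }
  return parseFloat(match[1]) * UNIT_TO_MB[match[2].toUpperCase()];
}

// Sum limit x instances over all processes.
function memoryBudgetMb(processes: ProcessLimit[]): number {
  let total = 0;
  for (const p of processes) {
    total += parseMemoryMb(p.limit) * p.instances;
  }
  return total;
}

// Assumption: a 4-core host, so the API cluster runs 4 instances.
const budget = memoryBudgetMb([
  { name: "API Server", limit: "500MB", instances: 4 },
  { name: "Worker", limit: "1GB", instances: 1 },
  { name: "Analytics Worker", limit: "1GB", instances: 1 },
  { name: "Redis", limit: "256MB", instances: 1 },
]);
console.log(`Budget excluding PostgreSQL: ${budget} MB`);
```

Under these assumptions the budget comes to 4 × 500 + 1024 + 1024 + 256 = 4304 MB, before any allowance for PostgreSQL or the operating system.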

The bg-worker Subagent

When to Use

Use the bg-worker subagent when you need to:

  • Debug BullMQ queue issues
  • Add new background job types
  • Configure job retry logic
  • Analyze job processing failures
  • Optimize worker performance
  • Handle job timeouts

What bg-worker Knows

The bg-worker subagent understands:

  • BullMQ queue patterns
  • PM2 worker configuration
  • Job retry and backoff strategies
  • Queue monitoring and debugging
  • Redis connection for queues
  • Worker health checks (ADR-053)

Queue Architecture

┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   API Server    │───►│  Redis (BullMQ) │◄───│     Worker      │
│                 │    │                 │    │                 │
│  queue.add()    │    │ flyerQueue      │    │ process jobs    │
│                 │    │ cleanupQueue    │    │                 │
└─────────────────┘    │ analyticsQueue  │    └─────────────────┘
                       └─────────────────┘

Example Requests

Debugging stuck jobs:

"Use bg-worker to debug why jobs are stuck in the flyer processing
queue. Check for failed jobs, worker status, and Redis connectivity."

Adding retry logic:

"Use bg-worker to add exponential backoff retry logic to the
AI extraction job. It should retry up to 3 times with increasing
delays for rate limit errors."

Queue monitoring:

"Use bg-worker to add health check endpoints for monitoring
queue depth and worker status."

Queue Configuration

// src/services/queues.server.ts
export const flyerQueue = new Queue('flyer-processing', {
  connection: redisConnection,
  defaultJobOptions: {
    attempts: 3,
    backoff: {
      type: 'exponential',
      delay: 1000,
    },
    removeOnComplete: { count: 100 },
    removeOnFail: { count: 1000 },
  },
});
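
To see what `attempts: 3` with an exponential backoff of 1000 ms means in practice, the retry delays can be tabulated. This sketch assumes BullMQ's built-in exponential strategy computes `delay * 2^(attemptsMade - 1)` with no jitter; `exponentialBackoffDelays` is a hypothetical helper, not a BullMQ API.

```typescript
// Hypothetical helper: lists the wait before each retry, assuming the
// exponential strategy is delay * 2^(attemptsMade - 1) without jitter.
function exponentialBackoffDelays(attempts: number, baseDelayMs: number): number[] {
  const delays: number[] = [];
  // The first attempt is not delayed by backoff; only retries wait.
  for (let attemptsMade = 1; attemptsMade < attempts; attemptsMade++) {
    delays.push(Math.round(Math.pow(2, attemptsMade - 1) * baseDelayMs));
  }
  return delays;
}

console.log(exponentialBackoffDelays(3, 1000)); // [ 1000, 2000 ]
```

Under that assumption, a job that keeps failing is retried after roughly 1 s and then 2 s before it moves to the failed set.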

Worker Configuration

// src/services/workers.server.ts
export const flyerWorker = new Worker(
  'flyer-processing',
  async (job) => {
    // Process job
  },
  {
    connection: redisConnection,
    concurrency: 5,
    limiter: {
      max: 10,
      duration: 1000,
    },
  },
);
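
The `concurrency` and `limiter` settings both cap throughput, and whichever is lower wins. The following back-of-the-envelope sketch estimates the ceiling for a single worker; `maxJobsPerSecond` and the average job durations used here are assumptions for illustration, not measured values.

```typescript
// Hypothetical estimate of the throughput ceiling for one worker.
// avgJobMs is an assumed average processing time, not a measured figure.
function maxJobsPerSecond(
  concurrency: number,
  limiterMax: number,
  limiterDurationMs: number,
  avgJobMs: number,
): number {
  // Ceiling imposed by the rate limiter: max jobs per duration window.
  const limiterRate = limiterMax / (limiterDurationMs / 1000);
  // Ceiling imposed by concurrency: parallel slots divided by time per job.
  const concurrencyRate = concurrency / (avgJobMs / 1000);
  return Math.min(limiterRate, concurrencyRate);
}

// With the settings above and slow (2 s) jobs, concurrency is the bottleneck.
console.log(maxJobsPerSecond(5, 10, 1000, 2000)); // 2.5
// With fast (250 ms) jobs, the limiter caps throughput instead.
console.log(maxJobsPerSecond(5, 10, 1000, 250)); // 10
```

If the queue backs up while jobs are slow, raising `limiter.max` would not help in this model; only more concurrency or faster jobs would.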

Monitoring Queues

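# Open an interactive Redis session (quote the variable so special characters survive)
redis-cli -a "$REDIS_PASSWORD"

# List all BullMQ keys (KEYS scans the whole keyspace, so use it for debugging only)
> KEYS bull:*
# Count jobs waiting in the flyer-processing queue
> LLEN bull:flyer-processing:wait
# List the IDs of failed jobs
> ZRANGE bull:flyer-processing:failed 0 -1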
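
The commands above use `LLEN` for the wait list and `ZRANGE` for the failed set because the two keys have different Redis types. The sketch below generates a counting command per job state. It assumes the usual BullMQ key layout (`wait` and `active` as lists; `delayed`, `completed`, and `failed` as sorted sets), which may differ between BullMQ versions; `countCommand` is a hypothetical helper.

```typescript
// Hypothetical helper: builds the redis-cli command that counts jobs in a
// given state. The list/sorted-set split is an assumption about BullMQ's
// key layout and should be checked against the installed version.
type QueueState = "wait" | "active" | "delayed" | "completed" | "failed";

const LIST_STATES: QueueState[] = ["wait", "active"];

function countCommand(queueName: string, state: QueueState, prefix = "bull"): string {
  const key = `${prefix}:${queueName}:${state}`;
  const isList = LIST_STATES.indexOf(state) !== -1;
  // Lists are counted with LLEN, sorted sets with ZCARD.
  return `${isList ? "LLEN" : "ZCARD"} ${key}`;
}

console.log(countCommand("flyer-processing", "wait"));   // LLEN bull:flyer-processing:wait
console.log(countCommand("flyer-processing", "failed")); // ZCARD bull:flyer-processing:failed
```

Running `TYPE <key>` in redis-cli confirms which command applies if the layout is in doubt.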

Service Management Commands

Note: These commands are for the user to execute on the server. Claude Code provides these commands but cannot run them directly due to read-only server access. See Server Access Model above.

PM2 Commands

# Start/reload
pm2 startOrReload ecosystem.config.cjs --env production --update-env && pm2 save

# View status
pm2 list
pm2 status

# View logs
pm2 logs
pm2 logs flyer-crawler-api --lines 100

# Restart specific process
pm2 restart flyer-crawler-api
pm2 restart flyer-crawler-worker

# Stop all
pm2 stop all

# Delete all
pm2 delete all

Systemd Services (Production)

| Service    | Command                                                 |
|------------|---------------------------------------------------------|
| PostgreSQL | `sudo systemctl {start\|stop\|status} postgresql`       |
| Redis      | `sudo systemctl {start\|stop\|status} redis-server`     |
| NGINX      | `sudo systemctl {start\|stop\|status} nginx`            |
| Bugsink    | `sudo systemctl {start\|stop\|status} gunicorn-bugsink` |
| Logstash   | `sudo systemctl {start\|stop\|status} logstash`         |

Health Checks

# API health check
curl http://localhost:3001/api/health

# PM2 health
pm2 list

# PostgreSQL health
pg_isready -h localhost -p 5432

# Redis health
redis-cli -a "$REDIS_PASSWORD" ping
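
When the user reports the results of these checks, the outputs can be interpreted mechanically. The sketch below maps `pg_isready` exit codes (0 accepting, 1 rejecting, 2 no response, 3 no attempt, per the PostgreSQL documentation) and the `PONG` reply from Redis to simple statuses. `interpretPgIsReady` and `isRedisHealthy` are hypothetical helper names.

```typescript
// Hypothetical helpers for interpreting health-check output reported by the user.
type PgStatus = "accepting" | "rejecting" | "no-response" | "no-attempt" | "unknown";

// Map the documented pg_isready exit codes to a readable status.
function interpretPgIsReady(exitCode: number): PgStatus {
  switch (exitCode) {
    case 0:
      return "accepting"; // server is accepting connections
    case 1:
      return "rejecting"; // server is up but rejecting connections (e.g. starting)
    case 2:
      return "no-response"; // no response to the connection attempt
    case 3:
      return "no-attempt"; // no attempt made (e.g. invalid parameters)
    default:
      return "unknown";
  }
}

// redis-cli ping prints PONG on stdout when the server answers.
function isRedisHealthy(pingOutput: string): boolean {
  return pingOutput.trim() === "PONG";
}
```

An exit code of 2 from `pg_isready` usually points at the service being down or unreachable, which lines up with the first steps under Database Connection Refused.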

Troubleshooting Guide

Container Won't Start

  1. Check container logs: podman-compose -f compose.dev.yml logs app
  2. Verify services are healthy: podman-compose -f compose.dev.yml ps
  3. Check environment variables in compose.dev.yml
  4. Try rebuilding: podman-compose -f compose.dev.yml build --no-cache app

Tests Fail in Container but Pass Locally

The Linux container is the reference environment for tests. Run them there rather than on the Windows host:

# Wrong (Windows)
npm test

# Correct (in container)
podman exec -it flyer-crawler-dev npm test

PM2 Process Keeps Restarting

  1. Check logs: pm2 logs <process-name>
  2. Check memory usage: pm2 monit
  3. Verify environment variables: pm2 env <process-id>
  4. Check for unhandled errors in application code

Database Connection Refused

  1. Verify PostgreSQL is running: pg_isready -h localhost -p 5432
  2. Check connection string in environment
  3. Verify database user has permissions
  4. Check pg_hba.conf for allowed connections

Redis Connection Issues

  1. Verify Redis is running: redis-cli -a "$REDIS_PASSWORD" ping (expect PONG)
  2. Check password in environment variables
  3. Verify Redis is listening on expected port
  4. Check maxmemory setting if queue operations fail