DevOps Subagent Guide
This guide covers DevOps-related subagents for deployment, infrastructure, and operations:
- devops: Containers, services, CI/CD pipelines, deployments
- infra-architect: Resource optimization, capacity planning
- bg-worker: Background jobs, PM2 workers, BullMQ queues
The devops Subagent
When to Use
Use the devops subagent when you need to:
- Debug container issues in development
- Modify CI/CD pipelines
- Configure PM2 for production
- Update deployment workflows
- Troubleshoot service startup issues
- Configure NGINX or reverse proxy
- Set up SSL/TLS certificates
What devops Knows
The devops subagent understands:
- Podman/Docker container management
- Dev container configuration (.devcontainer/)
- Compose files (compose.dev.yml)
- PM2 ecosystem configuration
- Gitea Actions CI/CD workflows
- NGINX configuration
- Systemd service management
Development Environment
Container Architecture:
┌─────────────────────────────────────────────────────────────┐
│                   Development Environment                   │
├─────────────────────────────────────────────────────────────┤
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐      │
│  │     app     │    │  postgres   │    │    redis    │      │
│  │  (Node.js)  │───►│  (PostGIS)  │    │   (Cache)   │      │
│  │             │───►│             │    │             │      │
│  └─────────────┘    └─────────────┘    └─────────────┘      │
│    :3000/:3001           :5432              :6379           │
└─────────────────────────────────────────────────────────────┘
Container Services:
| Service | Image | Purpose | Port |
|---|---|---|---|
| app | Custom (Dockerfile.dev) | Node.js application | 3000, 3001 |
| postgres | postgis/postgis:15-3.4 | Database with PostGIS | 5432 |
| redis | redis:alpine | Caching and job queues | 6379 |
Example Requests
Container debugging:
"Use devops to debug why the dev container fails to start.
The postgres service shows as unhealthy and the app can't connect."
CI/CD pipeline update:
"Use devops to add a step to the deploy-to-test.yml workflow
that runs database migrations before restarting the app."
PM2 configuration:
"Use devops to update the PM2 ecosystem config to use cluster
mode with 4 instances instead of max for the API server."
Container Commands Reference
# Start development environment
podman-compose -f compose.dev.yml up -d
# View container logs
podman-compose -f compose.dev.yml logs -f app
# Restart specific service
podman-compose -f compose.dev.yml restart app
# Rebuild container (after Dockerfile changes)
podman-compose -f compose.dev.yml build app
# Reset everything
podman-compose -f compose.dev.yml down -v
podman-compose -f compose.dev.yml up -d --build
# Enter container shell
podman exec -it flyer-crawler-dev bash
# Run tests in container (from Windows)
podman exec -it flyer-crawler-dev npm run test:unit
Git Bash Path Conversion (Windows)
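Git Bash on Windows rewrites arguments that look like Unix absolute paths (for example /usr/local/bin) into Windows paths before the command runs, which breaks paths meant for the container. Any of the following prevents the conversion: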
| Solution | Example |
|---|---|
| sh -c with single quotes | podman exec container sh -c '/usr/local/bin/script.sh' |
| Double slashes | podman exec container //usr//local//bin//script.sh |
| MSYS_NO_PATHCONV=1 | MSYS_NO_PATHCONV=1 podman exec ... |
PM2 Production Configuration
ecosystem.config.cjs Structure:
module.exports = {
  apps: [
    {
      name: 'flyer-crawler-api',
      script: './node_modules/.bin/tsx',
      args: 'server.ts',
      instances: 'max', // Use all CPU cores
      exec_mode: 'cluster', // Enable cluster mode
      max_memory_restart: '500M',
      kill_timeout: 5000, // Graceful shutdown
      env_production: {
        NODE_ENV: 'production',
        cwd: '/var/www/flyer-crawler.projectium.com',
      },
    },
    {
      name: 'flyer-crawler-worker',
      script: './node_modules/.bin/tsx',
      args: 'src/services/worker.ts',
      instances: 1, // Single instance for workers
      max_memory_restart: '1G',
      kill_timeout: 10000, // Workers need more time
    },
  ],
};
PM2 Commands:
# Start/reload with environment
pm2 startOrReload ecosystem.config.cjs --env production --update-env
# Save process list
pm2 save
# View logs
pm2 logs flyer-crawler-api --lines 50
# Monitor processes
pm2 monit
# Describe process
pm2 describe flyer-crawler-api
CI/CD Workflow Files
| File | Purpose |
|---|---|
| .gitea/workflows/deploy-to-prod.yml | Production deployment |
| .gitea/workflows/deploy-to-test.yml | Test environment deployment |
Deployment Flow:
- Push to main branch
- Gitea Actions triggered
- SSH to production server
- Pull latest code
- Install dependencies
- Run build
- Run migrations
- Restart PM2 processes
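The server-side part of the flow above can be sketched as an ordered list of steps that stops at the first failure. This is only an illustration under stated assumptions: the commands `git pull`, `npm ci`, `npm run build`, and `npm run db:migrate` are hypothetical placeholders, not taken from the actual workflow files; only the `pm2 startOrReload` command comes from this guide.

```typescript
// Ordered deployment steps, mirroring the flow listed above.
// All commands except the pm2 one are hypothetical placeholders.
interface DeployStep {
  name: string;
  command: string;
}

const deploySteps: DeployStep[] = [
  { name: 'pull', command: 'git pull' },
  { name: 'install', command: 'npm ci' },
  { name: 'build', command: 'npm run build' },
  { name: 'migrate', command: 'npm run db:migrate' }, // hypothetical script name
  {
    name: 'restart',
    command: 'pm2 startOrReload ecosystem.config.cjs --env production --update-env',
  },
];

interface DeployResult {
  completed: string[]; // names of the steps that succeeded
  failedAt: string | null; // name of the first failing step, if any
}

// `run` executes one command and returns its exit code (0 = success).
// It is injected so the ordering logic can be exercised without a shell.
function runDeploy(steps: DeployStep[], run: (command: string) => number): DeployResult {
  const completed: string[] = [];
  for (const step of steps) {
    if (run(step.command) !== 0) {
      // Stop immediately: later steps must not run after a failure.
      return { completed, failedAt: step.name };
    }
    completed.push(step.name);
  }
  return { completed, failedAt: null };
}
```

Stopping at the first non-zero exit code means a failed build never reaches the migration or restart steps, so the running PM2 processes keep serving the previous version.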
Directory Structure (Production)
/var/www/
├── flyer-crawler.projectium.com/        # Production
│   ├── server.ts
│   ├── ecosystem.config.cjs
│   ├── package.json
│   ├── flyer-images/
│   │   ├── icons/
│   │   └── archive/
│   └── logs/
│       └── app.log
└── flyer-crawler-test.projectium.com/   # Test environment
    └── ... (same structure)
The infra-architect Subagent
When to Use
Use the infra-architect subagent when you need to:
- Analyze resource usage and optimize
- Plan for scaling
- Reduce infrastructure costs
- Configure memory limits
- Analyze disk usage
- Plan capacity for growth
What infra-architect Knows
The infra-architect subagent understands:
- Node.js memory management
- PostgreSQL resource tuning
- Redis memory configuration
- Container resource limits
- PM2 process monitoring
- Disk and storage management
Example Requests
Memory optimization:
"Use infra-architect to analyze memory usage of the worker
processes. They're frequently hitting the 1GB limit and restarting."
Capacity planning:
"Use infra-architect to estimate resource requirements for
handling 10x current traffic. Include database, Redis, and
application server recommendations."
Cost optimization:
"Use infra-architect to identify opportunities to reduce
infrastructure costs without impacting performance."
Resource Limits Reference
| Process | Memory Limit | Notes |
|---|---|---|
| API Server | 500MB | Per cluster instance |
| Worker | 1GB | Single instance |
| Analytics Worker | 1GB | Single instance |
| PostgreSQL | System RAM | Tune shared_buffers |
| Redis | 256MB | maxmemory setting |
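For capacity planning, the limits in the table can be added up into a rough worst-case memory budget. The sketch below assumes 1GB means 1024 MB and leaves PostgreSQL out, because the table gives it no fixed limit. The API instance count is a parameter because `instances: 'max'` depends on the host's CPU count. The function name is illustrative only.

```typescript
interface ProcessBudget {
  name: string;
  limitMb: number; // per-instance limit from the table above
  instances: number;
}

// Sums the configured limits. PostgreSQL is excluded because the table
// lists it as "System RAM" rather than a fixed figure.
function worstCaseMemoryMb(apiInstances: number): number {
  const budgets: ProcessBudget[] = [
    { name: 'API Server', limitMb: 500, instances: apiInstances },
    { name: 'Worker', limitMb: 1024, instances: 1 },
    { name: 'Analytics Worker', limitMb: 1024, instances: 1 },
    { name: 'Redis', limitMb: 256, instances: 1 },
  ];
  return budgets.reduce((sum, b) => sum + b.limitMb * b.instances, 0);
}
```

With four API instances this gives 4 × 500 + 1024 + 1024 + 256 = 4304 MB before PostgreSQL and the operating system are counted. Treat the figure as an estimate: as far as I know, PM2's `max_memory_restart` is a restart threshold checked periodically rather than a hard cap, so a process can briefly exceed it.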
The bg-worker Subagent
When to Use
Use the bg-worker subagent when you need to:
- Debug BullMQ queue issues
- Add new background job types
- Configure job retry logic
- Analyze job processing failures
- Optimize worker performance
- Handle job timeouts
What bg-worker Knows
The bg-worker subagent understands:
- BullMQ queue patterns
- PM2 worker configuration
- Job retry and backoff strategies
- Queue monitoring and debugging
- Redis connection for queues
- Worker health checks (ADR-053)
Queue Architecture
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   API Server    │───►│ Redis (BullMQ)  │◄───│     Worker      │
│                 │    │                 │    │                 │
│   queue.add()   │    │   flyerQueue    │    │  process jobs   │
│                 │    │  cleanupQueue   │    │                 │
└─────────────────┘    │ analyticsQueue  │    └─────────────────┘
                       └─────────────────┘
Example Requests
Debugging stuck jobs:
"Use bg-worker to debug why jobs are stuck in the flyer processing
queue. Check for failed jobs, worker status, and Redis connectivity."
Adding retry logic:
"Use bg-worker to add exponential backoff retry logic to the
AI extraction job. It should retry up to 3 times with increasing
delays for rate limit errors."
Queue monitoring:
"Use bg-worker to add health check endpoints for monitoring
queue depth and worker status."
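A queue-depth health check like the one in the last request could start from a pure threshold function. In this sketch the type names, the function name, and the thresholds are all hypothetical. The counts are assumed to come from Redis (for example `LLEN` on the wait list and `ZCARD` on the failed set) or from BullMQ's `queue.getJobCounts(...)`.

```typescript
interface QueueCounts {
  waiting: number; // jobs not yet picked up by a worker
  failed: number; // jobs that exhausted their retries
}

interface QueueThresholds {
  maxWaiting: number; // hypothetical alerting threshold
  maxFailed: number; // hypothetical alerting threshold
}

interface QueueHealth {
  status: 'ok' | 'degraded';
  reasons: string[];
}

// Compares observed counts with thresholds and explains any breach.
function assessQueue(counts: QueueCounts, limits: QueueThresholds): QueueHealth {
  const reasons: string[] = [];
  if (counts.waiting > limits.maxWaiting) {
    reasons.push(`waiting=${counts.waiting} exceeds ${limits.maxWaiting}`);
  }
  if (counts.failed > limits.maxFailed) {
    reasons.push(`failed=${counts.failed} exceeds ${limits.maxFailed}`);
  }
  return { status: reasons.length === 0 ? 'ok' : 'degraded', reasons };
}
```

Keeping the threshold logic separate from the Redis access makes it easy to unit test and to reuse behind an HTTP health endpoint.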
Queue Configuration
// src/services/queues.server.ts
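import { Queue } from 'bullmq';

// redisConnection: the shared Redis connection options, defined elsewhere in the project
export const flyerQueue = new Queue('flyer-processing', {
  connection: redisConnection,
  defaultJobOptions: {
    attempts: 3,
    backoff: {
      type: 'exponential',
      delay: 1000,
    },
    removeOnComplete: { count: 100 },
    removeOnFail: { count: 1000 },
  },
});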
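To see what `attempts: 3` with an exponential backoff of 1000 ms means in practice, the retry schedule can be computed by hand. The sketch assumes BullMQ's built-in exponential strategy waits `delay × 2^(attemptsMade − 1)` with no jitter. Verify this against the BullMQ version in use. The helper name is illustrative.

```typescript
// Returns the wait before each retry, in milliseconds.
// Assumption: delay * 2^(attemptsMade - 1), without jitter.
function exponentialBackoffDelays(attempts: number, baseDelayMs: number): number[] {
  const delays: number[] = [];
  // A job configured with `attempts: n` is retried at most n - 1 times.
  for (let attemptsMade = 1; attemptsMade < attempts; attemptsMade++) {
    delays.push(Math.round(Math.pow(2, attemptsMade - 1) * baseDelayMs));
  }
  return delays;
}
```

Under that assumption, `exponentialBackoffDelays(3, 1000)` yields `[1000, 2000]`: the job runs once, is retried after 1 s and again after 2 s, and is then moved to the failed set.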
Worker Configuration
// src/services/workers.server.ts
export const flyerWorker = new Worker(
  'flyer-processing',
  async (job) => {
    // Process job
  },
  {
    connection: redisConnection,
    concurrency: 5,
    limiter: {
      max: 10,
      duration: 1000,
    },
  },
);
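The `concurrency` and `limiter` settings together bound throughput. As a rough sketch, assume a single worker process (as in the PM2 configuration) and a known average job duration. The sustained rate is then the smaller of the limiter rate and what the concurrency slots can complete. The function name and the average-duration parameter are illustrative.

```typescript
// Upper bound on sustained jobs per second for one worker process.
function maxJobsPerSecond(
  concurrency: number,
  limiterMax: number,
  limiterDurationMs: number,
  avgJobMs: number,
): number {
  // Rate allowed by the limiter, e.g. 10 jobs per 1000 ms = 10 jobs/s.
  const limiterRate = limiterMax / (limiterDurationMs / 1000);
  // Rate the concurrency slots can sustain, e.g. 5 slots / 2 s per job = 2.5 jobs/s.
  const concurrencyRate = concurrency / (avgJobMs / 1000);
  return Math.min(limiterRate, concurrencyRate);
}
```

With the configuration above and jobs averaging 2 s, concurrency is the bottleneck (2.5 jobs/s). With jobs averaging 250 ms, the limiter caps throughput at 10 jobs/s.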
Monitoring Queues
# Check queue status via Redis
redis-cli -a $REDIS_PASSWORD
> KEYS bull:*
> LLEN bull:flyer-processing:wait
> ZRANGE bull:flyer-processing:failed 0 -1
Service Management Commands
PM2 Commands
# Start/reload
pm2 startOrReload ecosystem.config.cjs --env production --update-env && pm2 save
# View status
pm2 list
pm2 status
# View logs
pm2 logs
pm2 logs flyer-crawler-api --lines 100
# Restart specific process
pm2 restart flyer-crawler-api
pm2 restart flyer-crawler-worker
# Stop all
pm2 stop all
# Delete all
pm2 delete all
Systemd Services (Production)
| Service | Command |
|---|---|
| PostgreSQL | `sudo systemctl {start\|stop\|status} postgresql` |
| Redis | `sudo systemctl {start\|stop\|status} redis-server` |
| NGINX | `sudo systemctl {start\|stop\|status} nginx` |
| Bugsink | `sudo systemctl {start\|stop\|status} gunicorn-bugsink` |
| Logstash | `sudo systemctl {start\|stop\|status} logstash` |
Health Checks
# API health check
curl http://localhost:3001/api/health
# PM2 health
pm2 list
# PostgreSQL health
pg_isready -h localhost -p 5432
# Redis health
redis-cli -a $REDIS_PASSWORD ping
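The individual checks above can be combined into one report. The following is a minimal sketch, not the project's actual health endpoint: `Probe`, `FetchLike`, `apiProbe`, and `runHealthChecks` are hypothetical names, and each probe is expected to wrap one of the commands or URLs listed above.

```typescript
type Probe = () => Promise<boolean>;
// Minimal shape of a fetch-style function, so a fake can be injected in tests.
type FetchLike = (url: string) => Promise<{ ok: boolean }>;

interface HealthReport {
  healthy: boolean;
  results: Record<string, boolean>;
}

// Builds a probe for the API health endpoint shown above.
function apiProbe(fetchFn: FetchLike, url = 'http://localhost:3001/api/health'): Probe {
  return async () => (await fetchFn(url)).ok;
}

// Runs all probes concurrently and records one boolean per probe.
async function runHealthChecks(probes: Record<string, Probe>): Promise<HealthReport> {
  const results: Record<string, boolean> = {};
  await Promise.all(
    Object.keys(probes).map(async (name) => {
      const probe = probes[name];
      try {
        results[name] = probe !== undefined && (await probe());
      } catch {
        results[name] = false; // a throwing probe counts as unhealthy
      }
    }),
  );
  const healthy = Object.keys(results).every((name) => results[name] === true);
  return { healthy, results };
}
```

On Node.js 18 or later, the global `fetch` should satisfy `FetchLike`, so `runHealthChecks({ api: apiProbe(fetch) })` would exercise the real endpoint.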
Troubleshooting Guide
Container Won't Start
- Check container logs: podman-compose -f compose.dev.yml logs app
- Verify services are healthy: podman-compose -f compose.dev.yml ps
- Check environment variables in compose.dev.yml
- Try rebuilding: podman-compose -f compose.dev.yml build --no-cache app
Tests Fail in Container but Pass Locally
Tests must run inside the Linux dev container, not directly on the Windows host:
# Wrong (Windows)
npm test
# Correct (in container)
podman exec -it flyer-crawler-dev npm test
PM2 Process Keeps Restarting
- Check logs: pm2 logs <process-name>
- Check memory usage: pm2 monit
- Verify environment variables: pm2 env <process-id>
- Check for unhandled errors in application code
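Restart loops can also be spotted programmatically from the JSON that `pm2 jlist` prints. The sketch below assumes each entry carries `name`, `monit.memory` (bytes), and `pm2_env.restart_time`. Treat these field names as an assumption and confirm them against the output of your PM2 version. The function name and threshold are illustrative.

```typescript
// Subset of the fields assumed to be present in `pm2 jlist` output.
interface Pm2ProcessInfo {
  name: string;
  monit: { memory: number }; // resident memory in bytes
  pm2_env: { status: string; restart_time: number };
}

// Returns the names of processes that restarted more than `maxRestarts` times.
function findUnstableProcesses(jlistJson: string, maxRestarts: number): string[] {
  const processes = JSON.parse(jlistJson) as Pm2ProcessInfo[];
  return processes
    .filter((p) => p.pm2_env.restart_time > maxRestarts)
    .map((p) => p.name);
}
```

A process flagged here while its `monit.memory` sits near the configured `max_memory_restart` value points to memory pressure rather than a crash in application code.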
Database Connection Refused
- Verify PostgreSQL is running
- Check connection string in environment
- Verify database user has permissions
- Check pg_hba.conf for allowed connections
Redis Connection Issues
- Verify Redis is running: redis-cli ping
- Check password in environment variables
- Verify Redis is listening on expected port
- Check maxmemory setting if queue operations fail
Related Documentation
- OVERVIEW.md - Subagent system overview
- ../BARE-METAL-SETUP.md - Production setup guide
- ../adr/0014-containerization-and-deployment-strategy.md - Containerization ADR
- ../adr/0006-background-job-processing-and-task-queues.md - Background jobs ADR
- ../adr/0017-ci-cd-and-branching-strategy.md - CI/CD strategy
- ../adr/0053-worker-health-checks.md - Worker health checks