Last commit: 4f06698dfd by Torben Sorensen, 2026-01-28 15:33:48 -08:00 ("test fixes and doc work")

# DevOps Subagent Guide
This guide covers DevOps-related subagents for deployment, infrastructure, and operations:
- **devops**: Containers, services, CI/CD pipelines, deployments
- **infra-architect**: Resource optimization, capacity planning
- **bg-worker**: Background jobs, PM2 workers, BullMQ queues
---
## CRITICAL: Server Access Model
**Claude Code has READ-ONLY access to production/test servers.**
The `claude-win10` user cannot execute write operations (PM2 restart, systemctl, file modifications) directly on servers. The devops subagent must **provide commands for the user to execute**, not attempt to run them via SSH.
### Command Delegation Workflow
When troubleshooting or making changes to production/test servers:
| Phase | Actor | Action |
| -------- | ------ | ----------------------------------------------------------- |
| Diagnose | Claude | Provide read-only diagnostic commands |
| Report | User | Execute commands, share output with Claude |
| Analyze | Claude | Interpret results, identify root cause |
| Fix | Claude | Provide 1-3 fix commands (never more; errors may cascade) |
| Execute | User | Run fix commands, report results |
| Verify | Claude | Provide verification commands to confirm success |
| Document | Claude | Update relevant documentation with findings and resolutions |
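
The workflow table can be encoded as data so that a tool or a reviewer can check who acts next and whether a proposed fix batch respects the 1-3 command limit. This is a minimal sketch: `WORKFLOW`, `nextPhase`, and `validateFixBatch` are hypothetical names introduced for illustration and do not exist in the project.

```typescript
// Hypothetical helper; names and structure are illustrative only.
type Phase = 'Diagnose' | 'Report' | 'Analyze' | 'Fix' | 'Execute' | 'Verify' | 'Document';
type Actor = 'Claude' | 'User';

// Phase order and owner, copied from the workflow table.
const WORKFLOW: ReadonlyArray<{ phase: Phase; actor: Actor }> = [
  { phase: 'Diagnose', actor: 'Claude' },
  { phase: 'Report', actor: 'User' },
  { phase: 'Analyze', actor: 'Claude' },
  { phase: 'Fix', actor: 'Claude' },
  { phase: 'Execute', actor: 'User' },
  { phase: 'Verify', actor: 'Claude' },
  { phase: 'Document', actor: 'Claude' },
];

const MAX_FIX_COMMANDS = 3;

// Returns the phase that follows `current`, or null after the last phase.
function nextPhase(current: Phase): Phase | null {
  const index = WORKFLOW.findIndex((step) => step.phase === current);
  if (index < 0) return null;
  const next = WORKFLOW[index + 1];
  return next ? next.phase : null;
}

// Accepts a fix batch only if it contains between 1 and 3 non-empty commands.
function validateFixBatch(commands: string[]): { ok: boolean; reason?: string } {
  const nonEmpty = commands.map((c) => c.trim()).filter((c) => c.length > 0);
  if (nonEmpty.length === 0) {
    return { ok: false, reason: 'no commands provided' };
  }
  if (nonEmpty.length > MAX_FIX_COMMANDS) {
    return { ok: false, reason: `${nonEmpty.length} commands exceed the limit of ${MAX_FIX_COMMANDS}` };
  }
  return { ok: true };
}
```

For instance, `validateFixBatch(['pm2 restart flyer-crawler-api'])` passes, while a batch of four commands is rejected and should be split across several Fix/Execute rounds.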
### Example: PM2 Process Issue
Step 1 - Diagnostic Commands (Claude provides, user runs):
```bash
# Check PM2 process status
pm2 list
# View recent error logs
pm2 logs flyer-crawler-api --err --lines 50
# Check system resources
free -h
df -h /var/www
```
Step 2 - User reports output to Claude
Step 3 - Fix Commands (Claude provides 1-3 at a time):
```bash
# Restart the failing process
pm2 restart flyer-crawler-api
```
Step 4 - User executes and reports result
Step 5 - Verification Commands:
```bash
# Confirm process is running
pm2 list
# Test API health
curl -s https://flyer-crawler.projectium.com/api/health/ready | jq .
```
### What NOT to Do
```bash
# WRONG - Claude cannot execute this directly
ssh root@projectium.com "pm2 restart all"
# WRONG - Providing too many commands at once
pm2 stop all && rm -rf node_modules && npm install && pm2 start all
# WRONG - Assuming commands succeeded without user confirmation
```
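The first two anti-patterns above can be caught mechanically before a command is suggested. The sketch below assumes a hypothetical `findCommandProblems` helper that is not part of the project. It splits on `&&`, `||`, and `;` without parsing quotes, so treat it as a rough lint rather than a shell parser.

```typescript
// Hypothetical lint for a single command line the subagent is about to suggest.
function findCommandProblems(command: string): string[] {
  const problems: string[] = [];

  // Direct SSH execution is off limits: the user runs commands on the server.
  if (/^\s*ssh\s/.test(command)) {
    problems.push('runs over SSH instead of delegating to the user');
  }

  // Count chained steps joined by &&, || or ; (naive split, quotes are ignored).
  const steps = command
    .split(/&&|\|\||;/)
    .map((step) => step.trim())
    .filter((step) => step.length > 0);
  if (steps.length > 3) {
    problems.push(`chains ${steps.length} steps; the limit is 3`);
  }

  return problems;
}

// Both "WRONG" examples are flagged; a plain diagnostic command is not.
console.log(findCommandProblems('ssh root@projectium.com "pm2 restart all"').length); // 1
console.log(findCommandProblems('pm2 stop all && rm -rf node_modules && npm install && pm2 start all').length); // 1
console.log(findCommandProblems('pm2 list').length); // 0
```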
---
## The devops Subagent
### When to Use
Use the **devops** subagent when you need to:
- Debug container issues in development
- Modify CI/CD pipelines
- Configure PM2 for production
- Update deployment workflows
- Troubleshoot service startup issues
- Configure NGINX or reverse proxy
- Set up SSL/TLS certificates
### What devops Knows
The devops subagent understands:
- Podman/Docker container management
- Dev container configuration (`.devcontainer/`)
- Compose files (`compose.dev.yml`)
- PM2 ecosystem configuration
- Gitea Actions CI/CD workflows
- NGINX configuration
- Systemd service management
### Development Environment
**Container Architecture:**
```
┌─────────────────────────────────────────────────────────────┐
│                   Development Environment                   │
├─────────────────────────────────────────────────────────────┤
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐      │
│  │     app     │    │  postgres   │    │    redis    │      │
│  │  (Node.js)  │───►│  (PostGIS)  │    │   (Cache)   │      │
│  │             │───►│             │    │             │      │
│  └─────────────┘    └─────────────┘    └─────────────┘      │
│    :3000/:3001           :5432              :6379           │
└─────────────────────────────────────────────────────────────┘
```
**Container Services:**
| Service | Image | Purpose | Port |
| ---------- | ----------------------- | ---------------------- | ---------- |
| `app` | Custom (Dockerfile.dev) | Node.js application | 3000, 3001 |
| `postgres` | postgis/postgis:15-3.4 | Database with PostGIS | 5432 |
| `redis` | redis:alpine | Caching and job queues | 6379 |
### Example Requests
**Container debugging:**
```
"Use devops to debug why the dev container fails to start.
The postgres service shows as unhealthy and the app can't connect."
```
**CI/CD pipeline update:**
```
"Use devops to add a step to the deploy-to-test.yml workflow
that runs database migrations before restarting the app."
```
**PM2 configuration:**
```
"Use devops to update the PM2 ecosystem config to use cluster
mode with 4 instances instead of max for the API server."
```
### Container Commands Reference
```bash
# Start development environment
podman-compose -f compose.dev.yml up -d
# View container logs
podman-compose -f compose.dev.yml logs -f app
# Restart specific service
podman-compose -f compose.dev.yml restart app
# Rebuild container (after Dockerfile changes)
podman-compose -f compose.dev.yml build app
# Reset everything
podman-compose -f compose.dev.yml down -v
podman-compose -f compose.dev.yml up -d --build
# Enter container shell
podman exec -it flyer-crawler-dev bash
# Run tests in container (from Windows)
podman exec -it flyer-crawler-dev npm run test:unit
```
### Git Bash Path Conversion (Windows)
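Git Bash (MSYS) rewrites arguments that look like Unix paths into Windows paths before passing them to native programs such as `podman`, so container paths like `/usr/local/bin/script.sh` can arrive mangled. Any of these workarounds prevents the conversion: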
| Solution | Example |
| -------------------------- | -------------------------------------------------------- |
| `sh -c` with single quotes | `podman exec container sh -c '/usr/local/bin/script.sh'` |
| Double slashes | `podman exec container //usr//local//bin//script.sh` |
| MSYS_NO_PATHCONV=1 | `MSYS_NO_PATHCONV=1 podman exec ...` |
### PM2 Production Configuration
**ecosystem.config.cjs Structure:**
```javascript
module.exports = {
  apps: [
    {
      name: 'flyer-crawler-api',
      script: './node_modules/.bin/tsx',
      args: 'server.ts',
      instances: 'max', // Use all CPU cores
      exec_mode: 'cluster', // Enable cluster mode
      max_memory_restart: '500M',
      kill_timeout: 5000, // Graceful shutdown
      env_production: {
        NODE_ENV: 'production',
        cwd: '/var/www/flyer-crawler.projectium.com',
      },
    },
    {
      name: 'flyer-crawler-worker',
      script: './node_modules/.bin/tsx',
      args: 'src/services/worker.ts',
      instances: 1, // Single instance for workers
      max_memory_restart: '1G',
      kill_timeout: 10000, // Workers need more time
    },
  ],
};
```
**PM2 Commands:**
```bash
# Start/reload with environment
pm2 startOrReload ecosystem.config.cjs --env production --update-env
# Save process list
pm2 save
# View logs
pm2 logs flyer-crawler-api --lines 50
# Monitor processes
pm2 monit
# Describe process
pm2 describe flyer-crawler-api
```
### CI/CD Workflow Files
| File | Purpose |
| ------------------------------------- | --------------------------- |
| `.gitea/workflows/deploy-to-prod.yml` | Production deployment |
| `.gitea/workflows/deploy-to-test.yml` | Test environment deployment |
**Deployment Flow:**
1. Push to `main` branch
2. Gitea Actions triggered
3. SSH to production server
4. Pull latest code
5. Install dependencies
6. Run build
7. Run migrations
8. Restart PM2 processes
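
The ordering of these steps matters: a failed build or migration should stop the run before PM2 is restarted. The sketch below models that as a fail-fast pipeline. It is an illustration only; `DeployStep` and `runDeployment` are hypothetical names, the fail-fast behavior is an assumption about how the workflow is meant to behave, and the real commands live in the Gitea workflow files.

```typescript
// Hypothetical model of the deployment flow; step names mirror the list above.
interface DeployStep {
  name: string;
  run: () => boolean; // returns true on success
}

interface DeployResult {
  completed: string[];
  failedAt: string | null;
}

// Runs steps in order and stops at the first failure (assumed fail-fast behavior).
function runDeployment(steps: DeployStep[]): DeployResult {
  const completed: string[] = [];
  for (const step of steps) {
    let ok = false;
    try {
      ok = step.run();
    } catch {
      ok = false; // a thrown error counts as a failed step
    }
    if (!ok) {
      return { completed, failedAt: step.name };
    }
    completed.push(step.name);
  }
  return { completed, failedAt: null };
}

// Usage: a failing build prevents migrations and the PM2 restart from running.
const deployResult = runDeployment([
  { name: 'Pull latest code', run: () => true },
  { name: 'Install dependencies', run: () => true },
  { name: 'Run build', run: () => false },
  { name: 'Run migrations', run: () => true },
  { name: 'Restart PM2 processes', run: () => true },
]);
console.log(deployResult.failedAt); // "Run build"
```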
### Directory Structure (Production)
```
/var/www/
├── flyer-crawler.projectium.com/ # Production
│ ├── server.ts
│ ├── ecosystem.config.cjs
│ ├── package.json
│ ├── flyer-images/
│ │ ├── icons/
│ │ └── archive/
│ └── logs/
│ └── app.log
└── flyer-crawler-test.projectium.com/ # Test environment
└── ... (same structure)
```
## The infra-architect Subagent
### When to Use
Use the **infra-architect** subagent when you need to:
- Analyze resource usage and optimize
- Plan for scaling
- Reduce infrastructure costs
- Configure memory limits
- Analyze disk usage
- Plan capacity for growth
### What infra-architect Knows
The infra-architect subagent understands:
- Node.js memory management
- PostgreSQL resource tuning
- Redis memory configuration
- Container resource limits
- PM2 process monitoring
- Disk and storage management
### Example Requests
**Memory optimization:**
```
"Use infra-architect to analyze memory usage of the worker
processes. They're frequently hitting the 1GB limit and restarting."
```
**Capacity planning:**
```
"Use infra-architect to estimate resource requirements for
handling 10x current traffic. Include database, Redis, and
application server recommendations."
```
**Cost optimization:**
```
"Use infra-architect to identify opportunities to reduce
infrastructure costs without impacting performance."
```
### Resource Limits Reference
| Process | Memory Limit | Notes |
| ---------------- | ------------ | --------------------- |
| API Server | 500MB | Per cluster instance |
| Worker | 1GB | Single instance |
| Analytics Worker | 1GB | Single instance |
| PostgreSQL | System RAM | Tune `shared_buffers` |
| Redis | 256MB | `maxmemory` setting |
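
These limits can be combined into a rough worst-case memory budget for a host. The sketch below is illustrative only: `buildBudget` and `worstCaseMemoryMb` are hypothetical helpers, it assumes one API instance per CPU core (the effect of `instances: 'max'`), and it leaves PostgreSQL out because the table gives no fixed figure for it. Note that `max_memory_restart` is a restart threshold rather than a hard cap, so real usage can briefly exceed the total.

```typescript
// Limits in MB are taken from the table above; instance counts are assumptions.
interface ProcessLimit {
  name: string;
  limitMb: number;
  instances: number;
}

// One API instance per core (instances: 'max'); single-instance workers and Redis.
function buildBudget(cpuCores: number): ProcessLimit[] {
  return [
    { name: 'API Server', limitMb: 500, instances: cpuCores },
    { name: 'Worker', limitMb: 1024, instances: 1 },
    { name: 'Analytics Worker', limitMb: 1024, instances: 1 },
    { name: 'Redis', limitMb: 256, instances: 1 },
  ];
}

// Sums limit x instances across all processes.
function worstCaseMemoryMb(processes: ProcessLimit[]): number {
  return processes.reduce((total, p) => total + p.limitMb * p.instances, 0);
}

// On a 4-core host: 4 x 500 + 1024 + 1024 + 256 = 4304 MB before PostgreSQL and the OS.
console.log(worstCaseMemoryMb(buildBudget(4))); // 4304
```

On a host with less RAM than this total plus PostgreSQL's share, pinning `instances` to a fixed number is one way to bring the budget down.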
## The bg-worker Subagent
### When to Use
Use the **bg-worker** subagent when you need to:
- Debug BullMQ queue issues
- Add new background job types
- Configure job retry logic
- Analyze job processing failures
- Optimize worker performance
- Handle job timeouts
### What bg-worker Knows
The bg-worker subagent understands:
- BullMQ queue patterns
- PM2 worker configuration
- Job retry and backoff strategies
- Queue monitoring and debugging
- Redis connection for queues
- Worker health checks (ADR-053)
### Queue Architecture
```
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   API Server    │───►│ Redis (BullMQ)  │◄───│     Worker      │
│                 │    │                 │    │                 │
│   queue.add()   │    │   flyerQueue    │    │  process jobs   │
│                 │    │  cleanupQueue   │    │                 │
└─────────────────┘    │ analyticsQueue  │    └─────────────────┘
                       └─────────────────┘
```
### Example Requests
**Debugging stuck jobs:**
```
"Use bg-worker to debug why jobs are stuck in the flyer processing
queue. Check for failed jobs, worker status, and Redis connectivity."
```
**Adding retry logic:**
```
"Use bg-worker to add exponential backoff retry logic to the
AI extraction job. It should retry up to 3 times with increasing
delays for rate limit errors."
```
**Queue monitoring:**
```
"Use bg-worker to add health check endpoints for monitoring
queue depth and worker status."
```
### Queue Configuration
```typescript
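// src/services/queues.server.ts
import { Queue } from 'bullmq';
// redisConnection is the project's shared Redis connection config, defined elsewhere.

export const flyerQueue = new Queue('flyer-processing', {
  connection: redisConnection,
  defaultJobOptions: {
    attempts: 3, // Total tries, including the first
    backoff: {
      type: 'exponential',
      delay: 1000, // Base delay in ms
    },
    removeOnComplete: { count: 100 }, // Keep the last 100 completed jobs
    removeOnFail: { count: 1000 }, // Keep the last 1000 failed jobs
  },
});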
```
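To see what these retry settings mean in practice, the sketch below computes the wait before each retry. It assumes the exponential strategy doubles the base delay on every retry with no jitter; `retryDelaysMs` is a hypothetical helper for illustration, not a BullMQ function.

```typescript
// Assumption: delay before retry n is baseDelayMs * 2^(n - 1), without jitter.
function retryDelaysMs(attempts: number, baseDelayMs: number): number[] {
  const delays: number[] = [];
  // `attempts` counts the first try, so there are attempts - 1 retries.
  for (let retry = 1; retry < attempts; retry++) {
    delays.push(baseDelayMs * Math.pow(2, retry - 1));
  }
  return delays;
}

// With attempts: 3 and delay: 1000, a job is retried after 1 s and then after 2 s.
console.log(retryDelaysMs(3, 1000)); // [ 1000, 2000 ]
```

Under that assumption, a job that keeps failing is marked failed roughly 3 seconds (plus processing time) after its first attempt, which may be too short for rate-limit errors that need longer cool-downs.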
### Worker Configuration
```typescript
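// src/services/workers.server.ts
import { Worker } from 'bullmq';
// redisConnection is the project's shared Redis connection config, defined elsewhere.

export const flyerWorker = new Worker(
  'flyer-processing',
  async (job) => {
    // Process job
  },
  {
    connection: redisConnection,
    concurrency: 5, // Up to 5 jobs processed in parallel
    limiter: {
      max: 10, // At most 10 jobs...
      duration: 1000, // ...per 1000 ms
    },
  },
);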
```
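The `concurrency` and `limiter` settings interact: whichever is tighter determines throughput. The sketch below estimates an upper bound for a single worker process given an average job duration. `maxJobsPerSecond` is a hypothetical helper, and the average durations used in the examples are made-up inputs, not measurements from the project.

```typescript
interface WorkerSettings {
  concurrency: number;
  limiterMax: number;
  limiterDurationMs: number;
}

// Upper bound on jobs per second: the smaller of the concurrency-bound
// rate and the rate-limiter bound.
function maxJobsPerSecond(settings: WorkerSettings, avgJobMs: number): number {
  const byConcurrency = (settings.concurrency * 1000) / avgJobMs;
  const byLimiter = (settings.limiterMax * 1000) / settings.limiterDurationMs;
  return Math.min(byConcurrency, byLimiter);
}

const flyerSettings: WorkerSettings = { concurrency: 5, limiterMax: 10, limiterDurationMs: 1000 };

// Slow jobs (2 s each): concurrency is the bottleneck.
console.log(maxJobsPerSecond(flyerSettings, 2000)); // 2.5
// Fast jobs (100 ms each): the limiter caps throughput.
console.log(maxJobsPerSecond(flyerSettings, 100)); // 10
```

With these settings the limiter only becomes the bottleneck once jobs average under 500 ms; for slower jobs, raising `concurrency` is the setting that changes throughput.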
### Monitoring Queues
```bash
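# Open an interactive Redis session (run the commands below at its prompt)
redis-cli -a "$REDIS_PASSWORD"
# List BullMQ keys (KEYS scans the whole keyspace; prefer SCAN on a busy instance)
> KEYS bull:*
# Number of jobs waiting in the flyer-processing queue
> LLEN bull:flyer-processing:wait
# IDs of failed jobs
> ZRANGE bull:flyer-processing:failed 0 -1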
```
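Once the user reports these numbers, they can be turned into a simple verdict. The sketch below is an assumption-laden illustration: `classifyQueue` and the threshold values are hypothetical, and the inputs are expected to come from commands such as `LLEN bull:flyer-processing:wait` and `ZCARD bull:flyer-processing:failed`.

```typescript
type QueueStatus = 'ok' | 'backlog' | 'failing';

interface QueueSnapshot {
  waiting: number; // e.g. result of LLEN bull:flyer-processing:wait
  failed: number; // e.g. result of ZCARD bull:flyer-processing:failed
}

interface QueueThresholds {
  maxWaiting: number;
  maxFailed: number;
}

// Failures take priority over backlog because they usually need investigation first.
function classifyQueue(snapshot: QueueSnapshot, thresholds: QueueThresholds): QueueStatus {
  if (snapshot.failed > thresholds.maxFailed) return 'failing';
  if (snapshot.waiting > thresholds.maxWaiting) return 'backlog';
  return 'ok';
}

// Hypothetical thresholds; tune them to the queue's normal load.
const limits: QueueThresholds = { maxWaiting: 100, maxFailed: 10 };
console.log(classifyQueue({ waiting: 250, failed: 2 }, limits)); // "backlog"
```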
## Service Management Commands
> **Note**: These commands are for the **user to execute on the server**. Claude Code provides these commands but cannot run them directly due to read-only server access. See [Server Access Model](#critical-server-access-model) above.
### PM2 Commands
```bash
# Start/reload
pm2 startOrReload ecosystem.config.cjs --env production --update-env && pm2 save
# View status
pm2 list
pm2 status
# View logs
pm2 logs
pm2 logs flyer-crawler-api --lines 100
# Restart specific process
pm2 restart flyer-crawler-api
pm2 restart flyer-crawler-worker
# Stop all
pm2 stop all
# Delete all
pm2 delete all
```
### Systemd Services (Production)
| Service    | Command                                                 |
| ---------- | ------------------------------------------------------- |
| PostgreSQL | `sudo systemctl {start\|stop\|status} postgresql`       |
| Redis      | `sudo systemctl {start\|stop\|status} redis-server`     |
| NGINX      | `sudo systemctl {start\|stop\|status} nginx`            |
| Bugsink    | `sudo systemctl {start\|stop\|status} gunicorn-bugsink` |
| Logstash   | `sudo systemctl {start\|stop\|status} logstash`         |
### Health Checks
```bash
# API health check
curl http://localhost:3001/api/health
# PM2 health
pm2 list
# PostgreSQL health
pg_isready -h localhost -p 5432
# Redis health
redis-cli -a $REDIS_PASSWORD ping
```
## Troubleshooting Guide
### Container Won't Start
1. Check container logs: `podman-compose -f compose.dev.yml logs app`
2. Verify services are healthy: `podman-compose -f compose.dev.yml ps`
3. Check environment variables in `compose.dev.yml`
4. Try rebuilding: `podman-compose -f compose.dev.yml build --no-cache app`
### Tests Fail in Container but Pass Locally
Tests must run in the Linux container environment, not directly on the Windows host:
```bash
# Wrong (Windows)
npm test
# Correct (in container)
podman exec -it flyer-crawler-dev npm test
```
### PM2 Process Keeps Restarting
1. Check logs: `pm2 logs <process-name>`
2. Check memory usage: `pm2 monit`
3. Verify environment variables: `pm2 env <process-id>`
4. Check for unhandled errors in application code
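
When the user shares the JSON output of `pm2 jlist`, restart loops can be spotted without reading the whole dump. The sketch below is a minimal example: `findUnstableProcesses` is a hypothetical helper, the `Pm2Process` interface lists only the fields the sketch reads (assumed to match the `pm2 jlist` output), and the sample data is invented.

```typescript
// Subset of the fields assumed to be present in `pm2 jlist` output.
interface Pm2Process {
  name: string;
  pm2_env: { status: string; restart_time: number };
}

// Returns one line per process that is not online or has restarted too often.
function findUnstableProcesses(jlistOutput: string, maxRestarts: number): string[] {
  const processes = JSON.parse(jlistOutput) as Pm2Process[];
  return processes
    .filter((p) => p.pm2_env.status !== 'online' || p.pm2_env.restart_time > maxRestarts)
    .map((p) => `${p.name}: status=${p.pm2_env.status}, restarts=${p.pm2_env.restart_time}`);
}

// Invented sample data in the assumed shape.
const sampleJlist = JSON.stringify([
  { name: 'flyer-crawler-api', pm2_env: { status: 'online', restart_time: 0 } },
  { name: 'flyer-crawler-worker', pm2_env: { status: 'online', restart_time: 27 } },
]);
console.log(findUnstableProcesses(sampleJlist, 5));
// [ 'flyer-crawler-worker: status=online, restarts=27' ]
```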
### Database Connection Refused
1. Verify PostgreSQL is running
2. Check connection string in environment
3. Verify database user has permissions
4. Check `pg_hba.conf` for allowed connections
### Redis Connection Issues
1. Verify Redis is running: `redis-cli -a "$REDIS_PASSWORD" ping`
2. Check password in environment variables
3. Verify Redis is listening on expected port
4. Check `maxmemory` setting if queue operations fail
## Related Documentation
- [OVERVIEW.md](./OVERVIEW.md) - Subagent system overview
- [../BARE-METAL-SETUP.md](../BARE-METAL-SETUP.md) - Production setup guide
- [../adr/0014-containerization-and-deployment-strategy.md](../adr/0014-containerization-and-deployment-strategy.md) - Containerization ADR
- [../adr/0006-background-job-processing-and-task-queues.md](../adr/0006-background-job-processing-and-task-queues.md) - Background jobs ADR
- [../adr/0017-ci-cd-and-branching-strategy.md](../adr/0017-ci-cd-and-branching-strategy.md) - CI/CD strategy
- [../adr/0053-worker-health-checks.md](../adr/0053-worker-health-checks.md) - Worker health checks