# DevOps Subagent Guide

This guide covers DevOps-related subagents for deployment, infrastructure, and operations:

- **devops**: Containers, services, CI/CD pipelines, deployments
- **infra-architect**: Resource optimization, capacity planning
- **bg-worker**: Background jobs, PM2 workers, BullMQ queues

---

## CRITICAL: Server Access Model

**Claude Code has READ-ONLY access to production/test servers.**

The `claude-win10` user cannot execute write operations (PM2 restart, systemctl, file modifications) directly on servers. The devops subagent must **provide commands for the user to execute**, not attempt to run them via SSH.

### Command Delegation Workflow

When troubleshooting or making changes to production/test servers:

| Phase    | Actor  | Action                                                      |
| -------- | ------ | ----------------------------------------------------------- |
| Diagnose | Claude | Provide read-only diagnostic commands                       |
| Report   | User   | Execute commands, share output with Claude                  |
| Analyze  | Claude | Interpret results, identify root cause                      |
| Fix      | Claude | Provide 1-3 fix commands (never more, errors may cascade)   |
| Execute  | User   | Run fix commands, report results                            |
| Verify   | Claude | Provide verification commands to confirm success            |
| Document | Claude | Update relevant documentation with findings and resolutions |

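The Fix phase rules in the workflow table above (at most three commands per batch, no chained one-liners, no direct SSH execution) can be expressed as a small check. This is a minimal sketch only: `checkFixBatch`, `BatchCheck`, and `MAX_FIX_COMMANDS` are hypothetical names that do not exist in the project.

```typescript
// Hypothetical helper, not part of the project: encodes the delegation
// rules from the workflow table as a check on a proposed fix batch.
interface BatchCheck {
  ok: boolean;
  problems: string[];
}

const MAX_FIX_COMMANDS = 3; // "1-3 fix commands" from the Fix phase

function checkFixBatch(commands: string[]): BatchCheck {
  const problems: string[] = [];

  if (commands.length === 0) {
    problems.push("batch is empty");
  }
  if (commands.length > MAX_FIX_COMMANDS) {
    problems.push(
      `batch has ${commands.length} commands; limit is ${MAX_FIX_COMMANDS}`,
    );
  }

  for (const cmd of commands) {
    // Chaining with && or ; hides several commands inside one line.
    if (/&&|;/.test(cmd)) {
      problems.push(`chained command: ${cmd}`);
    }
    // The user runs commands on the server; they must not be wrapped in ssh.
    if (/^\s*ssh\s/.test(cmd)) {
      problems.push(`direct ssh execution: ${cmd}`);
    }
  }

  return { ok: problems.length === 0, problems };
}

console.log(checkFixBatch(["pm2 restart flyer-crawler-api"]));
```

A single `pm2 restart` passes, while a chained one-liner or an `ssh ...` wrapper is reported as a problem so it can be split up or handed to the user instead.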
### Example: PM2 Process Issue

Step 1 - Diagnostic Commands (Claude provides, user runs):

```bash
# Check PM2 process status
pm2 list

# View recent error logs
pm2 logs flyer-crawler-api --err --lines 50

# Check system resources
free -h
df -h /var/www
```

Step 2 - User reports output to Claude

Step 3 - Fix Commands (Claude provides 1-3 at a time):

```bash
# Restart the failing process
pm2 restart flyer-crawler-api
```

Step 4 - User executes and reports result

Step 5 - Verification Commands:

```bash
# Confirm process is running
pm2 list

# Test API health
curl -s https://flyer-crawler.projectium.com/api/health/ready | jq .
```

### What NOT to Do

```bash
# WRONG - Claude cannot execute this directly
ssh root@projectium.com "pm2 restart all"

# WRONG - Providing too many commands at once
pm2 stop all && rm -rf node_modules && npm install && pm2 start all

# WRONG - Assuming commands succeeded without user confirmation
```

---

## The devops Subagent

### When to Use

Use the **devops** subagent when you need to:

- Debug container issues in development
- Modify CI/CD pipelines
- Configure PM2 for production
- Update deployment workflows
- Troubleshoot service startup issues
- Configure NGINX or reverse proxy
- Set up SSL/TLS certificates

### What devops Knows

The devops subagent understands:

- Podman/Docker container management
- Dev container configuration (`.devcontainer/`)
- Compose files (`compose.dev.yml`)
- PM2 ecosystem configuration
- Gitea Actions CI/CD workflows
- NGINX configuration
- Systemd service management

### Development Environment

**Container Architecture:**

```
┌─────────────────────────────────────────────────────────────┐
│                   Development Environment                   │
├─────────────────────────────────────────────────────────────┤
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐      │
│  │     app     │    │  postgres   │    │    redis    │      │
│  │  (Node.js)  │───►│  (PostGIS)  │    │   (Cache)   │      │
│  │             │───►│             │    │             │      │
│  └─────────────┘    └─────────────┘    └─────────────┘      │
│    :3000/:3001           :5432              :6379           │
└─────────────────────────────────────────────────────────────┘
```

**Container Services:**

| Service    | Image                   | Purpose                | Port       |
| ---------- | ----------------------- | ---------------------- | ---------- |
| `app`      | Custom (Dockerfile.dev) | Node.js application    | 3000, 3001 |
| `postgres` | postgis/postgis:15-3.4  | Database with PostGIS  | 5432       |
| `redis`    | redis:alpine            | Caching and job queues | 6379       |

### Example Requests

**Container debugging:**

```
"Use devops to debug why the dev container fails to start.
The postgres service shows as unhealthy and the app can't connect."
```

**CI/CD pipeline update:**

```
"Use devops to add a step to the deploy-to-test.yml workflow
that runs database migrations before restarting the app."
```

**PM2 configuration:**

```
"Use devops to update the PM2 ecosystem config to use cluster
mode with 4 instances instead of max for the API server."
```

### Container Commands Reference

```bash
# Start development environment
podman-compose -f compose.dev.yml up -d

# View container logs
podman-compose -f compose.dev.yml logs -f app

# Restart specific service
podman-compose -f compose.dev.yml restart app

# Rebuild container (after Dockerfile changes)
podman-compose -f compose.dev.yml build app

# Reset everything
podman-compose -f compose.dev.yml down -v
podman-compose -f compose.dev.yml up -d --build

# Enter container shell
podman exec -it flyer-crawler-dev bash

# Run tests in container (from Windows)
podman exec -it flyer-crawler-dev npm run test:unit
```

### Git Bash Path Conversion (Windows)

Git Bash (MSYS) rewrites arguments that look like POSIX paths into Windows paths, which breaks paths meant for the container. Use one of these workarounds:

| Solution                   | Example                                                  |
| -------------------------- | -------------------------------------------------------- |
| `sh -c` with single quotes | `podman exec container sh -c '/usr/local/bin/script.sh'` |
| Double slashes             | `podman exec container //usr//local//bin//script.sh`     |
| `MSYS_NO_PATHCONV=1`       | `MSYS_NO_PATHCONV=1 podman exec ...`                     |

### PM2 Production Configuration

**ecosystem.config.cjs Structure:**

```javascript
module.exports = {
  apps: [
    {
      name: 'flyer-crawler-api',
      script: './node_modules/.bin/tsx',
      args: 'server.ts',
      instances: 'max', // Use all CPU cores
      exec_mode: 'cluster', // Enable cluster mode
      max_memory_restart: '500M',
      kill_timeout: 5000, // Graceful shutdown

      env_production: {
        NODE_ENV: 'production',
        cwd: '/var/www/flyer-crawler.projectium.com',
      },
    },
    {
      name: 'flyer-crawler-worker',
      script: './node_modules/.bin/tsx',
      args: 'src/services/worker.ts',
      instances: 1, // Single instance for workers
      max_memory_restart: '1G',
      kill_timeout: 10000, // Workers need more time
    },
  ],
};
```

**PM2 Commands:**

```bash
# Start/reload with environment
pm2 startOrReload ecosystem.config.cjs --env production --update-env

# Save process list
pm2 save

# View logs
pm2 logs flyer-crawler-api --lines 50

# Monitor processes
pm2 monit

# Describe process
pm2 describe flyer-crawler-api
```

### CI/CD Workflow Files

| File                                  | Purpose                     |
| ------------------------------------- | --------------------------- |
| `.gitea/workflows/deploy-to-prod.yml` | Production deployment       |
| `.gitea/workflows/deploy-to-test.yml` | Test environment deployment |

**Deployment Flow:**

1. Push to `main` branch
2. Gitea Actions triggered
3. SSH to production server
4. Pull latest code
5. Install dependencies
6. Run build
7. Run migrations
8. Restart PM2 processes

### Directory Structure (Production)

```
/var/www/
├── flyer-crawler.projectium.com/        # Production
│   ├── server.ts
│   ├── ecosystem.config.cjs
│   ├── package.json
│   ├── flyer-images/
│   │   ├── icons/
│   │   └── archive/
│   └── logs/
│       └── app.log
└── flyer-crawler-test.projectium.com/   # Test environment
    └── ... (same structure)
```

## The infra-architect Subagent

### When to Use

Use the **infra-architect** subagent when you need to:

- Analyze resource usage and optimize
- Plan for scaling
- Reduce infrastructure costs
- Configure memory limits
- Analyze disk usage
- Plan capacity for growth

### What infra-architect Knows

The infra-architect subagent understands:

- Node.js memory management
- PostgreSQL resource tuning
- Redis memory configuration
- Container resource limits
- PM2 process monitoring
- Disk and storage management

### Example Requests

**Memory optimization:**

```
"Use infra-architect to analyze memory usage of the worker
processes. They're frequently hitting the 1GB limit and restarting."
```

**Capacity planning:**

```
"Use infra-architect to estimate resource requirements for
handling 10x current traffic. Include database, Redis, and
application server recommendations."
```

**Cost optimization:**

```
"Use infra-architect to identify opportunities to reduce
infrastructure costs without impacting performance."
```

### Resource Limits Reference

| Process          | Memory Limit | Notes                 |
| ---------------- | ------------ | --------------------- |
| API Server       | 500MB        | Per cluster instance  |
| Worker           | 1GB          | Single instance       |
| Analytics Worker | 1GB          | Single instance       |
| PostgreSQL       | System RAM   | Tune `shared_buffers` |
| Redis            | 256MB        | `maxmemory` setting   |

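Because the API limit applies per cluster instance, the worst-case footprint grows with the number of instances (with `instances: 'max'`, PM2 starts one per CPU core). The arithmetic can be sketched as below. `LIMITS_MB` and `worstCaseMemoryMb` are hypothetical names, 1GB is treated as 1024MB, and PostgreSQL is left out because the table gives it no fixed cap.

```typescript
// Fixed caps from the Resource Limits table, in MB.
// Assumption: 1GB is counted as 1024MB.
const LIMITS_MB = {
  apiPerInstance: 500,
  worker: 1024,
  analyticsWorker: 1024,
  redis: 256,
};

// Upper bound on memory used by the capped processes for a given number
// of API cluster instances. PostgreSQL is excluded (no fixed cap listed).
function worstCaseMemoryMb(apiInstances: number): number {
  return (
    apiInstances * LIMITS_MB.apiPerInstance +
    LIMITS_MB.worker +
    LIMITS_MB.analyticsWorker +
    LIMITS_MB.redis
  );
}

// Example: a 4-core host running one API instance per core.
console.log(worstCaseMemoryMb(4)); // 4304
```

On a 4-core host this gives 4304MB before PostgreSQL and the operating system are counted, which is a useful baseline when deciding whether `instances: 'max'` is affordable.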
## The bg-worker Subagent

### When to Use

Use the **bg-worker** subagent when you need to:

- Debug BullMQ queue issues
- Add new background job types
- Configure job retry logic
- Analyze job processing failures
- Optimize worker performance
- Handle job timeouts

### What bg-worker Knows

The bg-worker subagent understands:

- BullMQ queue patterns
- PM2 worker configuration
- Job retry and backoff strategies
- Queue monitoring and debugging
- Redis connection for queues
- Worker health checks (ADR-053)

### Queue Architecture

```
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   API Server    │───►│  Redis (BullMQ) │◄───│     Worker      │
│                 │    │                 │    │                 │
│   queue.add()   │    │  flyerQueue     │    │  process jobs   │
│                 │    │  cleanupQueue   │    │                 │
└─────────────────┘    │  analyticsQueue │    └─────────────────┘
                       └─────────────────┘
```

### Example Requests

**Debugging stuck jobs:**

```
"Use bg-worker to debug why jobs are stuck in the flyer processing
queue. Check for failed jobs, worker status, and Redis connectivity."
```

**Adding retry logic:**

```
"Use bg-worker to add exponential backoff retry logic to the
AI extraction job. It should retry up to 3 times with increasing
delays for rate limit errors."
```

**Queue monitoring:**

```
"Use bg-worker to add health check endpoints for monitoring
queue depth and worker status."
```

### Queue Configuration

```typescript
// src/services/queues.server.ts
export const flyerQueue = new Queue('flyer-processing', {
  connection: redisConnection,
  defaultJobOptions: {
    attempts: 3,
    backoff: {
      type: 'exponential',
      delay: 1000,
    },
    removeOnComplete: { count: 100 },
    removeOnFail: { count: 1000 },
  },
});
```

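To see what `attempts: 3` with an exponential backoff of `delay: 1000` means in practice, the retry schedule can be worked out by hand. The sketch below assumes BullMQ's built-in exponential strategy waits `delay * 2^(attemptsMade - 1)` milliseconds before each retry; `retryDelaysMs` is a hypothetical helper, not a BullMQ API.

```typescript
// Assumption: the built-in "exponential" backoff waits
// baseDelayMs * 2^(attemptsMade - 1) ms before each retry.
function retryDelaysMs(attempts: number, baseDelayMs: number): number[] {
  const delays: number[] = [];
  // `attempts` counts the first try, so there are attempts - 1 retries.
  for (let attemptsMade = 1; attemptsMade < attempts; attemptsMade++) {
    delays.push(Math.round(baseDelayMs * Math.pow(2, attemptsMade - 1)));
  }
  return delays;
}

// Settings from the queue configuration above.
console.log(retryDelaysMs(3, 1000)); // [ 1000, 2000 ]
```

Under that assumption a job is tried once, retried after about 1 second, retried again after about 2 seconds, and then moved to the failed set. For rate-limit errors that take longer to clear, a larger base `delay` or more `attempts` may be needed.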
### Worker Configuration

```typescript
// src/services/workers.server.ts
export const flyerWorker = new Worker(
  'flyer-processing',
  async (job) => {
    // Process job
  },
  {
    connection: redisConnection,
    concurrency: 5,
    limiter: {
      max: 10,
      duration: 1000,
    },
  },
);
```

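The `concurrency` and `limiter` settings together bound how fast the queue can drain. The following back-of-the-envelope sketch estimates that bound for a single worker process. `WorkerSettings`, `maxJobsPerSecond`, `drainSeconds`, and the `avgJobMs` parameter (an average job duration you would have to measure) are all hypothetical and not part of BullMQ.

```typescript
interface WorkerSettings {
  concurrency: number; // jobs processed in parallel by one worker
  limiterMax: number; // max jobs started per limiter window
  limiterDurationMs: number; // limiter window length in ms
}

// Throughput ceiling for one worker: the lower of the rate limiter cap
// and what `concurrency` parallel slots can finish at avgJobMs per job.
function maxJobsPerSecond(s: WorkerSettings, avgJobMs: number): number {
  const limiterCeiling = (s.limiterMax * 1000) / s.limiterDurationMs;
  const concurrencyCeiling = (s.concurrency * 1000) / avgJobMs;
  return Math.min(limiterCeiling, concurrencyCeiling);
}

// Best-case time to work through a backlog of `queued` jobs.
function drainSeconds(queued: number, s: WorkerSettings, avgJobMs: number): number {
  return Math.ceil(queued / maxJobsPerSecond(s, avgJobMs));
}

// Settings from the worker configuration above.
const flyerSettings: WorkerSettings = {
  concurrency: 5,
  limiterMax: 10,
  limiterDurationMs: 1000,
};

console.log(maxJobsPerSecond(flyerSettings, 2000)); // 2.5 (concurrency-bound)
console.log(drainSeconds(600, flyerSettings, 2000)); // 240
```

With jobs averaging 2 seconds, concurrency is the bottleneck (2.5 jobs/s) and the limiter of 10 jobs/s never engages. The limiter only matters once jobs are fast enough that five slots could exceed 10 starts per second.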
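### Monitoring Queues

```bash
# Check queue status via Redis (one-shot commands, no interactive prompt)

# List BullMQ keys; --scan avoids blocking Redis the way KEYS can
redis-cli -a "$REDIS_PASSWORD" --scan --pattern 'bull:*'

# Number of jobs waiting in the flyer-processing queue
redis-cli -a "$REDIS_PASSWORD" LLEN bull:flyer-processing:wait

# IDs of failed jobs
redis-cli -a "$REDIS_PASSWORD" ZRANGE bull:flyer-processing:failed 0 -1
```
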
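The key names used above follow a `bull:<queue>:<state>` pattern, so the same checks can be generated for any queue. This is a sketch under two assumptions: that the default `bull` prefix is in use, and that `wait` and `active` are Redis lists while `delayed` and `failed` are sorted sets. `bullKey` and `inspectionCommands` are hypothetical helpers.

```typescript
// Build a BullMQ Redis key, assuming the default "bull" prefix.
function bullKey(queueName: string, state: string, prefix = "bull"): string {
  return `${prefix}:${queueName}:${state}`;
}

// Redis commands that report how many jobs sit in each state.
// Assumption: wait/active are lists (LLEN); delayed/failed are
// sorted sets (ZCARD).
function inspectionCommands(queueName: string): string[] {
  return [
    `LLEN ${bullKey(queueName, "wait")}`,
    `LLEN ${bullKey(queueName, "active")}`,
    `ZCARD ${bullKey(queueName, "delayed")}`,
    `ZCARD ${bullKey(queueName, "failed")}`,
  ];
}

for (const cmd of inspectionCommands("flyer-processing")) {
  console.log(cmd);
}
```

Each printed line can be passed to `redis-cli` as a read-only diagnostic, which fits the Diagnose phase of the command delegation workflow.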
## Service Management Commands

> **Note**: These commands are for the **user to execute on the server**. Claude Code provides these commands but cannot run them directly due to read-only server access. See [Server Access Model](#critical-server-access-model) above.

### PM2 Commands

```bash
# Start/reload
pm2 startOrReload ecosystem.config.cjs --env production --update-env && pm2 save

# View status
pm2 list
pm2 status

# View logs
pm2 logs
pm2 logs flyer-crawler-api --lines 100

# Restart specific process
pm2 restart flyer-crawler-api
pm2 restart flyer-crawler-worker

# Stop all
pm2 stop all

# Delete all
pm2 delete all
```

### Systemd Services (Production)

| Service    | Command                                                 |
| ---------- | ------------------------------------------------------- |
| PostgreSQL | `sudo systemctl {start\|stop\|status} postgresql`       |
| Redis      | `sudo systemctl {start\|stop\|status} redis-server`     |
| NGINX      | `sudo systemctl {start\|stop\|status} nginx`            |
| Bugsink    | `sudo systemctl {start\|stop\|status} gunicorn-bugsink` |
| Logstash   | `sudo systemctl {start\|stop\|status} logstash`         |

### Health Checks

```bash
# API health check
curl http://localhost:3001/api/health

# PM2 health
pm2 list

# PostgreSQL health
pg_isready -h localhost -p 5432

# Redis health (quote the variable so special characters in the password survive)
redis-cli -a "$REDIS_PASSWORD" ping
```

## Troubleshooting Guide

### Container Won't Start

1. Check container logs: `podman-compose -f compose.dev.yml logs app`
2. Verify services are healthy: `podman-compose -f compose.dev.yml ps`
3. Check environment variables in `compose.dev.yml`
4. Try rebuilding: `podman-compose -f compose.dev.yml build --no-cache app`

### Tests Fail in Container but Pass Locally

Tests must run in the Linux container environment:

```bash
# Wrong (Windows)
npm test

# Correct (in container)
podman exec -it flyer-crawler-dev npm test
```

### PM2 Process Keeps Restarting

1. Check logs: `pm2 logs <process-name>`
2. Check memory usage: `pm2 monit`
3. Verify environment variables: `pm2 env <process-id>`
4. Check for unhandled errors in application code

### Database Connection Refused

1. Verify PostgreSQL is running
2. Check connection string in environment
3. Verify database user has permissions
4. Check `pg_hba.conf` for allowed connections

### Redis Connection Issues

1. Verify Redis is running: `redis-cli ping`
2. Check password in environment variables
3. Verify Redis is listening on expected port
4. Check `maxmemory` setting if queue operations fail

## Related Documentation

- [OVERVIEW.md](./OVERVIEW.md) - Subagent system overview
- [../BARE-METAL-SETUP.md](../BARE-METAL-SETUP.md) - Production setup guide
- [../adr/0014-containerization-and-deployment-strategy.md](../adr/0014-containerization-and-deployment-strategy.md) - Containerization ADR
- [../adr/0006-background-job-processing-and-task-queues.md](../adr/0006-background-job-processing-and-task-queues.md) - Background jobs ADR
- [../adr/0017-ci-cd-and-branching-strategy.md](../adr/0017-ci-cd-and-branching-strategy.md) - CI/CD strategy
- [../adr/0053-worker-health-checks.md](../adr/0053-worker-health-checks.md) - Worker health checks