doc updates and test fixin
docs/subagents/DEVOPS-GUIDE.md
# DevOps Subagent Guide

This guide covers DevOps-related subagents for deployment, infrastructure, and operations:

- **devops**: Containers, services, CI/CD pipelines, deployments
- **infra-architect**: Resource optimization, capacity planning
- **bg-worker**: Background jobs, PM2 workers, BullMQ queues

## The devops Subagent

### When to Use

Use the **devops** subagent when you need to:

- Debug container issues in development
- Modify CI/CD pipelines
- Configure PM2 for production
- Update deployment workflows
- Troubleshoot service startup issues
- Configure NGINX or a reverse proxy
- Set up SSL/TLS certificates

### What devops Knows

The devops subagent understands:

- Podman/Docker container management
- Dev container configuration (`.devcontainer/`)
- Compose files (`compose.dev.yml`)
- PM2 ecosystem configuration
- Gitea Actions CI/CD workflows
- NGINX configuration
- Systemd service management

### Development Environment

**Container Architecture:**

```
┌─────────────────────────────────────────────────────────────┐
│                   Development Environment                   │
├─────────────────────────────────────────────────────────────┤
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐      │
│  │     app     │    │  postgres   │    │    redis    │      │
│  │  (Node.js)  │───►│  (PostGIS)  │    │   (Cache)   │      │
│  │             │───►│             │    │             │      │
│  └─────────────┘    └─────────────┘    └─────────────┘      │
│    :3000/:3001           :5432              :6379           │
└─────────────────────────────────────────────────────────────┘
```

**Container Services:**

| Service | Image | Purpose | Port |
|---------|-------|---------|------|
| `app` | Custom (Dockerfile.dev) | Node.js application | 3000, 3001 |
| `postgres` | postgis/postgis:15-3.4 | Database with PostGIS | 5432 |
| `redis` | redis:alpine | Caching and job queues | 6379 |

### Example Requests

**Container debugging:**

```
"Use devops to debug why the dev container fails to start.
The postgres service shows as unhealthy and the app can't connect."
```

**CI/CD pipeline update:**

```
"Use devops to add a step to the deploy-to-test.yml workflow
that runs database migrations before restarting the app."
```

**PM2 configuration:**

```
"Use devops to update the PM2 ecosystem config to use cluster
mode with 4 instances instead of max for the API server."
```

### Container Commands Reference

```bash
# Start development environment
podman-compose -f compose.dev.yml up -d

# View container logs
podman-compose -f compose.dev.yml logs -f app

# Restart specific service
podman-compose -f compose.dev.yml restart app

# Rebuild container (after Dockerfile changes)
podman-compose -f compose.dev.yml build app

# Reset everything (-v also removes volumes, so database data is lost)
podman-compose -f compose.dev.yml down -v
podman-compose -f compose.dev.yml up -d --build

# Enter container shell
podman exec -it flyer-crawler-dev bash

# Run unit tests inside the container (for example, from a Windows host)
podman exec -it flyer-crawler-dev npm run test:unit
```

### Git Bash Path Conversion (Windows)

Git Bash on Windows (MSYS) rewrites arguments that look like POSIX paths, such as `/usr/local/bin/script.sh`, into Windows paths before passing them to native programs such as `podman`, so container paths can arrive mangled. Any of the following workarounds avoids the conversion:

| Solution | Example |
|----------|---------|
| `sh -c` with single quotes | `podman exec container sh -c '/usr/local/bin/script.sh'` |
| Double slashes | `podman exec container //usr//local//bin//script.sh` |
| `MSYS_NO_PATHCONV=1` | `MSYS_NO_PATHCONV=1 podman exec ...` |

### PM2 Production Configuration

**ecosystem.config.cjs Structure:**

```javascript
module.exports = {
  apps: [
    {
      name: 'flyer-crawler-api',
      script: './node_modules/.bin/tsx',
      args: 'server.ts',
      instances: 'max', // Use all CPU cores
      exec_mode: 'cluster', // Enable cluster mode
      max_memory_restart: '500M', // Restart an instance that exceeds 500 MB
      kill_timeout: 5000, // ms allowed for graceful shutdown before SIGKILL

      env_production: {
        NODE_ENV: 'production',
        cwd: '/var/www/flyer-crawler.projectium.com',
      },
    },
    {
      name: 'flyer-crawler-worker',
      script: './node_modules/.bin/tsx',
      args: 'src/services/worker.ts',
      instances: 1, // Single instance for workers
      max_memory_restart: '1G',
      kill_timeout: 10000, // Workers need more time to finish in-flight jobs
    },
  ],
};
```

**PM2 Commands:**

```bash
# Start/reload with environment
pm2 startOrReload ecosystem.config.cjs --env production --update-env

# Save process list
pm2 save

# View logs
pm2 logs flyer-crawler-api --lines 50

# Monitor processes
pm2 monit

# Describe process
pm2 describe flyer-crawler-api
```

### CI/CD Workflow Files

| File | Purpose |
|------|---------|
| `.gitea/workflows/deploy-to-prod.yml` | Production deployment |
| `.gitea/workflows/deploy-to-test.yml` | Test environment deployment |

**Deployment Flow:**

1. Push to `main` branch
2. Gitea Actions triggered
3. SSH to production server
4. Pull latest code
5. Install dependencies
6. Run build
7. Run migrations
8. Restart PM2 processes
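
The order of the server-side steps (4 to 8) matters: migrations run before the restart, and a failed build should never lead to a restart. As a rough sketch of that fail-fast ordering, assuming each step reports success or failure, the flow can be modelled as below. `DEPLOY_STEPS`, `runDeploy`, and the injected `run` callback are hypothetical names for illustration; the real workflow files may handle failures differently.

```typescript
// Hypothetical model of deployment steps 4-8; not taken from the workflow files.
type StepResult = { step: string; ok: boolean };

const DEPLOY_STEPS: readonly string[] = [
  "pull latest code",
  "install dependencies",
  "run build",
  "run migrations",
  "restart PM2 processes",
];

// Runs the steps in order and stops at the first failure, so later steps
// (for example the PM2 restart) never execute after a broken build.
function runDeploy(run: (step: string) => boolean): StepResult[] {
  const results: StepResult[] = [];
  for (const step of DEPLOY_STEPS) {
    const ok = run(step);
    results.push({ step, ok });
    if (!ok) break;
  }
  return results;
}

// Simulate a failing build: migrations and the restart are skipped.
const outcome = runDeploy((step) => step !== "run build");
console.log(outcome.map((r) => `${r.step}: ${r.ok ? "ok" : "FAILED"}`).join("\n"));
```

Injecting the runner keeps the sketch independent of SSH and shell details; in practice each step would wrap the actual command used by the workflow.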

### Directory Structure (Production)

```
/var/www/
├── flyer-crawler.projectium.com/          # Production
│   ├── server.ts
│   ├── ecosystem.config.cjs
│   ├── package.json
│   ├── flyer-images/
│   │   ├── icons/
│   │   └── archive/
│   └── logs/
│       └── app.log
└── flyer-crawler-test.projectium.com/     # Test environment
    └── ... (same structure)
```

## The infra-architect Subagent

### When to Use

Use the **infra-architect** subagent when you need to:

- Analyze resource usage and optimize
- Plan for scaling
- Reduce infrastructure costs
- Configure memory limits
- Analyze disk usage
- Plan capacity for growth

### What infra-architect Knows

The infra-architect subagent understands:

- Node.js memory management
- PostgreSQL resource tuning
- Redis memory configuration
- Container resource limits
- PM2 process monitoring
- Disk and storage management

### Example Requests

**Memory optimization:**

```
"Use infra-architect to analyze memory usage of the worker
processes. They're frequently hitting the 1GB limit and restarting."
```

**Capacity planning:**

```
"Use infra-architect to estimate resource requirements for
handling 10x current traffic. Include database, Redis, and
application server recommendations."
```

**Cost optimization:**

```
"Use infra-architect to identify opportunities to reduce
infrastructure costs without impacting performance."
```

### Resource Limits Reference

| Process | Memory Limit | Notes |
|---------|--------------|-------|
| API Server | 500MB | Per cluster instance |
| Worker | 1GB | Single instance |
| Analytics Worker | 1GB | Single instance |
| PostgreSQL | System RAM | Tune `shared_buffers` |
| Redis | 256MB | `maxmemory` setting |
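
Because the API limit applies per cluster instance, the total memory the Node.js processes and Redis may claim grows with the instance count. The sketch below adds up the limits from the table as a rough planning ceiling. It assumes 1GB means 1024 MiB, leaves PostgreSQL out because its limit is not a fixed number, and uses the hypothetical names `LIMITS_MIB` and `memoryCeilingMiB`. PM2 checks `max_memory_restart` periodically, so a process can briefly exceed its limit before being restarted.

```typescript
// Limits copied from the table above, in MiB (assumption: 1GB = 1024 MiB).
const LIMITS_MIB = {
  apiPerInstance: 500,
  worker: 1024,
  analyticsWorker: 1024,
  redis: 256,
};

// Approximate upper bound for the API cluster, both workers, and Redis.
// PostgreSQL is excluded because it is tuned against system RAM.
function memoryCeilingMiB(apiInstances: number): number {
  if (!Number.isInteger(apiInstances) || apiInstances < 1) {
    throw new RangeError("apiInstances must be a positive integer");
  }
  return (
    apiInstances * LIMITS_MIB.apiPerInstance +
    LIMITS_MIB.worker +
    LIMITS_MIB.analyticsWorker +
    LIMITS_MIB.redis
  );
}

// Four API instances: 4 * 500 + 1024 + 1024 + 256 = 4304 MiB.
console.log(memoryCeilingMiB(4));
```

With `instances: 'max'`, the instance count equals the number of CPU cores, so the ceiling should be checked against the RAM of the target server before PostgreSQL's share is allocated.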

## The bg-worker Subagent

### When to Use

Use the **bg-worker** subagent when you need to:

- Debug BullMQ queue issues
- Add new background job types
- Configure job retry logic
- Analyze job processing failures
- Optimize worker performance
- Handle job timeouts

### What bg-worker Knows

The bg-worker subagent understands:

- BullMQ queue patterns
- PM2 worker configuration
- Job retry and backoff strategies
- Queue monitoring and debugging
- Redis connection for queues
- Worker health checks (ADR-053)

### Queue Architecture

```
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   API Server    │───►│  Redis (BullMQ) │◄───│     Worker      │
│                 │    │                 │    │                 │
│   queue.add()   │    │   flyerQueue    │    │  process jobs   │
│                 │    │  cleanupQueue   │    │                 │
└─────────────────┘    │ analyticsQueue  │    └─────────────────┘
                       └─────────────────┘
```

### Example Requests

**Debugging stuck jobs:**

```
"Use bg-worker to debug why jobs are stuck in the flyer processing
queue. Check for failed jobs, worker status, and Redis connectivity."
```

**Adding retry logic:**

```
"Use bg-worker to add exponential backoff retry logic to the
AI extraction job. It should retry up to 3 times with increasing
delays for rate limit errors."
```

**Queue monitoring:**

```
"Use bg-worker to add health check endpoints for monitoring
queue depth and worker status."
```

### Queue Configuration

```typescript
// src/services/queues.server.ts
import { Queue } from 'bullmq';

// `redisConnection` is defined elsewhere in the project (not shown here).
export const flyerQueue = new Queue('flyer-processing', {
  connection: redisConnection,
  defaultJobOptions: {
    attempts: 3, // total tries, including the first run
    backoff: {
      type: 'exponential',
      delay: 1000, // base delay in ms
    },
    removeOnComplete: { count: 100 }, // keep only the latest 100 completed jobs
    removeOnFail: { count: 1000 }, // keep only the latest 1000 failed jobs
  },
});
```
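
With `attempts: 3` and a 1000 ms exponential backoff, a failing job is retried at most twice. The sketch below works out the wait before each retry, assuming BullMQ's built-in exponential strategy computes `delay * 2^(attemptsMade - 1)` with no jitter. `retryDelaysMs` is a hypothetical helper for illustration, not part of BullMQ or this project.

```typescript
// Assumed formula for the built-in "exponential" backoff without jitter:
// wait = baseDelayMs * 2^(attemptsMade - 1)
function retryDelaysMs(attempts: number, baseDelayMs: number): number[] {
  const delays: number[] = [];
  // A job configured with `attempts: n` is retried at most n - 1 times.
  for (let attemptsMade = 1; attemptsMade < attempts; attemptsMade++) {
    delays.push(Math.round(baseDelayMs * Math.pow(2, attemptsMade - 1)));
  }
  return delays;
}

// Values from the queue configuration: first retry after 1000 ms,
// second retry after 2000 ms, then the job is marked as failed.
console.log(retryDelaysMs(3, 1000));
```

Under these assumptions a job that keeps failing spends about three seconds in backoff in total, which is short for rate limit errors; raising `delay` or `attempts` lengthens the schedule.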

### Worker Configuration

```typescript
// src/services/workers.server.ts
import { Worker } from 'bullmq';

// `redisConnection` is defined elsewhere in the project (not shown here).
export const flyerWorker = new Worker(
  'flyer-processing',
  async (job) => {
    // Process job
  },
  {
    connection: redisConnection,
    concurrency: 5, // jobs this worker processes in parallel
    limiter: {
      max: 10, // at most 10 jobs...
      duration: 1000, // ...per 1000 ms window
    },
  }
);
```
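
The `concurrency` and `limiter` settings both cap throughput, and whichever is lower wins. As a back-of-the-envelope sketch, assuming a single worker and a known average job duration (a figure this guide does not state), the upper bound can be estimated as follows. `maxJobsPerSecond` and its option names are hypothetical.

```typescript
interface ThroughputInput {
  concurrency: number; // parallel jobs per worker
  limiterMax: number; // jobs allowed per limiter window
  limiterDurationMs: number; // limiter window length in ms
  avgJobMs: number; // assumed average job duration in ms
}

// Upper bound on jobs per second: the smaller of the rate-limiter cap
// and what `concurrency` parallel slots can complete.
function maxJobsPerSecond(input: ThroughputInput): number {
  if (input.avgJobMs <= 0 || input.limiterDurationMs <= 0) {
    throw new RangeError("durations must be positive");
  }
  const rateCap = input.limiterMax / (input.limiterDurationMs / 1000);
  const concurrencyCap = input.concurrency / (input.avgJobMs / 1000);
  return Math.min(rateCap, concurrencyCap);
}

// Settings from the worker above with a hypothetical 2-second job:
// the 5 parallel slots, not the limiter, bound throughput at 2.5 jobs/s.
console.log(maxJobsPerSecond({ concurrency: 5, limiterMax: 10, limiterDurationMs: 1000, avgJobMs: 2000 }));
```

If the estimate shows the limiter is never reached, raising `limiter.max` has no effect; increasing `concurrency` or shortening jobs is what moves the bound.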

### Monitoring Queues

```bash
# Check queue status via Redis
redis-cli -a "$REDIS_PASSWORD"

# KEYS walks the entire keyspace; on a busy instance prefer SCAN
> KEYS bull:*
> LLEN bull:flyer-processing:wait
> ZRANGE bull:flyer-processing:failed 0 -1
```
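
The same counts can drive a simple health verdict in application code. The sketch below classifies a queue from its waiting and failed counts; in a real service these numbers could come from BullMQ's `Queue.getJobCounts`, but here they are passed in directly. The type names, `classifyQueue`, and both thresholds are illustrative placeholders, not project settings.

```typescript
type QueueCounts = { wait: number; active: number; failed: number };
type QueueHealth = "ok" | "backlogged" | "failing";

// Thresholds are illustrative defaults, not values used by the project.
function classifyQueue(
  counts: QueueCounts,
  maxWaiting: number = 100,
  maxFailed: number = 10
): QueueHealth {
  // Failures take priority: a failing queue needs attention even if it is short.
  if (counts.failed > maxFailed) return "failing";
  if (counts.wait > maxWaiting) return "backlogged";
  return "ok";
}

// A long waiting list with few failures points at slow or stopped workers.
console.log(classifyQueue({ wait: 250, active: 5, failed: 2 }));
```

Checking failures before backlog is a design choice: a growing failed set usually explains a backlog, so it is the more useful signal to surface first.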

## Service Management Commands

### PM2 Commands

```bash
# Start/reload
pm2 startOrReload ecosystem.config.cjs --env production --update-env && pm2 save

# View status
pm2 list
pm2 status

# View logs
pm2 logs
pm2 logs flyer-crawler-api --lines 100

# Restart specific process
pm2 restart flyer-crawler-api
pm2 restart flyer-crawler-worker

# Stop all
pm2 stop all

# Delete all
pm2 delete all
```

### Systemd Services (Production)

| Service | Command |
|---------|---------|
| PostgreSQL | `sudo systemctl {start\|stop\|status} postgresql` |
| Redis | `sudo systemctl {start\|stop\|status} redis-server` |
| NGINX | `sudo systemctl {start\|stop\|status} nginx` |
| Bugsink | `sudo systemctl {start\|stop\|status} gunicorn-bugsink` |
| Logstash | `sudo systemctl {start\|stop\|status} logstash` |

### Health Checks

```bash
# API health check
curl http://localhost:3001/api/health

# PM2 health
pm2 list

# PostgreSQL health
pg_isready -h localhost -p 5432

# Redis health (replies PONG when the server is reachable)
redis-cli -a "$REDIS_PASSWORD" ping
```

## Troubleshooting Guide

### Container Won't Start

1. Check container logs: `podman-compose -f compose.dev.yml logs app`
2. Verify services are healthy: `podman-compose -f compose.dev.yml ps`
3. Check environment variables in `compose.dev.yml`
4. Try rebuilding: `podman-compose -f compose.dev.yml build --no-cache app`

### Tests Fail in Container but Pass Locally

The test suite is meant to run in the Linux container environment. If results differ between the Windows host and the container, treat the container run as authoritative:

```bash
# Wrong (Windows)
npm test

# Correct (in container)
podman exec -it flyer-crawler-dev npm test
```

### PM2 Process Keeps Restarting

1. Check logs: `pm2 logs <process-name>`
2. Check memory usage: `pm2 monit`
3. Verify environment variables: `pm2 env <process-id>`
4. Check for unhandled errors in application code

### Database Connection Refused

1. Verify PostgreSQL is running: `pg_isready -h localhost -p 5432`
2. Check the connection string in the environment
3. Verify the database user has the required permissions
4. Check `pg_hba.conf` for allowed connections

### Redis Connection Issues

1. Verify Redis is running: `redis-cli -a "$REDIS_PASSWORD" ping`
2. Check the password in the environment variables
3. Verify Redis is listening on the expected port (6379)
4. Check the `maxmemory` setting if queue operations fail

## Related Documentation

- [OVERVIEW.md](./OVERVIEW.md) - Subagent system overview
- [../BARE-METAL-SETUP.md](../BARE-METAL-SETUP.md) - Production setup guide
- [../adr/0014-containerization-and-deployment-strategy.md](../adr/0014-containerization-and-deployment-strategy.md) - Containerization ADR
- [../adr/0006-background-job-processing-and-task-queues.md](../adr/0006-background-job-processing-and-task-queues.md) - Background jobs ADR
- [../adr/0017-ci-cd-and-branching-strategy.md](../adr/0017-ci-cd-and-branching-strategy.md) - CI/CD strategy
- [../adr/0053-worker-health-checks.md](../adr/0053-worker-health-checks.md) - Worker health checks