# Monitoring Guide

This guide covers all aspects of monitoring the Flyer Crawler application across development, test, and production environments.

## Table of Contents

1. [Health Checks](#health-checks)
2. [Bugsink Error Tracking](#bugsink-error-tracking)
3. [Logstash Log Aggregation](#logstash-log-aggregation)
4. [PM2 Process Monitoring](#pm2-process-monitoring)
5. [Database Monitoring](#database-monitoring)
6. [Redis Monitoring](#redis-monitoring)
7. [Production Alerts and On-Call](#production-alerts-and-on-call)

---

## Health Checks

The application exposes health check endpoints at `/api/health/*` implementing ADR-020.

### Endpoint Reference

| Endpoint                | Purpose                | Use Case                                |
| ----------------------- | ---------------------- | --------------------------------------- |
| `/api/health/ping`      | Simple connectivity    | Quick "is it running?" check            |
| `/api/health/live`      | Liveness probe         | Container orchestration restart trigger |
| `/api/health/ready`     | Readiness probe        | Load balancer traffic routing           |
| `/api/health/startup`   | Startup probe          | Initial container readiness             |
| `/api/health/db-schema` | Schema verification    | Deployment validation                   |
| `/api/health/db-pool`   | Connection pool status | Performance diagnostics                 |
| `/api/health/redis`     | Redis connectivity     | Cache/queue health                      |
| `/api/health/storage`   | File storage access    | Upload capability                       |
| `/api/health/time`      | Server time sync       | Time-sensitive operations               |

### Liveness Probe (`/api/health/live`)

Returns 200 OK if the Node.js process is running. No external dependencies.

```bash
# Check liveness
curl -s https://flyer-crawler.projectium.com/api/health/live | jq .

# Expected response
{
  "success": true,
  "data": {
    "status": "ok",
    "timestamp": "2026-01-22T10:00:00.000Z"
  }
}
```

**Usage**: If this endpoint fails, restart the application immediately.

### Readiness Probe (`/api/health/ready`)

Comprehensive check of all critical dependencies: database, Redis, and storage.
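Conceptually, the endpoint reduces the per-service results to one overall status. A minimal sketch of that reduction, assuming "unhealthy" dominates "degraded", which dominates "healthy" (the `overall_status` helper is illustrative, not the actual implementation):

```shell
# (Sketch) Reduce per-service statuses to an overall status.
# Assumed semantics: any "unhealthy" wins, then "degraded", else "healthy".
overall_status() {
  overall=healthy
  for s in "$@"; do
    case "$s" in
      unhealthy) overall=unhealthy ;;
      degraded) if [ "$overall" = healthy ]; then overall=degraded; fi ;;
    esac
  done
  echo "$overall"
}

overall_status healthy degraded healthy
# → degraded
```

An overall `unhealthy` is what causes the endpoint to return 503, per the status table.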
```bash
# Check readiness
curl -s https://flyer-crawler.projectium.com/api/health/ready | jq .

# Expected healthy response (200)
{
  "success": true,
  "data": {
    "status": "healthy",
    "timestamp": "2026-01-22T10:00:00.000Z",
    "uptime": 3600.5,
    "services": {
      "database": {
        "status": "healthy",
        "latency": 5,
        "details": {
          "totalConnections": 10,
          "idleConnections": 8,
          "waitingConnections": 0
        }
      },
      "redis": {
        "status": "healthy",
        "latency": 2
      },
      "storage": {
        "status": "healthy",
        "latency": 1,
        "details": {
          "path": "/var/www/flyer-crawler.projectium.com/flyer-images"
        }
      }
    }
  }
}
```

**Status Values**:

| Status      | Meaning                                          | Action                    |
| ----------- | ------------------------------------------------ | ------------------------- |
| `healthy`   | All critical services operational                | None required             |
| `degraded`  | Non-critical issues (e.g., high connection wait) | Monitor closely           |
| `unhealthy` | Critical service unavailable (returns 503)       | Remove from load balancer |

### Database Health Thresholds

| Metric              | Healthy             | Degraded | Unhealthy        |
| ------------------- | ------------------- | -------- | ---------------- |
| Query response      | `SELECT 1` succeeds | N/A      | Connection fails |
| Waiting connections | 0-3                 | 4+       | N/A              |

### Verifying Services from CLI

**Production**:

```bash
# Quick health check
curl -s https://flyer-crawler.projectium.com/api/health/ready | jq '.data.status'

# Database pool status
curl -s https://flyer-crawler.projectium.com/api/health/db-pool | jq .

# Redis health
curl -s https://flyer-crawler.projectium.com/api/health/redis | jq .
```

**Test Environment**:

```bash
# Test environment runs on port 3002
curl -s https://flyer-crawler-test.projectium.com/api/health/ready | jq .
```

**Dev Container**:

```bash
# From inside the container
curl -s http://localhost:3001/api/health/ready | jq .

# From Windows host (via port mapping)
curl -s http://localhost:3001/api/health/ready | jq .
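
# (Sketch) Gate a script or CI step on readiness. The status values
# ("healthy" / "degraded" / "unhealthy") follow the status table above;
# the helper name ready_exit_code is illustrative, not part of the app.
ready_exit_code() {
  case "$1" in
    healthy) return 0 ;;
    degraded) return 1 ;;
    *) return 2 ;;
  esac
}
# Example (assumed usage):
#   ready_exit_code "$(curl -s http://localhost:3001/api/health/ready | jq -r '.data.status')"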
```

### Admin System Check UI

The admin dashboard at `/admin` includes a **System Check** component that runs all health checks with a visual interface:

1. Navigate to `https://flyer-crawler.projectium.com/admin`
2. Login with admin credentials
3. View the "System Check" section
4. Click "Re-run Checks" to verify all services

Checks include:

- Backend Server Connection
- PM2 Process Status
- Database Connection Pool
- Redis Connection
- Database Schema
- Default Admin User
- Assets Storage Directory
- Gemini API Key

---

## Bugsink Error Tracking

Bugsink is our self-hosted, Sentry-compatible error tracking system (ADR-015).

### Access Points

| Environment       | URL                              | Purpose                    |
| ----------------- | -------------------------------- | -------------------------- |
| **Production**    | `https://bugsink.projectium.com` | Production and test errors |
| **Dev Container** | `https://localhost:8443`         | Local development errors   |

### Credentials

**Production Bugsink**:

- Credentials stored in password manager
- Admin account created during initial deployment

**Dev Container Bugsink**:

- Email: `admin@localhost`
- Password: `admin`

### Projects

| Project ID | Name                              | Environment | Error Source                    |
| ---------- | --------------------------------- | ----------- | ------------------------------- |
| 1          | flyer-crawler-backend             | Production  | Backend Node.js errors          |
| 2          | flyer-crawler-frontend            | Production  | Frontend JavaScript errors      |
| 3          | flyer-crawler-backend-test        | Test        | Test environment backend        |
| 4          | flyer-crawler-frontend-test       | Test        | Test environment frontend       |
| 5          | flyer-crawler-infrastructure      | Production  | PostgreSQL, Redis, NGINX errors |
| 6          | flyer-crawler-test-infrastructure | Test        | Test infra errors               |

**Dev Container Projects** (localhost:8000):

- Project 1: Backend (Dev)
- Project 2: Frontend (Dev)

### Accessing Errors via Web UI

1. Navigate to the Bugsink URL
2. Login with credentials
3. Select project from the sidebar
4.
Click on an issue to view details

**Issue Details Include**:

- Exception type and message
- Full stack trace
- Request context (URL, method, headers)
- User context (if authenticated)
- Occurrence statistics (first seen, last seen, count)
- Release/version information

### Accessing Errors via MCP

Claude Code and other AI tools can access Bugsink via MCP servers.

**Available MCP Tools**:

```bash
# List all projects
mcp__bugsink__list_projects

# List unresolved issues for a project
mcp__bugsink__list_issues --project_id 1 --status unresolved

# Get issue details
mcp__bugsink__get_issue --issue_id

# Get stacktrace (pre-rendered Markdown)
mcp__bugsink__get_stacktrace --event_id

# List events for an issue
mcp__bugsink__list_events --issue_id
```

**MCP Server Configuration**:

Production (in `~/.claude/settings.json`):

```json
{
  "bugsink": {
    "command": "node",
    "args": ["d:\\gitea\\bugsink-mcp\\dist\\index.js"],
    "env": {
      "BUGSINK_URL": "https://bugsink.projectium.com",
      "BUGSINK_TOKEN": ""
    }
  }
}
```

Dev Container (in `.mcp.json`):

```json
{
  "localerrors": {
    "command": "node",
    "args": ["d:\\gitea\\bugsink-mcp\\dist\\index.js"],
    "env": {
      "BUGSINK_URL": "http://127.0.0.1:8000",
      "BUGSINK_TOKEN": ""
    }
  }
}
```

### Creating API Tokens

Bugsink 2.0.11 does not have a UI for API tokens. Create via Django management command.

**Production**:

```bash
ssh root@projectium.com "cd /opt/bugsink && bugsink-manage create_auth_token"
```

**Dev Container**:

```bash
MSYS_NO_PATHCONV=1 podman exec -e DATABASE_URL=postgresql://bugsink:bugsink_dev_password@postgres:5432/bugsink -e SECRET_KEY=dev-bugsink-secret-key-minimum-50-characters-for-security flyer-crawler-dev sh -c 'cd /opt/bugsink/conf && DJANGO_SETTINGS_MODULE=bugsink_conf PYTHONPATH=/opt/bugsink/conf:/opt/bugsink/lib/python3.10/site-packages /opt/bugsink/bin/python -m django create_auth_token'
```

The command outputs a 40-character hex token.
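Since the token is pasted into MCP configuration by hand, it is worth sanity-checking its shape before use. A small sketch (the `is_bugsink_token` helper is illustrative, not a Bugsink command) that accepts only a 40-character lowercase hex string:

```shell
# Accept only 40-character lowercase hex strings -- the shape of the
# token printed by create_auth_token. Helper name is illustrative.
is_bugsink_token() {
  case "$1" in
    *[!0-9a-f]*) return 1 ;;  # reject any non-hex character
  esac
  [ "${#1}" -eq 40 ]
}

is_bugsink_token "0123456789abcdef0123456789abcdef01234567" && echo "token shape OK"
# → token shape OK
```

This catches the common copy-paste failures: truncated tokens, trailing whitespace, or an accidentally copied prompt character.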
### Interpreting Errors

**Error Anatomy**:

```
TypeError: Cannot read properties of undefined (reading 'map')
├── Exception Type: TypeError
├── Message: Cannot read properties of undefined (reading 'map')
├── Where: FlyerItemsList.tsx:45:23
├── When: 2026-01-22T10:30:00.000Z
├── Count: 12 occurrences
└── Context:
    ├── URL: GET /api/flyers/123/items
    ├── User: user@example.com
    └── Release: v0.12.5
```

**Common Error Patterns**:

| Pattern                             | Likely Cause                                      | Investigation                                      |
| ----------------------------------- | ------------------------------------------------- | -------------------------------------------------- |
| `TypeError: ... undefined`          | Missing null check, API returned unexpected shape | Check API response, add defensive coding           |
| `DatabaseError: Connection timeout` | Pool exhaustion, slow queries                     | Check `/api/health/db-pool`, review slow query log |
| `RedisConnectionError`              | Redis unavailable                                 | Check Redis service, network connectivity          |
| `ValidationError: ...`              | Invalid input, schema mismatch                    | Review request payload, update validation          |
| `NotFoundError: ...`                | Missing resource                                  | Verify resource exists, check ID format            |

### Error Triage Workflow

1. **Review new issues daily** in Bugsink
2. **Categorize by severity**:
   - **Critical**: Data corruption, security, payment failures
   - **High**: Core feature broken for many users
   - **Medium**: Feature degraded, workaround available
   - **Low**: Minor UX issues, cosmetic bugs
3. **Check occurrence count** - frequent errors need urgent attention
4. **Review stack trace** - identify root cause
5. **Check recent deployments** - did a release introduce this?
6. **Create Gitea issue** if not auto-synced

### Bugsink-to-Gitea Sync

The test environment automatically syncs Bugsink issues to Gitea (see `docs/BUGSINK-SYNC.md`).

**Sync Workflow**:

1. Runs every 15 minutes on test server
2. Fetches unresolved issues from all Bugsink projects
3. Creates Gitea issues with appropriate labels
4.
Marks synced issues as resolved in Bugsink

**Manual Sync**:

```bash
# Trigger sync via API (test environment only)
curl -X POST https://flyer-crawler-test.projectium.com/api/admin/bugsink/sync \
  -H "Authorization: Bearer "
```

---

## Logstash Log Aggregation

Logstash aggregates logs from multiple sources and forwards errors to Bugsink (ADR-050).

### Architecture

```
Log Sources            Logstash           Outputs
┌──────────────┐      ┌─────────────┐      ┌─────────────┐
│ PostgreSQL   │──────│             │──────│ Bugsink     │
│ PM2 Workers  │──────│  Filter     │──────│ (errors)    │
│ Redis        │──────│  & Route    │──────│             │
│ NGINX        │──────│             │──────│ File Logs   │
└──────────────┘      └─────────────┘      │ (all logs)  │
                                           └─────────────┘
```

### Configuration Files

| Path                                                | Purpose                     |
| --------------------------------------------------- | --------------------------- |
| `/etc/logstash/conf.d/bugsink.conf`                 | Main pipeline configuration |
| `/etc/postgresql/14/main/conf.d/observability.conf` | PostgreSQL logging settings |
| `/var/log/logstash/`                                | Logstash file outputs       |
| `/var/lib/logstash/sincedb_*`                       | File position tracking      |

### Log Sources

| Source      | Path                                               | Contents                            |
| ----------- | -------------------------------------------------- | ----------------------------------- |
| PostgreSQL  | `/var/log/postgresql/*.log`                        | Function logs, slow queries, errors |
| PM2 Workers | `/home/gitea-runner/.pm2/logs/flyer-crawler-*.log` | Worker stdout/stderr                |
| Redis       | `/var/log/redis/redis-server.log`                  | Connection errors, memory warnings  |
| NGINX       | `/var/log/nginx/access.log`, `error.log`           | HTTP requests, upstream errors      |

### Pipeline Status

**Check Logstash Service**:

```bash
ssh root@projectium.com

# Service status
systemctl status logstash

# Recent logs
journalctl -u logstash -n 50 --no-pager

# Pipeline statistics
curl -s http://localhost:9600/_node/stats/pipelines?pretty | jq '.pipelines.main.events'

# Event counts (cumulative since Logstash started)
curl -s http://localhost:9600/_node/stats/pipelines?pretty | jq \
  '{ in: .pipelines.main.events.in, out: .pipelines.main.events.out, filtered: .pipelines.main.events.filtered }'
```

**Check Filter Performance**:

```bash
# Grok pattern success/failure rates
curl -s http://localhost:9600/_node/stats/pipelines?pretty | \
  jq '.pipelines.main.plugins.filters[] | select(.name == "grok") | {name, events_in: .events.in, events_out: .events.out, failures}'
```

### Viewing Aggregated Logs

```bash
# PM2 worker logs (all workers combined)
tail -f /var/log/logstash/pm2-workers-$(date +%Y-%m-%d).log

# Redis operational logs
tail -f /var/log/logstash/redis-operational-$(date +%Y-%m-%d).log

# NGINX access logs (parsed)
tail -f /var/log/logstash/nginx-access-$(date +%Y-%m-%d).log

# PostgreSQL function logs
tail -f /var/log/postgresql/postgresql-$(date +%Y-%m-%d).log
```

### Troubleshooting Logstash

| Issue                 | Diagnostic                  | Solution                        |
| --------------------- | --------------------------- | ------------------------------- |
| No events processed   | `systemctl status logstash` | Start/restart service           |
| Config syntax error   | Test config command         | Fix config file                 |
| Grok failures         | Check stats endpoint        | Update grok patterns            |
| Wrong Bugsink project | Check environment tags      | Verify tag routing              |
| Permission denied     | `groups logstash`           | Add to `postgres`, `adm` groups |
| PM2 logs not captured | Check file paths            | Verify log file existence       |
| High disk usage       | Check log rotation          | Configure logrotate             |

**Test Configuration**:

```bash
/usr/share/logstash/bin/logstash --config.test_and_exit -f /etc/logstash/conf.d/bugsink.conf
```

**Restart After Config Change**:

```bash
systemctl restart logstash
journalctl -u logstash -f  # Watch for startup errors
```

---

## PM2 Process Monitoring

PM2 manages the Node.js application processes in production.
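Restart counts can also be watched from a script rather than eyeballed in `pm2 list`. The sketch below flags processes above a restart threshold; in practice the name/restart pairs could be derived from PM2's JSON output via `pm2 jlist` and `jq` (field names are assumptions — verify against your PM2 version):

```shell
# (Sketch) Print names of processes whose restart count exceeds a threshold.
# stdin: "name restarts" pairs, one per line -- e.g. produced by something like
#   pm2 jlist | jq -r '.[] | "\(.name) \(.pm2_env.restart_time)"'
# (jq expression and field names are assumed, not verified here).
flag_restart_loops() {
  threshold=$1
  while read -r name restarts; do
    if [ "$restarts" -gt "$threshold" ]; then
      echo "$name"
    fi
  done
}

printf 'flyer-crawler-api 0\nflyer-crawler-worker 12\n' | flag_restart_loops 5
# → flyer-crawler-worker
```

A cron job wrapping this could feed the alert channels described later in this guide.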
### Process Overview

**Production Processes** (`ecosystem.config.cjs`):

| Process Name                     | Script      | Purpose              | Instances          |
| -------------------------------- | ----------- | -------------------- | ------------------ |
| `flyer-crawler-api`              | `server.ts` | Express API server   | Cluster (max CPUs) |
| `flyer-crawler-worker`           | `worker.ts` | BullMQ job processor | 1                  |
| `flyer-crawler-analytics-worker` | `worker.ts` | Analytics jobs       | 1                  |

**Test Processes** (`ecosystem-test.config.cjs`):

| Process Name                          | Script      | Port | Instances     |
| ------------------------------------- | ----------- | ---- | ------------- |
| `flyer-crawler-api-test`              | `server.ts` | 3002 | 1 (fork mode) |
| `flyer-crawler-worker-test`           | `worker.ts` | N/A  | 1             |
| `flyer-crawler-analytics-worker-test` | `worker.ts` | N/A  | 1             |

### Basic Commands

```bash
ssh root@projectium.com
su - gitea-runner  # PM2 runs under this user

# List all processes
pm2 list

# Process details
pm2 show flyer-crawler-api

# Monitor in real-time
pm2 monit

# View logs
pm2 logs flyer-crawler-api
pm2 logs flyer-crawler-worker --lines 100

# View all logs
pm2 logs

# Restart processes
pm2 restart flyer-crawler-api
pm2 restart all

# Reload without downtime (cluster mode only)
pm2 reload flyer-crawler-api

# Stop processes
pm2 stop flyer-crawler-api
```

### Health Indicators

**Healthy Process**:

```
┌─────────────────────┬────┬─────────┬─────────┬───────┬────────┬─────────┬──────────┐
│ Name                │ id │ mode    │ status  │ cpu   │ mem    │ uptime  │ restarts │
├─────────────────────┼────┼─────────┼─────────┼───────┼────────┼─────────┼──────────┤
│ flyer-crawler-api   │ 0  │ cluster │ online  │ 0.5%  │ 150MB  │ 5d      │ 0        │
│ flyer-crawler-api   │ 1  │ cluster │ online  │ 0.3%  │ 145MB  │ 5d      │ 0        │
│ flyer-crawler-worker│ 2  │ fork    │ online  │ 0.1%  │ 200MB  │ 5d      │ 0        │
└─────────────────────┴────┴─────────┴─────────┴───────┴────────┴─────────┴──────────┘
```

**Warning Signs**:

- `status: errored` - Process crashed
- High `restarts` count - Instability
- High `mem` (>500MB for
API, >1GB for workers) - Memory leak
- Low `uptime` with high restarts - Repeated crashes

### Log File Locations

| Process                | stdout                                                      | stderr          |
| ---------------------- | ----------------------------------------------------------- | --------------- |
| `flyer-crawler-api`    | `/home/gitea-runner/.pm2/logs/flyer-crawler-api-out.log`    | `...-error.log` |
| `flyer-crawler-worker` | `/home/gitea-runner/.pm2/logs/flyer-crawler-worker-out.log` | `...-error.log` |

### Memory Management

PM2 is configured to restart processes when they exceed memory limits:

| Process          | Memory Limit | Action       |
| ---------------- | ------------ | ------------ |
| API              | 500MB        | Auto-restart |
| Worker           | 1GB          | Auto-restart |
| Analytics Worker | 1GB          | Auto-restart |

**Check Memory Usage**:

```bash
pm2 show flyer-crawler-api | grep memory
pm2 show flyer-crawler-worker | grep memory
```

### Restart Strategies

PM2 uses exponential backoff for restarts:

```javascript
{
  max_restarts: 40,
  exp_backoff_restart_delay: 100, // Start at 100ms, exponentially increase
  min_uptime: '10s', // Must run 10s to be considered "started"
}
```

**Force Restart After Repeated Failures**:

```bash
pm2 delete flyer-crawler-api
pm2 start ecosystem.config.cjs --only flyer-crawler-api
```

---

## Database Monitoring

### Connection Pool Status

The application uses a PostgreSQL connection pool with these defaults:

| Setting                   | Value | Purpose                          |
| ------------------------- | ----- | -------------------------------- |
| `max`                     | 20    | Maximum concurrent connections   |
| `idleTimeoutMillis`       | 30000 | Close idle connections after 30s |
| `connectionTimeoutMillis` | 2000  | Fail if connection takes >2s     |

**Check Pool Status via API**:

```bash
curl -s https://flyer-crawler.projectium.com/api/health/db-pool | jq .
# Response
{
  "success": true,
  "data": {
    "message": "Pool Status: 10 total, 8 idle, 0 waiting.",
    "totalCount": 10,
    "idleCount": 8,
    "waitingCount": 0
  }
}
```

**Pool Health Thresholds**:

| Metric              | Healthy | Warning | Critical   |
| ------------------- | ------- | ------- | ---------- |
| Waiting Connections | 0-2     | 3-4     | 5+         |
| Total Connections   | 1-15    | 16-19   | 20 (maxed) |

### Slow Query Logging

PostgreSQL is configured to log slow queries:

```ini
# /etc/postgresql/14/main/conf.d/observability.conf
log_min_duration_statement = 1000  # Log queries over 1 second
```

**View Slow Queries**:

```bash
ssh root@projectium.com
grep "duration:" /var/log/postgresql/postgresql-$(date +%Y-%m-%d).log | tail -20
```

### Database Size Monitoring

```bash
# Connect to production database
psql -h localhost -U flyer_crawler_prod -d flyer-crawler-prod

# Database size
SELECT pg_size_pretty(pg_database_size('flyer-crawler-prod'));

# Table sizes
SELECT relname AS table,
       pg_size_pretty(pg_total_relation_size(relid)) AS total_size,
       pg_size_pretty(pg_relation_size(relid)) AS data_size,
       pg_size_pretty(pg_indexes_size(relid)) AS index_size
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 10;

# Check for bloat
SELECT schemaname, relname, n_dead_tup, n_live_tup,
       round(n_dead_tup * 100.0 / nullif(n_live_tup + n_dead_tup, 0), 2) as dead_pct
FROM pg_stat_user_tables
WHERE n_dead_tup > 1000
ORDER BY n_dead_tup DESC;
```

### Disk Space Monitoring

```bash
# Check PostgreSQL data directory
du -sh /var/lib/postgresql/14/main/

# Check available disk space
df -h /var/lib/postgresql/

# Estimate growth rate
psql -c "SELECT date_trunc('day', created_at) as day, count(*) FROM flyer_items WHERE created_at > now() - interval '7 days' GROUP BY 1 ORDER BY 1;"
```

### Database Health via MCP

```bash
# Query database directly
mcp__devdb__query --sql "SELECT count(*) FROM flyers WHERE created_at > now() - interval '1 day'"

# Check connection count
mcp__devdb__query --sql "SELECT
count(*) FROM pg_stat_activity WHERE datname = 'flyer_crawler_dev'"
```

---

## Redis Monitoring

### Basic Health Check

```bash
# Via API endpoint
curl -s https://flyer-crawler.projectium.com/api/health/redis | jq .

# Direct Redis check (on server)
redis-cli ping  # Should return PONG
```

### Memory Usage

```bash
redis-cli info memory | grep -E "used_memory_human|maxmemory_human|mem_fragmentation_ratio"

# Expected output
used_memory_human:50.00M
maxmemory_human:256.00M
mem_fragmentation_ratio:1.05
```

**Memory Thresholds**:

| Metric              | Healthy     | Warning | Critical |
| ------------------- | ----------- | ------- | -------- |
| Used Memory         | <70% of max | 70-85%  | >85%     |
| Fragmentation Ratio | 1.0-1.5     | 1.5-2.0 | >2.0     |

### Cache Statistics

```bash
redis-cli info stats | grep -E "keyspace_hits|keyspace_misses|evicted_keys"

# Calculate hit rate
# Hit Rate = keyspace_hits / (keyspace_hits + keyspace_misses) * 100
```

**Cache Hit Rate Targets**:

- Excellent: >95%
- Good: 85-95%
- Needs attention: <85%

### Queue Monitoring

BullMQ queues are stored in Redis:

```bash
# List all queues
redis-cli keys "bull:*:id"

# Check queue depths
redis-cli llen "bull:flyer-processing:wait"
redis-cli llen "bull:email-sending:wait"
redis-cli llen "bull:analytics-reporting:wait"

# Check failed jobs
redis-cli llen "bull:flyer-processing:failed"
```

**Queue Depth Thresholds**:

| Queue               | Normal | Warning | Critical |
| ------------------- | ------ | ------- | -------- |
| flyer-processing    | 0-10   | 11-50   | >50      |
| email-sending       | 0-100  | 101-500 | >500     |
| analytics-reporting | 0-5    | 6-20    | >20      |

### Bull Board UI

Access the job queue dashboard:

- **Production**: `https://flyer-crawler.projectium.com/api/admin/jobs` (requires admin auth)
- **Test**: `https://flyer-crawler-test.projectium.com/api/admin/jobs`
- **Dev**: `http://localhost:3001/api/admin/jobs`

Features:

- View all queues and job counts
- Inspect job data and errors
- Retry failed jobs
- Clean completed jobs

### Redis Database
Allocation

| Database | Purpose                  |
| -------- | ------------------------ |
| 0        | BullMQ production queues |
| 1        | BullMQ test queues       |
| 15       | Bugsink sync state       |

---

## Production Alerts and On-Call

### Critical Monitoring Targets

| Service    | Check               | Interval | Alert Threshold        |
| ---------- | ------------------- | -------- | ---------------------- |
| API Server | `/api/health/ready` | 1 min    | 2 consecutive failures |
| Database   | Pool waiting count  | 1 min    | >5 waiting             |
| Redis      | Memory usage        | 5 min    | >85% of maxmemory      |
| Disk Space | `/var/log`          | 15 min   | <10GB free             |
| Worker     | Queue depth         | 5 min    | >50 jobs waiting       |
| Error Rate | Bugsink issue count | 15 min   | >10 new issues/hour    |

### Alert Channels

Configure alerts in your monitoring tool (UptimeRobot, Datadog, etc.):

1. **Slack channel**: `#flyer-crawler-alerts`
2. **Email**: On-call rotation email
3. **PagerDuty**: Critical issues only

### On-Call Response Procedures

**P1 - Critical (Site Down)**:

1. Acknowledge alert within 5 minutes
2. Check `/api/health/ready` - identify failing service
3. Check PM2 status: `pm2 list`
4. Check recent deploys: `git log -5 --oneline`
5. If database: check pool, restart if needed
6. If Redis: check memory, flush if critical
7. If application: restart PM2 processes
8. Document in incident channel

**P2 - High (Degraded Service)**:

1. Acknowledge within 15 minutes
2. Review Bugsink for error patterns
3. Check system resources (CPU, memory, disk)
4. Identify root cause
5. Plan remediation
6. Create Gitea issue if not auto-created

**P3 - Medium (Non-Critical)**:

1. Acknowledge within 1 hour
2. Review during business hours
3.
Create Gitea issue for tracking

### Quick Diagnostic Commands

```bash
# Full system health check
ssh root@projectium.com << 'EOF'
echo "=== Service Status ==="
systemctl status pm2-gitea-runner --no-pager
systemctl status logstash --no-pager
systemctl status redis --no-pager
systemctl status postgresql --no-pager

echo "=== PM2 Processes ==="
su - gitea-runner -c "pm2 list"

echo "=== Disk Space ==="
df -h / /var

echo "=== Memory ==="
free -h

echo "=== Recent Errors ==="
journalctl -p err -n 20 --no-pager
EOF
```

### Runbook Quick Reference

| Symptom         | First Action     | If That Fails         |
| --------------- | ---------------- | --------------------- |
| 503 errors      | Restart PM2      | Check database, Redis |
| Slow responses  | Check DB pool    | Review slow query log |
| High error rate | Check Bugsink    | Review recent deploys |
| Queue backlog   | Restart worker   | Scale workers         |
| Out of memory   | Restart process  | Increase PM2 limit    |
| Disk full       | Clean old logs   | Expand volume         |
| Redis OOM       | Flush cache keys | Increase maxmemory    |

### Post-Incident Review

After any P1/P2 incident:

1. Write incident report within 24 hours
2. Identify root cause
3. Document timeline of events
4. List action items to prevent recurrence
5. Schedule review meeting if needed
6. Update runbooks if new procedures discovered

---

## Related Documentation

- [ADR-015: Application Performance Monitoring](../adr/0015-application-performance-monitoring-and-error-tracking.md)
- [ADR-020: Health Checks](../adr/0020-health-checks-and-liveness-readiness-probes.md)
- [ADR-050: PostgreSQL Function Observability](../adr/0050-postgresql-function-observability.md)
- [ADR-053: Worker Health Checks](../adr/0053-worker-health-checks.md)
- [DEV-CONTAINER-BUGSINK.md](../DEV-CONTAINER-BUGSINK.md)
- [BUGSINK-SYNC.md](../BUGSINK-SYNC.md)
- [LOGSTASH-QUICK-REF.md](LOGSTASH-QUICK-REF.md)
- [LOGSTASH-TROUBLESHOOTING.md](LOGSTASH-TROUBLESHOOTING.md)
- [LOGSTASH_DEPLOYMENT_CHECKLIST.md](../LOGSTASH_DEPLOYMENT_CHECKLIST.md)