# Monitoring Guide

This guide covers all aspects of monitoring the Flyer Crawler application across development, test, and production environments.

## Table of Contents

1. [Health Checks](#health-checks)
2. [Bugsink Error Tracking](#bugsink-error-tracking)
3. [Logstash Log Aggregation](#logstash-log-aggregation)
4. [PM2 Process Monitoring](#pm2-process-monitoring)
5. [Database Monitoring](#database-monitoring)
6. [Redis Monitoring](#redis-monitoring)
7. [Production Alerts and On-Call](#production-alerts-and-on-call)

---

## Health Checks

The application exposes health check endpoints at `/api/health/*` implementing ADR-020.

### Endpoint Reference

| Endpoint                | Purpose                | Use Case                                |
| ----------------------- | ---------------------- | --------------------------------------- |
| `/api/health/ping`      | Simple connectivity    | Quick "is it running?" check            |
| `/api/health/live`      | Liveness probe         | Container orchestration restart trigger |
| `/api/health/ready`     | Readiness probe        | Load balancer traffic routing           |
| `/api/health/startup`   | Startup probe          | Initial container readiness             |
| `/api/health/db-schema` | Schema verification    | Deployment validation                   |
| `/api/health/db-pool`   | Connection pool status | Performance diagnostics                 |
| `/api/health/redis`     | Redis connectivity     | Cache/queue health                      |
| `/api/health/storage`   | File storage access    | Upload capability                       |
| `/api/health/time`      | Server time sync       | Time-sensitive operations               |

### Liveness Probe (`/api/health/live`)

Returns 200 OK if the Node.js process is running. No external dependencies.

```bash
# Check liveness
curl -s https://flyer-crawler.projectium.com/api/health/live | jq .

# Expected response
{
  "success": true,
  "data": {
    "status": "ok",
    "timestamp": "2026-01-22T10:00:00.000Z"
  }
}
```

**Usage**: If this endpoint fails, restart the application immediately.

### Readiness Probe (`/api/health/ready`)

Comprehensive check of all critical dependencies: database, Redis, and storage.
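Conceptually, the endpoint reduces the per-service results to one overall status. A minimal sketch of that reduction, assuming "unhealthy" dominates "degraded", which dominates "healthy" (the `overall_status` helper is illustrative, not the actual implementation):

```shell
# (Sketch) Reduce per-service statuses to an overall status.
# Assumed semantics: any "unhealthy" wins, then "degraded", else "healthy".
overall_status() {
  overall=healthy
  for s in "$@"; do
    case "$s" in
      unhealthy) overall=unhealthy ;;
      degraded) if [ "$overall" = healthy ]; then overall=degraded; fi ;;
    esac
  done
  echo "$overall"
}

overall_status healthy degraded healthy
# → degraded
```

An overall `unhealthy` is what causes the endpoint to return 503, per the status table.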
```bash
# Check readiness
curl -s https://flyer-crawler.projectium.com/api/health/ready | jq .

# Expected healthy response (200)
{
  "success": true,
  "data": {
    "status": "healthy",
    "timestamp": "2026-01-22T10:00:00.000Z",
    "uptime": 3600.5,
    "services": {
      "database": {
        "status": "healthy",
        "latency": 5,
        "details": {
          "totalConnections": 10,
          "idleConnections": 8,
          "waitingConnections": 0
        }
      },
      "redis": {
        "status": "healthy",
        "latency": 2
      },
      "storage": {
        "status": "healthy",
        "latency": 1,
        "details": {
          "path": "/var/www/flyer-crawler.projectium.com/flyer-images"
        }
      }
    }
  }
}
```

**Status Values**:

| Status      | Meaning                                          | Action                    |
| ----------- | ------------------------------------------------ | ------------------------- |
| `healthy`   | All critical services operational                | None required             |
| `degraded`  | Non-critical issues (e.g., high connection wait) | Monitor closely           |
| `unhealthy` | Critical service unavailable (returns 503)       | Remove from load balancer |

### Database Health Thresholds

| Metric              | Healthy             | Degraded | Unhealthy        |
| ------------------- | ------------------- | -------- | ---------------- |
| Query response      | `SELECT 1` succeeds | N/A      | Connection fails |
| Waiting connections | 0-3                 | 4+       | N/A              |

### Verifying Services from CLI

**Production**:

```bash
# Quick health check
curl -s https://flyer-crawler.projectium.com/api/health/ready | jq '.data.status'

# Database pool status
curl -s https://flyer-crawler.projectium.com/api/health/db-pool | jq .

# Redis health
curl -s https://flyer-crawler.projectium.com/api/health/redis | jq .
```

**Test Environment**:

```bash
# Test environment runs on port 3002
curl -s https://flyer-crawler-test.projectium.com/api/health/ready | jq .
```

**Dev Container**:

```bash
# From inside the container
curl -s http://localhost:3001/api/health/ready | jq .

# From Windows host (via port mapping)
curl -s http://localhost:3001/api/health/ready | jq .
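
# (Sketch) Gate a script or CI step on readiness. The status values
# ("healthy" / "degraded" / "unhealthy") follow the status table above;
# the helper name ready_exit_code is illustrative, not part of the app.
ready_exit_code() {
  case "$1" in
    healthy) return 0 ;;
    degraded) return 1 ;;
    *) return 2 ;;
  esac
}
# Example (assumed usage):
#   ready_exit_code "$(curl -s http://localhost:3001/api/health/ready | jq -r '.data.status')"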
```

### Admin System Check UI

The admin dashboard at `/admin` includes a **System Check** component that runs all health checks with a visual interface:

1. Navigate to `https://flyer-crawler.projectium.com/admin`
2. Login with admin credentials
3. View the "System Check" section
4. Click "Re-run Checks" to verify all services

Checks include:

- Backend Server Connection
- PM2 Process Status
- Database Connection Pool
- Redis Connection
- Database Schema
- Default Admin User
- Assets Storage Directory
- Gemini API Key

---

## Bugsink Error Tracking

Bugsink is our self-hosted, Sentry-compatible error tracking system (ADR-015).

### Access Points

| Environment       | URL                              | Purpose                    |
| ----------------- | -------------------------------- | -------------------------- |
| **Production**    | `https://bugsink.projectium.com` | Production and test errors |
| **Dev Container** | `https://localhost:8443`         | Local development errors   |

### Credentials

**Production Bugsink**:

- Credentials stored in password manager
- Admin account created during initial deployment

**Dev Container Bugsink**:

- Email: `admin@localhost`
- Password: `admin`

### Projects

| Project ID | Name                              | Environment | Error Source                    |
| ---------- | --------------------------------- | ----------- | ------------------------------- |
| 1          | flyer-crawler-backend             | Production  | Backend Node.js errors          |
| 2          | flyer-crawler-frontend            | Production  | Frontend JavaScript errors      |
| 3          | flyer-crawler-backend-test        | Test        | Test environment backend        |
| 4          | flyer-crawler-frontend-test       | Test        | Test environment frontend       |
| 5          | flyer-crawler-infrastructure      | Production  | PostgreSQL, Redis, NGINX errors |
| 6          | flyer-crawler-test-infrastructure | Test        | Test infra errors               |

**Dev Container Projects** (localhost:8000):

- Project 1: Backend (Dev)
- Project 2: Frontend (Dev)

### Accessing Errors via Web UI

1. Navigate to the Bugsink URL
2. Login with credentials
3. Select project from the sidebar
4.
Click on an issue to view details

**Issue Details Include**:

- Exception type and message
- Full stack trace
- Request context (URL, method, headers)
- User context (if authenticated)
- Occurrence statistics (first seen, last seen, count)
- Release/version information

### Accessing Errors via MCP

Claude Code and other AI tools can access Bugsink via MCP servers.

**Available MCP Tools**:

```bash
# List all projects
mcp__bugsink__list_projects

# List unresolved issues for a project
mcp__bugsink__list_issues --project_id 1 --status unresolved

# Get issue details
mcp__bugsink__get_issue --issue_id

# Get stacktrace (pre-rendered Markdown)
mcp__bugsink__get_stacktrace --event_id

# List events for an issue
mcp__bugsink__list_events --issue_id
```

**MCP Server Configuration**:

Production (in `~/.claude/settings.json`):

```json
{
  "bugsink": {
    "command": "node",
    "args": ["d:\\gitea\\bugsink-mcp\\dist\\index.js"],
    "env": {
      "BUGSINK_URL": "https://bugsink.projectium.com",
      "BUGSINK_TOKEN": ""
    }
  }
}
```

Dev Container (in `.mcp.json`):

```json
{
  "localerrors": {
    "command": "node",
    "args": ["d:\\gitea\\bugsink-mcp\\dist\\index.js"],
    "env": {
      "BUGSINK_URL": "http://127.0.0.1:8000",
      "BUGSINK_TOKEN": ""
    }
  }
}
```

### Creating API Tokens

Bugsink 2.0.11 does not have a UI for API tokens. Create via Django management command.

**Production**:

```bash
ssh root@projectium.com "cd /opt/bugsink && bugsink-manage create_auth_token"
```

**Dev Container**:

```bash
MSYS_NO_PATHCONV=1 podman exec -e DATABASE_URL=postgresql://bugsink:bugsink_dev_password@postgres:5432/bugsink -e SECRET_KEY=dev-bugsink-secret-key-minimum-50-characters-for-security flyer-crawler-dev sh -c 'cd /opt/bugsink/conf && DJANGO_SETTINGS_MODULE=bugsink_conf PYTHONPATH=/opt/bugsink/conf:/opt/bugsink/lib/python3.10/site-packages /opt/bugsink/bin/python -m django create_auth_token'
```

The command outputs a 40-character hex token.
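Since the token is pasted into MCP configuration by hand, it is worth sanity-checking its shape before use. A small sketch (the `is_bugsink_token` helper is illustrative, not a Bugsink command) that accepts only a 40-character lowercase hex string:

```shell
# Accept only 40-character lowercase hex strings -- the shape of the
# token printed by create_auth_token. Helper name is illustrative.
is_bugsink_token() {
  case "$1" in
    *[!0-9a-f]*) return 1 ;;  # reject any non-hex character
  esac
  [ "${#1}" -eq 40 ]
}

is_bugsink_token "0123456789abcdef0123456789abcdef01234567" && echo "token shape OK"
# → token shape OK
```

This catches the common copy-paste failures: truncated tokens, trailing whitespace, or an accidentally copied prompt character.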
### Interpreting Errors

**Error Anatomy**:

```
TypeError: Cannot read properties of undefined (reading 'map')
├── Exception Type: TypeError
├── Message: Cannot read properties of undefined (reading 'map')
├── Where: FlyerItemsList.tsx:45:23
├── When: 2026-01-22T10:30:00.000Z
├── Count: 12 occurrences
└── Context:
    ├── URL: GET /api/flyers/123/items
    ├── User: user@example.com
    └── Release: v0.12.5
```

**Common Error Patterns**:

| Pattern                             | Likely Cause                                      | Investigation                                      |
| ----------------------------------- | ------------------------------------------------- | -------------------------------------------------- |
| `TypeError: ... undefined`          | Missing null check, API returned unexpected shape | Check API response, add defensive coding           |
| `DatabaseError: Connection timeout` | Pool exhaustion, slow queries                     | Check `/api/health/db-pool`, review slow query log |
| `RedisConnectionError`              | Redis unavailable                                 | Check Redis service, network connectivity          |
| `ValidationError: ...`              | Invalid input, schema mismatch                    | Review request payload, update validation          |
| `NotFoundError: ...`                | Missing resource                                  | Verify resource exists, check ID format            |

### Error Triage Workflow

1. **Review new issues daily** in Bugsink
2. **Categorize by severity**:
   - **Critical**: Data corruption, security, payment failures
   - **High**: Core feature broken for many users
   - **Medium**: Feature degraded, workaround available
   - **Low**: Minor UX issues, cosmetic bugs
3. **Check occurrence count** - frequent errors need urgent attention
4. **Review stack trace** - identify root cause
5. **Check recent deployments** - did a release introduce this?
6. **Create Gitea issue** if not auto-synced

### Bugsink-to-Gitea Sync

The test environment automatically syncs Bugsink issues to Gitea (see `docs/BUGSINK-SYNC.md`).

**Sync Workflow**:

1. Runs every 15 minutes on test server
2. Fetches unresolved issues from all Bugsink projects
3. Creates Gitea issues with appropriate labels
4.
Marks synced issues as resolved in Bugsink

**Manual Sync**:

```bash
# Trigger sync via API (test environment only)
curl -X POST https://flyer-crawler-test.projectium.com/api/admin/bugsink/sync \
  -H "Authorization: Bearer "
```

---

## Logstash Log Aggregation

Logstash aggregates logs from multiple sources and forwards errors to Bugsink (ADR-050).

### Architecture

```
Log Sources            Logstash           Outputs
┌──────────────┐      ┌─────────────┐      ┌─────────────┐
│ PostgreSQL   │──────│             │──────│ Bugsink     │
│ PM2 Workers  │──────│  Filter     │──────│ (errors)    │
│ Redis        │──────│  & Route    │──────│             │
│ NGINX        │──────│             │──────│ File Logs   │
└──────────────┘      └─────────────┘      │ (all logs)  │
                                           └─────────────┘
```

### Configuration Files

| Path                                                | Purpose                     |
| --------------------------------------------------- | --------------------------- |
| `/etc/logstash/conf.d/bugsink.conf`                 | Main pipeline configuration |
| `/etc/postgresql/14/main/conf.d/observability.conf` | PostgreSQL logging settings |
| `/var/log/logstash/`                                | Logstash file outputs       |
| `/var/lib/logstash/sincedb_*`                       | File position tracking      |

### Log Sources

| Source      | Path                                               | Contents                            |
| ----------- | -------------------------------------------------- | ----------------------------------- |
| PostgreSQL  | `/var/log/postgresql/*.log`                        | Function logs, slow queries, errors |
| PM2 Workers | `/home/gitea-runner/.pm2/logs/flyer-crawler-*.log` | Worker stdout/stderr                |
| Redis       | `/var/log/redis/redis-server.log`                  | Connection errors, memory warnings  |
| NGINX       | `/var/log/nginx/access.log`, `error.log`           | HTTP requests, upstream errors      |

### Pipeline Status

**Check Logstash Service**:

```bash
ssh root@projectium.com

# Service status
systemctl status logstash

# Recent logs
journalctl -u logstash -n 50 --no-pager

# Pipeline statistics
curl -s http://localhost:9600/_node/stats/pipelines?pretty | jq '.pipelines.main.events'

# Event counts (cumulative since Logstash started)
curl -s http://localhost:9600/_node/stats/pipelines?pretty | jq \
  '{ in: .pipelines.main.events.in, out: .pipelines.main.events.out, filtered: .pipelines.main.events.filtered }'
```

**Check Filter Performance**:

```bash
# Grok pattern success/failure rates
curl -s http://localhost:9600/_node/stats/pipelines?pretty | \
  jq '.pipelines.main.plugins.filters[] | select(.name == "grok") | {name, events_in: .events.in, events_out: .events.out, failures}'
```

### Viewing Aggregated Logs

```bash
# PM2 worker logs (all workers combined)
tail -f /var/log/logstash/pm2-workers-$(date +%Y-%m-%d).log

# Redis operational logs
tail -f /var/log/logstash/redis-operational-$(date +%Y-%m-%d).log

# NGINX access logs (parsed)
tail -f /var/log/logstash/nginx-access-$(date +%Y-%m-%d).log

# PostgreSQL function logs
tail -f /var/log/postgresql/postgresql-$(date +%Y-%m-%d).log
```

### Troubleshooting Logstash

| Issue                 | Diagnostic                  | Solution                        |
| --------------------- | --------------------------- | ------------------------------- |
| No events processed   | `systemctl status logstash` | Start/restart service           |
| Config syntax error   | Test config command         | Fix config file                 |
| Grok failures         | Check stats endpoint        | Update grok patterns            |
| Wrong Bugsink project | Check environment tags      | Verify tag routing              |
| Permission denied     | `groups logstash`           | Add to `postgres`, `adm` groups |
| PM2 logs not captured | Check file paths            | Verify log file existence       |
| High disk usage       | Check log rotation          | Configure logrotate             |

**Test Configuration**:

```bash
/usr/share/logstash/bin/logstash --config.test_and_exit -f /etc/logstash/conf.d/bugsink.conf
```

**Restart After Config Change**:

```bash
systemctl restart logstash
journalctl -u logstash -f  # Watch for startup errors
```

---

## PM2 Process Monitoring

PM2 manages the Node.js application processes in production.
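Restart counts can also be watched from a script rather than eyeballed in `pm2 list`. The sketch below flags processes above a restart threshold; in practice the name/restart pairs could be derived from PM2's JSON output via `pm2 jlist` and `jq` (field names are assumptions — verify against your PM2 version):

```shell
# (Sketch) Print names of processes whose restart count exceeds a threshold.
# stdin: "name restarts" pairs, one per line -- e.g. produced by something like
#   pm2 jlist | jq -r '.[] | "\(.name) \(.pm2_env.restart_time)"'
# (jq expression and field names are assumed, not verified here).
flag_restart_loops() {
  threshold=$1
  while read -r name restarts; do
    if [ "$restarts" -gt "$threshold" ]; then
      echo "$name"
    fi
  done
}

printf 'flyer-crawler-api 0\nflyer-crawler-worker 12\n' | flag_restart_loops 5
# → flyer-crawler-worker
```

A cron job wrapping this could feed the alert channels described later in this guide.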
### Process Overview

**Production Processes** (`ecosystem.config.cjs`):

| Process Name                     | Script      | Purpose              | Instances          |
| -------------------------------- | ----------- | -------------------- | ------------------ |
| `flyer-crawler-api`              | `server.ts` | Express API server   | Cluster (max CPUs) |
| `flyer-crawler-worker`           | `worker.ts` | BullMQ job processor | 1                  |
| `flyer-crawler-analytics-worker` | `worker.ts` | Analytics jobs       | 1                  |

**Test Processes** (`ecosystem-test.config.cjs`):

| Process Name                          | Script      | Port | Instances     |
| ------------------------------------- | ----------- | ---- | ------------- |
| `flyer-crawler-api-test`              | `server.ts` | 3002 | 1 (fork mode) |
| `flyer-crawler-worker-test`           | `worker.ts` | N/A  | 1             |
| `flyer-crawler-analytics-worker-test` | `worker.ts` | N/A  | 1             |

### Basic Commands

```bash
ssh root@projectium.com
su - gitea-runner  # PM2 runs under this user

# List all processes
pm2 list

# Process details
pm2 show flyer-crawler-api

# Monitor in real-time
pm2 monit

# View logs
pm2 logs flyer-crawler-api
pm2 logs flyer-crawler-worker --lines 100

# View all logs
pm2 logs

# Restart processes
pm2 restart flyer-crawler-api
pm2 restart all

# Reload without downtime (cluster mode only)
pm2 reload flyer-crawler-api

# Stop processes
pm2 stop flyer-crawler-api
```

### Health Indicators

**Healthy Process**:

```
┌─────────────────────┬────┬─────────┬─────────┬───────┬────────┬─────────┬──────────┐
│ Name                │ id │ mode    │ status  │ cpu   │ mem    │ uptime  │ restarts │
├─────────────────────┼────┼─────────┼─────────┼───────┼────────┼─────────┼──────────┤
│ flyer-crawler-api   │ 0  │ cluster │ online  │ 0.5%  │ 150MB  │ 5d      │ 0        │
│ flyer-crawler-api   │ 1  │ cluster │ online  │ 0.3%  │ 145MB  │ 5d      │ 0        │
│ flyer-crawler-worker│ 2  │ fork    │ online  │ 0.1%  │ 200MB  │ 5d      │ 0        │
└─────────────────────┴────┴─────────┴─────────┴───────┴────────┴─────────┴──────────┘
```

**Warning Signs**:

- `status: errored` - Process crashed
- High `restarts` count - Instability
- High `mem` (>500MB for
API, >1GB for workers) - Memory leak
- Low `uptime` with high restarts - Repeated crashes

### Log File Locations

| Process                | stdout                                                      | stderr          |
| ---------------------- | ----------------------------------------------------------- | --------------- |
| `flyer-crawler-api`    | `/home/gitea-runner/.pm2/logs/flyer-crawler-api-out.log`    | `...-error.log` |
| `flyer-crawler-worker` | `/home/gitea-runner/.pm2/logs/flyer-crawler-worker-out.log` | `...-error.log` |

### Memory Management

PM2 is configured to restart processes when they exceed memory limits:

| Process          | Memory Limit | Action       |
| ---------------- | ------------ | ------------ |
| API              | 500MB        | Auto-restart |
| Worker           | 1GB          | Auto-restart |
| Analytics Worker | 1GB          | Auto-restart |

**Check Memory Usage**:

```bash
pm2 show flyer-crawler-api | grep memory
pm2 show flyer-crawler-worker | grep memory
```

### Restart Strategies

PM2 uses exponential backoff for restarts:

```javascript
{
  max_restarts: 40,
  exp_backoff_restart_delay: 100, // Start at 100ms, exponentially increase
  min_uptime: '10s', // Must run 10s to be considered "started"
}
```

**Force Restart After Repeated Failures**:

```bash
pm2 delete flyer-crawler-api
pm2 start ecosystem.config.cjs --only flyer-crawler-api
```

---

## Database Monitoring

### Connection Pool Status

The application uses a PostgreSQL connection pool with these defaults:

| Setting                   | Value | Purpose                          |
| ------------------------- | ----- | -------------------------------- |
| `max`                     | 20    | Maximum concurrent connections   |
| `idleTimeoutMillis`       | 30000 | Close idle connections after 30s |
| `connectionTimeoutMillis` | 2000  | Fail if connection takes >2s     |

**Check Pool Status via API**:

```bash
curl -s https://flyer-crawler.projectium.com/api/health/db-pool | jq .
# Response
{
  "success": true,
  "data": {
    "message": "Pool Status: 10 total, 8 idle, 0 waiting.",
    "totalCount": 10,
    "idleCount": 8,
    "waitingCount": 0
  }
}
```

**Pool Health Thresholds**:

| Metric              | Healthy | Warning | Critical   |
| ------------------- | ------- | ------- | ---------- |
| Waiting Connections | 0-2     | 3-4     | 5+         |
| Total Connections   | 1-15    | 16-19   | 20 (maxed) |

### Slow Query Logging

PostgreSQL is configured to log slow queries:

```ini
# /etc/postgresql/14/main/conf.d/observability.conf
log_min_duration_statement = 1000  # Log queries over 1 second
```

**View Slow Queries**:

```bash
ssh root@projectium.com
grep "duration:" /var/log/postgresql/postgresql-$(date +%Y-%m-%d).log | tail -20
```

### Database Size Monitoring

```bash
# Connect to production database
psql -h localhost -U flyer_crawler_prod -d flyer-crawler-prod

# Database size
SELECT pg_size_pretty(pg_database_size('flyer-crawler-prod'));

# Table sizes
SELECT relname AS table,
       pg_size_pretty(pg_total_relation_size(relid)) AS total_size,
       pg_size_pretty(pg_relation_size(relid)) AS data_size,
       pg_size_pretty(pg_indexes_size(relid)) AS index_size
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 10;

# Check for bloat
SELECT schemaname, relname, n_dead_tup, n_live_tup,
       round(n_dead_tup * 100.0 / nullif(n_live_tup + n_dead_tup, 0), 2) as dead_pct
FROM pg_stat_user_tables
WHERE n_dead_tup > 1000
ORDER BY n_dead_tup DESC;
```

### Disk Space Monitoring

```bash
# Check PostgreSQL data directory
du -sh /var/lib/postgresql/14/main/

# Check available disk space
df -h /var/lib/postgresql/

# Estimate growth rate
psql -c "SELECT date_trunc('day', created_at) as day, count(*) FROM flyer_items WHERE created_at > now() - interval '7 days' GROUP BY 1 ORDER BY 1;"
```

### Database Health via MCP

```bash
# Query database directly
mcp__devdb__query --sql "SELECT count(*) FROM flyers WHERE created_at > now() - interval '1 day'"

# Check connection count
mcp__devdb__query --sql "SELECT
count(*) FROM pg_stat_activity WHERE datname = 'flyer_crawler_dev'"
```

---

## Redis Monitoring

### Basic Health Check

```bash
# Via API endpoint
curl -s https://flyer-crawler.projectium.com/api/health/redis | jq .

# Direct Redis check (on server)
redis-cli ping  # Should return PONG
```

### Memory Usage

```bash
redis-cli info memory | grep -E "used_memory_human|maxmemory_human|mem_fragmentation_ratio"

# Expected output
used_memory_human:50.00M
maxmemory_human:256.00M
mem_fragmentation_ratio:1.05
```

**Memory Thresholds**:

| Metric              | Healthy     | Warning | Critical |
| ------------------- | ----------- | ------- | -------- |
| Used Memory         | <70% of max | 70-85%  | >85%     |
| Fragmentation Ratio | 1.0-1.5     | 1.5-2.0 | >2.0     |

### Cache Statistics

```bash
redis-cli info stats | grep -E "keyspace_hits|keyspace_misses|evicted_keys"

# Calculate hit rate
# Hit Rate = keyspace_hits / (keyspace_hits + keyspace_misses) * 100
```

**Cache Hit Rate Targets**:

- Excellent: >95%
- Good: 85-95%
- Needs attention: <85%

### Queue Monitoring

BullMQ queues are stored in Redis:

```bash
# List all queues
redis-cli keys "bull:*:id"

# Check queue depths
redis-cli llen "bull:flyer-processing:wait"
redis-cli llen "bull:email-sending:wait"
redis-cli llen "bull:analytics-reporting:wait"

# Check failed jobs
redis-cli llen "bull:flyer-processing:failed"
```

**Queue Depth Thresholds**:

| Queue               | Normal | Warning | Critical |
| ------------------- | ------ | ------- | -------- |
| flyer-processing    | 0-10   | 11-50   | >50      |
| email-sending       | 0-100  | 101-500 | >500     |
| analytics-reporting | 0-5    | 6-20    | >20      |

### Bull Board UI

Access the job queue dashboard:

- **Production**: `https://flyer-crawler.projectium.com/api/admin/jobs` (requires admin auth)
- **Test**: `https://flyer-crawler-test.projectium.com/api/admin/jobs`
- **Dev**: `http://localhost:3001/api/admin/jobs`

Features:

- View all queues and job counts
- Inspect job data and errors
- Retry failed jobs
- Clean completed jobs

### Redis Database
Allocation

| Database | Purpose                  |
| -------- | ------------------------ |
| 0        | BullMQ production queues |
| 1        | BullMQ test queues       |
| 15       | Bugsink sync state       |

---

## Production Alerts and On-Call

### Critical Monitoring Targets

| Service    | Check               | Interval | Alert Threshold        |
| ---------- | ------------------- | -------- | ---------------------- |
| API Server | `/api/health/ready` | 1 min    | 2 consecutive failures |
| Database   | Pool waiting count  | 1 min    | >5 waiting             |
| Redis      | Memory usage        | 5 min    | >85% of maxmemory      |
| Disk Space | `/var/log`          | 15 min   | <10GB free             |
| Worker     | Queue depth         | 5 min    | >50 jobs waiting       |
| Error Rate | Bugsink issue count | 15 min   | >10 new issues/hour    |

### Alert Channels

Configure alerts in your monitoring tool (UptimeRobot, Datadog, etc.):

1. **Slack channel**: `#flyer-crawler-alerts`
2. **Email**: On-call rotation email
3. **PagerDuty**: Critical issues only

### On-Call Response Procedures

**P1 - Critical (Site Down)**:

1. Acknowledge alert within 5 minutes
2. Check `/api/health/ready` - identify failing service
3. Check PM2 status: `pm2 list`
4. Check recent deploys: `git log -5 --oneline`
5. If database: check pool, restart if needed
6. If Redis: check memory, flush if critical
7. If application: restart PM2 processes
8. Document in incident channel

**P2 - High (Degraded Service)**:

1. Acknowledge within 15 minutes
2. Review Bugsink for error patterns
3. Check system resources (CPU, memory, disk)
4. Identify root cause
5. Plan remediation
6. Create Gitea issue if not auto-created

**P3 - Medium (Non-Critical)**:

1. Acknowledge within 1 hour
2. Review during business hours
3.
Create Gitea issue for tracking

### Quick Diagnostic Commands

```bash
# Full system health check
ssh root@projectium.com << 'EOF'
echo "=== Service Status ==="
systemctl status pm2-gitea-runner --no-pager
systemctl status logstash --no-pager
systemctl status redis --no-pager
systemctl status postgresql --no-pager

echo "=== PM2 Processes ==="
su - gitea-runner -c "pm2 list"

echo "=== Disk Space ==="
df -h / /var

echo "=== Memory ==="
free -h

echo "=== Recent Errors ==="
journalctl -p err -n 20 --no-pager
EOF
```

### Runbook Quick Reference

| Symptom         | First Action     | If That Fails         |
| --------------- | ---------------- | --------------------- |
| 503 errors      | Restart PM2      | Check database, Redis |
| Slow responses  | Check DB pool    | Review slow query log |
| High error rate | Check Bugsink    | Review recent deploys |
| Queue backlog   | Restart worker   | Scale workers         |
| Out of memory   | Restart process  | Increase PM2 limit    |
| Disk full       | Clean old logs   | Expand volume         |
| Redis OOM       | Flush cache keys | Increase maxmemory    |

### Post-Incident Review

After any P1/P2 incident:

1. Write incident report within 24 hours
2. Identify root cause
3. Document timeline of events
4. List action items to prevent recurrence
5. Schedule review meeting if needed
6. Update runbooks if new procedures discovered

---

## Related Documentation

- [ADR-015: Application Performance Monitoring](../adr/0015-application-performance-monitoring-and-error-tracking.md)
- [ADR-020: Health Checks](../adr/0020-health-checks-and-liveness-readiness-probes.md)
- [ADR-050: PostgreSQL Function Observability](../adr/0050-postgresql-function-observability.md)
- [ADR-053: Worker Health Checks](../adr/0053-worker-health-checks.md)
- [DEV-CONTAINER-BUGSINK.md](../DEV-CONTAINER-BUGSINK.md)
- [BUGSINK-SYNC.md](../BUGSINK-SYNC.md)
- [LOGSTASH-QUICK-REF.md](LOGSTASH-QUICK-REF.md)
- [LOGSTASH-TROUBLESHOOTING.md](LOGSTASH-TROUBLESHOOTING.md)
- [LOGSTASH_DEPLOYMENT_CHECKLIST.md](../LOGSTASH_DEPLOYMENT_CHECKLIST.md)