Monitoring Guide

This guide covers all aspects of monitoring the Flyer Crawler application across development, test, and production environments.

Table of Contents

  1. Health Checks
  2. Bugsink Error Tracking
  3. Logstash Log Aggregation
  4. PM2 Process Monitoring
  5. Database Monitoring
  6. Redis Monitoring
  7. Production Alerts and On-Call

Health Checks

The application exposes health check endpoints at /api/health/* implementing ADR-020.

Endpoint Reference

| Endpoint | Purpose | Use Case |
| --- | --- | --- |
| /api/health/ping | Simple connectivity | Quick "is it running?" check |
| /api/health/live | Liveness probe | Container orchestration restart trigger |
| /api/health/ready | Readiness probe | Load balancer traffic routing |
| /api/health/startup | Startup probe | Initial container readiness |
| /api/health/db-schema | Schema verification | Deployment validation |
| /api/health/db-pool | Connection pool status | Performance diagnostics |
| /api/health/redis | Redis connectivity | Cache/queue health |
| /api/health/storage | File storage access | Upload capability |
| /api/health/time | Server time sync | Time-sensitive operations |

Liveness Probe (/api/health/live)

Returns 200 OK if the Node.js process is running. No external dependencies.

# Check liveness
curl -s https://flyer-crawler.projectium.com/api/health/live | jq .

# Expected response
{
  "success": true,
  "data": {
    "status": "ok",
    "timestamp": "2026-01-22T10:00:00.000Z"
  }
}

Usage: If this endpoint fails, restart the application immediately.
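If the restart is automated, a minimal watchdog sketch could look like the following (it assumes PM2 runs the app under the gitea-runner user and that restarting flyer-crawler-api is the desired recovery action; adjust the URL per environment):

# Watchdog sketch: restart the API if the liveness probe fails
if ! curl -sf --max-time 5 https://flyer-crawler.projectium.com/api/health/live > /dev/null; then
  su - gitea-runner -c "pm2 restart flyer-crawler-api"
fi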

Readiness Probe (/api/health/ready)

Comprehensive check of all critical dependencies: database, Redis, and storage.

# Check readiness
curl -s https://flyer-crawler.projectium.com/api/health/ready | jq .

# Expected healthy response (200)
{
  "success": true,
  "data": {
    "status": "healthy",
    "timestamp": "2026-01-22T10:00:00.000Z",
    "uptime": 3600.5,
    "services": {
      "database": {
        "status": "healthy",
        "latency": 5,
        "details": {
          "totalConnections": 10,
          "idleConnections": 8,
          "waitingConnections": 0
        }
      },
      "redis": {
        "status": "healthy",
        "latency": 2
      },
      "storage": {
        "status": "healthy",
        "latency": 1,
        "details": {
          "path": "/var/www/flyer-crawler.projectium.com/flyer-images"
        }
      }
    }
  }
}

Status Values:

| Status | Meaning | Action |
| --- | --- | --- |
| healthy | All critical services operational | None required |
| degraded | Non-critical issues (e.g., high connection wait) | Monitor closely |
| unhealthy | Critical service unavailable (returns 503) | Remove from load balancer |
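
An external monitor can drive this logic from the status field (the unhealthy case also returns HTTP 503, so checking the status code is equivalent). A minimal sketch:

# Sketch: exit non-zero when the instance should be removed from rotation
status=$(curl -s https://flyer-crawler.projectium.com/api/health/ready | jq -r '.data.status')
case "$status" in
  healthy)  exit 0 ;;
  degraded) echo "WARN: degraded - monitor closely" ;;
  *)        echo "CRIT: status=$status"; exit 2 ;;
esac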

Database Health Thresholds

| Metric | Healthy | Degraded | Unhealthy |
| --- | --- | --- | --- |
| Query response | SELECT 1 succeeds | N/A | Connection fails |
| Waiting connections | 0-3 | 4+ | N/A |

Verifying Services from CLI

Production:

# Quick health check
curl -s https://flyer-crawler.projectium.com/api/health/ready | jq '.data.status'

# Database pool status
curl -s https://flyer-crawler.projectium.com/api/health/db-pool | jq .

# Redis health
curl -s https://flyer-crawler.projectium.com/api/health/redis | jq .

Test Environment:

# Test environment runs on port 3002
curl -s https://flyer-crawler-test.projectium.com/api/health/ready | jq .

Dev Container:

# From inside the container
curl -s http://localhost:3001/api/health/ready | jq .

# From Windows host (via port mapping)
curl -s http://localhost:3001/api/health/ready | jq .

Admin System Check UI

The admin dashboard at /admin includes a System Check component that runs all health checks with a visual interface:

  1. Navigate to https://flyer-crawler.projectium.com/admin
  2. Login with admin credentials
  3. View the "System Check" section
  4. Click "Re-run Checks" to verify all services

Checks include:

  • Backend Server Connection
  • PM2 Process Status
  • Database Connection Pool
  • Redis Connection
  • Database Schema
  • Default Admin User
  • Assets Storage Directory
  • Gemini API Key

Bugsink Error Tracking

Bugsink is our self-hosted, Sentry-compatible error tracking system (ADR-015).

Access Points

| Environment | URL | Purpose |
| --- | --- | --- |
| Production | https://bugsink.projectium.com | Production and test errors |
| Dev Container | https://localhost:8443 | Local development errors |

Credentials

Production Bugsink:

  • Credentials stored in password manager
  • Admin account created during initial deployment

Dev Container Bugsink:

  • Email: admin@localhost
  • Password: admin

Projects

| Project ID | Name | Environment | Error Source |
| --- | --- | --- | --- |
| 1 | flyer-crawler-backend | Production | Backend Node.js errors |
| 2 | flyer-crawler-frontend | Production | Frontend JavaScript errors |
| 3 | flyer-crawler-backend-test | Test | Test environment backend |
| 4 | flyer-crawler-frontend-test | Test | Test environment frontend |
| 5 | flyer-crawler-infrastructure | Production | PostgreSQL, Redis, NGINX errors |
| 6 | flyer-crawler-test-infrastructure | Test | Test infra errors |

Dev Container Projects (localhost:8000):

  • Project 1: Backend (Dev)
  • Project 2: Frontend (Dev)

Accessing Errors via Web UI

  1. Navigate to the Bugsink URL
  2. Login with credentials
  3. Select project from the sidebar
  4. Click on an issue to view details

Issue Details Include:

  • Exception type and message
  • Full stack trace
  • Request context (URL, method, headers)
  • User context (if authenticated)
  • Occurrence statistics (first seen, last seen, count)
  • Release/version information

Accessing Errors via MCP

Claude Code and other AI tools can access Bugsink via MCP servers.

Available MCP Tools:

# List all projects
mcp__bugsink__list_projects

# List unresolved issues for a project
mcp__bugsink__list_issues --project_id 1 --status unresolved

# Get issue details
mcp__bugsink__get_issue --issue_id <uuid>

# Get stacktrace (pre-rendered Markdown)
mcp__bugsink__get_stacktrace --event_id <uuid>

# List events for an issue
mcp__bugsink__list_events --issue_id <uuid>

MCP Server Configuration:

Production (in ~/.claude/settings.json):

{
  "bugsink": {
    "command": "node",
    "args": ["d:\\gitea\\bugsink-mcp\\dist\\index.js"],
    "env": {
      "BUGSINK_URL": "https://bugsink.projectium.com",
      "BUGSINK_TOKEN": "<token>"
    }
  }
}

Dev Container (in .mcp.json):

{
  "localerrors": {
    "command": "node",
    "args": ["d:\\gitea\\bugsink-mcp\\dist\\index.js"],
    "env": {
      "BUGSINK_URL": "http://127.0.0.1:8000",
      "BUGSINK_TOKEN": "<token>"
    }
  }
}

Creating API Tokens

Bugsink 2.0.11 does not provide a UI for creating API tokens; create one via the Django management command.

Production (user executes on server):

cd /opt/bugsink && bugsink-manage create_auth_token

Dev Container:

MSYS_NO_PATHCONV=1 podman exec \
  -e DATABASE_URL=postgresql://bugsink:bugsink_dev_password@postgres:5432/bugsink \
  -e SECRET_KEY=dev-bugsink-secret-key-minimum-50-characters-for-security \
  flyer-crawler-dev \
  sh -c 'cd /opt/bugsink/conf && DJANGO_SETTINGS_MODULE=bugsink_conf PYTHONPATH=/opt/bugsink/conf:/opt/bugsink/lib/python3.10/site-packages /opt/bugsink/bin/python -m django create_auth_token'

The command outputs a 40-character hex token.

Interpreting Errors

Error Anatomy:

TypeError: Cannot read properties of undefined (reading 'map')
├── Exception Type: TypeError
├── Message: Cannot read properties of undefined (reading 'map')
├── Where: FlyerItemsList.tsx:45:23
├── When: 2026-01-22T10:30:00.000Z
├── Count: 12 occurrences
└── Context:
    ├── URL: GET /api/flyers/123/items
    ├── User: user@example.com
    └── Release: v0.12.5

Common Error Patterns:

| Pattern | Likely Cause | Investigation |
| --- | --- | --- |
| TypeError: ... undefined | Missing null check, API returned unexpected shape | Check API response, add defensive coding |
| DatabaseError: Connection timeout | Pool exhaustion, slow queries | Check /api/health/db-pool, review slow query log |
| RedisConnectionError | Redis unavailable | Check Redis service, network connectivity |
| ValidationError: ... | Invalid input, schema mismatch | Review request payload, update validation |
| NotFoundError: ... | Missing resource | Verify resource exists, check ID format |

Error Triage Workflow

  1. Review new issues daily in Bugsink
  2. Categorize by severity:
    • Critical: Data corruption, security, payment failures
    • High: Core feature broken for many users
    • Medium: Feature degraded, workaround available
    • Low: Minor UX issues, cosmetic bugs
  3. Check occurrence count - frequent errors need urgent attention
  4. Review stack trace - identify root cause
  5. Check recent deployments - did a release introduce this?
  6. Create Gitea issue if not auto-synced

Bugsink-to-Gitea Sync

The test environment automatically syncs Bugsink issues to Gitea (see docs/BUGSINK-SYNC.md).

Sync Workflow:

  1. Runs every 15 minutes on test server
  2. Fetches unresolved issues from all Bugsink projects
  3. Creates Gitea issues with appropriate labels
  4. Marks synced issues as resolved in Bugsink

Manual Sync:

# Trigger sync via API (test environment only)
curl -X POST https://flyer-crawler-test.projectium.com/api/admin/bugsink/sync \
  -H "Authorization: Bearer <admin_jwt>"

Logstash Log Aggregation

Logstash aggregates logs from multiple sources and forwards errors to Bugsink (ADR-050).

Architecture

Log Sources                    Logstash                  Outputs
┌──────────────┐              ┌─────────────┐           ┌─────────────┐
│ PostgreSQL   │──────────────│             │───────────│ Bugsink     │
│ PM2 Workers  │──────────────│   Filter    │───────────│ (errors)    │
│ Redis        │──────────────│   & Route   │───────────│             │
│ NGINX        │──────────────│             │───────────│ File Logs   │
└──────────────┘              └─────────────┘           │ (all logs)  │
                                                        └─────────────┘

Configuration Files

| Path | Purpose |
| --- | --- |
| /etc/logstash/conf.d/bugsink.conf | Main pipeline configuration |
| /etc/postgresql/14/main/conf.d/observability.conf | PostgreSQL logging settings |
| /var/log/logstash/ | Logstash file outputs |
| /var/lib/logstash/sincedb_* | File position tracking |

Log Sources

| Source | Path | Contents |
| --- | --- | --- |
| PostgreSQL | /var/log/postgresql/*.log | Function logs, slow queries, errors |
| PM2 Workers | /home/gitea-runner/.pm2/logs/flyer-crawler-*.log | Worker stdout/stderr |
| Redis | /var/log/redis/redis-server.log | Connection errors, memory warnings |
| NGINX | /var/log/nginx/access.log, error.log | HTTP requests, upstream errors |

Pipeline Status

Check Logstash Service (user executes on server):

# Service status
systemctl status logstash

# Recent logs
journalctl -u logstash -n 50 --no-pager

# Pipeline statistics
curl -s http://localhost:9600/_node/stats/pipelines?pretty | jq '.pipelines.main.events'

# Events processed today
curl -s http://localhost:9600/_node/stats/pipelines?pretty | jq '{
  in: .pipelines.main.events.in,
  out: .pipelines.main.events.out,
  filtered: .pipelines.main.events.filtered
}'

Check Filter Performance:

# Grok pattern success/failure rates
curl -s http://localhost:9600/_node/stats/pipelines?pretty | \
  jq '.pipelines.main.plugins.filters[] | select(.name == "grok") | {name, events_in: .events.in, events_out: .events.out, failures}'

Viewing Aggregated Logs

# PM2 worker logs (all workers combined)
tail -f /var/log/logstash/pm2-workers-$(date +%Y-%m-%d).log

# Redis operational logs
tail -f /var/log/logstash/redis-operational-$(date +%Y-%m-%d).log

# NGINX access logs (parsed)
tail -f /var/log/logstash/nginx-access-$(date +%Y-%m-%d).log

# PostgreSQL function logs
tail -f /var/log/postgresql/postgresql-$(date +%Y-%m-%d).log

Troubleshooting Logstash

| Issue | Diagnostic | Solution |
| --- | --- | --- |
| No events processed | systemctl status logstash | Start/restart service |
| Config syntax error | Test config command | Fix config file |
| Grok failures | Check stats endpoint | Update grok patterns |
| Wrong Bugsink project | Check environment tags | Verify tag routing |
| Permission denied | groups logstash | Add to postgres, adm groups |
| PM2 logs not captured | Check file paths | Verify log file existence |
| High disk usage | Check log rotation | Configure logrotate |
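
Several of these issues come down to file access. A quick sketch for verifying that the logstash user can read each source path listed in the Log Sources table above:

# Sketch: confirm the logstash user can read every configured log source
for f in /var/log/postgresql/*.log \
         /home/gitea-runner/.pm2/logs/flyer-crawler-*.log \
         /var/log/redis/redis-server.log \
         /var/log/nginx/access.log /var/log/nginx/error.log; do
  sudo -u logstash test -r "$f" && echo "OK      $f" || echo "DENIED  $f"
done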

Test Configuration:

/usr/share/logstash/bin/logstash --config.test_and_exit -f /etc/logstash/conf.d/bugsink.conf

Restart After Config Change:

systemctl restart logstash
journalctl -u logstash -f  # Watch for startup errors

PM2 Process Monitoring

PM2 manages the Node.js application processes in production.

Process Overview

Production Processes (ecosystem.config.cjs):

| Process Name | Script | Purpose | Instances |
| --- | --- | --- | --- |
| flyer-crawler-api | server.ts | Express API server | Cluster (max CPUs) |
| flyer-crawler-worker | worker.ts | BullMQ job processor | 1 |
| flyer-crawler-analytics-worker | worker.ts | Analytics jobs | 1 |

Test Processes (ecosystem-test.config.cjs):

| Process Name | Script | Port | Instances |
| --- | --- | --- | --- |
| flyer-crawler-api-test | server.ts | 3002 | 1 (fork mode) |
| flyer-crawler-worker-test | worker.ts | N/A | 1 |
| flyer-crawler-analytics-worker-test | worker.ts | N/A | 1 |

Basic Commands

Note: These commands are for the user to execute on the server. Claude Code provides commands but cannot run them directly.

# Switch to gitea-runner user (PM2 runs under this user)
su - gitea-runner

# List all processes
pm2 list

# Process details
pm2 show flyer-crawler-api

# Monitor in real-time
pm2 monit

# View logs
pm2 logs flyer-crawler-api
pm2 logs flyer-crawler-worker --lines 100

# View all logs
pm2 logs

# Restart processes
pm2 restart flyer-crawler-api
pm2 restart all

# Reload without downtime (cluster mode only)
pm2 reload flyer-crawler-api

# Stop processes
pm2 stop flyer-crawler-api

Health Indicators

Healthy Process:

┌─────────────────────┬────┬─────────┬─────────┬───────┬────────┬─────────┬──────────┐
│ Name                │ id │ mode    │ status  │ cpu   │ mem    │ uptime  │ restarts │
├─────────────────────┼────┼─────────┼─────────┼───────┼────────┼─────────┼──────────┤
│ flyer-crawler-api   │ 0  │ cluster │ online  │ 0.5%  │ 150MB  │ 5d      │ 0        │
│ flyer-crawler-api   │ 1  │ cluster │ online  │ 0.3%  │ 145MB  │ 5d      │ 0        │
│ flyer-crawler-worker│ 2  │ fork    │ online  │ 0.1%  │ 200MB  │ 5d      │ 0        │
└─────────────────────┴────┴─────────┴─────────┴───────┴────────┴─────────┴──────────┘

Warning Signs:

  • status: errored - Process crashed
  • High restarts count - Instability
  • High mem (>500MB for API, >1GB for workers) - Memory leak
  • Low uptime with high restarts - Repeated crashes
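
These signs can also be checked from a script. A minimal sketch using the JSON output of pm2 jlist (field names such as pm2_env.restart_time and monit.memory are assumptions based on current PM2 releases; the restart threshold of 10 is arbitrary):

# Sketch: flag errored processes and high restart counts from PM2's JSON output
su - gitea-runner -c "pm2 jlist" | jq -r '
  .[]
  | select(.pm2_env.status != "online" or .pm2_env.restart_time > 10)
  | "\(.name): status=\(.pm2_env.status) restarts=\(.pm2_env.restart_time) mem=\(.monit.memory / 1048576 | floor)MB"'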

Log File Locations

| Process | stdout | stderr |
| --- | --- | --- |
| flyer-crawler-api | /home/gitea-runner/.pm2/logs/flyer-crawler-api-out.log | ...-error.log |
| flyer-crawler-worker | /home/gitea-runner/.pm2/logs/flyer-crawler-worker-out.log | ...-error.log |

Memory Management

PM2 is configured to restart processes when they exceed memory limits:

| Process | Memory Limit | Action |
| --- | --- | --- |
| API | 500MB | Auto-restart |
| Worker | 1GB | Auto-restart |
| Analytics Worker | 1GB | Auto-restart |

Check Memory Usage:

pm2 show flyer-crawler-api | grep memory
pm2 show flyer-crawler-worker | grep memory

Restart Strategies

PM2 uses exponential backoff for restarts:

{
  max_restarts: 40,
  exp_backoff_restart_delay: 100,  // Start at 100ms, exponentially increase
  min_uptime: '10s',  // Must run 10s to be considered "started"
}

Force Restart After Repeated Failures:

pm2 delete flyer-crawler-api
pm2 start ecosystem.config.cjs --only flyer-crawler-api

Database Monitoring

Connection Pool Status

The application uses a PostgreSQL connection pool with these defaults:

| Setting | Value | Purpose |
| --- | --- | --- |
| max | 20 | Maximum concurrent connections |
| idleTimeoutMillis | 30000 | Close idle connections after 30s |
| connectionTimeoutMillis | 2000 | Fail if connection takes >2s |

Check Pool Status via API:

curl -s https://flyer-crawler.projectium.com/api/health/db-pool | jq .

# Response
{
  "success": true,
  "data": {
    "message": "Pool Status: 10 total, 8 idle, 0 waiting.",
    "totalCount": 10,
    "idleCount": 8,
    "waitingCount": 0
  }
}

Pool Health Thresholds:

| Metric | Healthy | Warning | Critical |
| --- | --- | --- | --- |
| Waiting Connections | 0-2 | 3-4 | 5+ |
| Total Connections | 1-15 | 16-19 | 20 (maxed) |
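
A minimal sketch that maps the waitingCount reported by /api/health/db-pool onto these thresholds:

# Sketch: classify pool health by waiting connections
waiting=$(curl -s https://flyer-crawler.projectium.com/api/health/db-pool | jq '.data.waitingCount')
if   [ "$waiting" -ge 5 ]; then echo "CRITICAL: $waiting waiting connections"
elif [ "$waiting" -ge 3 ]; then echo "WARNING: $waiting waiting connections"
else echo "OK: $waiting waiting connections"
fi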

Slow Query Logging

PostgreSQL is configured to log slow queries:

# /etc/postgresql/14/main/conf.d/observability.conf
log_min_duration_statement = 1000  # Log queries over 1 second

View Slow Queries:

ssh root@projectium.com
grep "duration:" /var/log/postgresql/postgresql-$(date +%Y-%m-%d).log | tail -20

Database Size Monitoring

# Connect to production database
psql -h localhost -U flyer_crawler_prod -d flyer-crawler-prod

# Database size
SELECT pg_size_pretty(pg_database_size('flyer-crawler-prod'));

# Table sizes
SELECT
  relname AS table,
  pg_size_pretty(pg_total_relation_size(relid)) AS total_size,
  pg_size_pretty(pg_relation_size(relid)) AS data_size,
  pg_size_pretty(pg_indexes_size(relid)) AS index_size
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 10;

# Check for bloat
SELECT schemaname, relname, n_dead_tup, n_live_tup,
       round(n_dead_tup * 100.0 / nullif(n_live_tup + n_dead_tup, 0), 2) as dead_pct
FROM pg_stat_user_tables
WHERE n_dead_tup > 1000
ORDER BY n_dead_tup DESC;

Disk Space Monitoring

# Check PostgreSQL data directory
du -sh /var/lib/postgresql/14/main/

# Check available disk space
df -h /var/lib/postgresql/

# Estimate growth rate
psql -c "SELECT date_trunc('day', created_at) as day, count(*)
         FROM flyer_items
         WHERE created_at > now() - interval '7 days'
         GROUP BY 1 ORDER BY 1;"

Database Health via MCP

# Query database directly
mcp__devdb__query --sql "SELECT count(*) FROM flyers WHERE created_at > now() - interval '1 day'"

# Check connection count
mcp__devdb__query --sql "SELECT count(*) FROM pg_stat_activity WHERE datname = 'flyer_crawler_dev'"

Redis Monitoring

Basic Health Check

# Via API endpoint
curl -s https://flyer-crawler.projectium.com/api/health/redis | jq .

# Direct Redis check (on server)
redis-cli ping  # Should return PONG

Memory Usage

redis-cli info memory | grep -E "used_memory_human|maxmemory_human|mem_fragmentation_ratio"

# Expected output
used_memory_human:50.00M
maxmemory_human:256.00M
mem_fragmentation_ratio:1.05

Memory Thresholds:

| Metric | Healthy | Warning | Critical |
| --- | --- | --- | --- |
| Used Memory | <70% of max | 70-85% | >85% |
| Fragmentation Ratio | 1.0-1.5 | 1.5-2.0 | >2.0 |
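
To express usage as a percentage of maxmemory, a small sketch (note that maxmemory reports 0 when no limit is configured, in which case the percentage is undefined):

# Sketch: used memory as a percentage of maxmemory
redis-cli info memory | awk -F: '
  /^used_memory:/ { used = $2 }
  /^maxmemory:/   { max = $2 }
  END { if (max > 0) printf "used: %.1f%% of maxmemory\n", used * 100 / max; else print "maxmemory not set" }'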

Cache Statistics

redis-cli info stats | grep -E "keyspace_hits|keyspace_misses|evicted_keys"

# Calculate hit rate
# Hit Rate = keyspace_hits / (keyspace_hits + keyspace_misses) * 100
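
A small sketch that performs this calculation directly from the stats output:

# Sketch: compute the cache hit rate from Redis stats
redis-cli info stats | awk -F: '
  /^keyspace_hits:/   { hits = $2 }
  /^keyspace_misses:/ { misses = $2 }
  END { if (hits + misses > 0) printf "hit rate: %.2f%%\n", hits * 100 / (hits + misses); else print "no keyspace activity yet" }'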

Cache Hit Rate Targets:

  • Excellent: >95%
  • Good: 85-95%
  • Needs attention: <85%

Queue Monitoring

BullMQ queues are stored in Redis:

# List all queues
redis-cli keys "bull:*:id"

# Check queue depths
redis-cli llen "bull:flyer-processing:wait"
redis-cli llen "bull:email-sending:wait"
redis-cli llen "bull:analytics-reporting:wait"

# Check failed jobs
redis-cli llen "bull:flyer-processing:failed"

Queue Depth Thresholds:

| Queue | Normal | Warning | Critical |
| --- | --- | --- | --- |
| flyer-processing | 0-10 | 11-50 | >50 |
| email-sending | 0-100 | 101-500 | >500 |
| analytics-reporting | 0-5 | 6-20 | >20 |
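
A minimal sketch that compares each queue's depth against the warning thresholds above (queue names and limits taken from the table):

# Sketch: warn when a queue exceeds its warning threshold
check_queue() {
  depth=$(redis-cli llen "bull:$1:wait")
  [ "$depth" -gt "$2" ] && echo "WARNING: $1 has $depth waiting jobs (normal is 0-$2)"
}
check_queue flyer-processing 10
check_queue email-sending 100
check_queue analytics-reporting 5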

Bull Board UI

Access the job queue dashboard:

  • Production: https://flyer-crawler.projectium.com/api/admin/jobs (requires admin auth)
  • Test: https://flyer-crawler-test.projectium.com/api/admin/jobs
  • Dev: http://localhost:3001/api/admin/jobs

Features:

  • View all queues and job counts
  • Inspect job data and errors
  • Retry failed jobs
  • Clean completed jobs

Redis Database Allocation

| Database | Purpose |
| --- | --- |
| 0 | BullMQ production queues |
| 1 | BullMQ test queues |
| 15 | Bugsink sync state |
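
The redis-cli examples above operate on database 0 by default. To inspect the test queues or sync state, select the database with -n, for example:

# Test queue depth (database 1)
redis-cli -n 1 llen "bull:flyer-processing:wait"

# Bugsink sync state keys (database 15)
redis-cli -n 15 keys "*"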

Production Alerts and On-Call

Critical Monitoring Targets

| Service | Check | Interval | Alert Threshold |
| --- | --- | --- | --- |
| API Server | /api/health/ready | 1 min | 2 consecutive failures |
| Database Pool | waiting count | 1 min | >5 waiting |
| Redis | Memory usage | 5 min | >85% of maxmemory |
| Disk Space | /var/log | 15 min | <10GB free |
| Worker | Queue depth | 5 min | >50 jobs waiting |
| Error Rate | Bugsink issue count | 15 min | >10 new issues/hour |
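
Where no external monitoring tool is in place yet, a minimal cron-able sketch of the "2 consecutive failures" rule for the API check (the state file path and the echo-based alert are placeholders):

# Sketch: alert only after two consecutive readiness failures
STATE=/var/tmp/flyer-crawler-ready.fails
if curl -sf --max-time 10 https://flyer-crawler.projectium.com/api/health/ready > /dev/null; then
  rm -f "$STATE"
else
  fails=$(( $(cat "$STATE" 2>/dev/null || echo 0) + 1 ))
  echo "$fails" > "$STATE"
  [ "$fails" -ge 2 ] && echo "ALERT: /api/health/ready failing ($fails consecutive checks)"
fi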

Alert Channels

Configure alerts in your monitoring tool (UptimeRobot, Datadog, etc.):

  1. Slack channel: #flyer-crawler-alerts
  2. Email: On-call rotation email
  3. PagerDuty: Critical issues only

On-Call Response Procedures

P1 - Critical (Site Down):

  1. Acknowledge alert within 5 minutes
  2. Check /api/health/ready - identify failing service
  3. Check PM2 status: pm2 list
  4. Check recent deploys: git log -5 --oneline
  5. If database: check pool, restart if needed
  6. If Redis: check memory, flush if critical
  7. If application: restart PM2 processes
  8. Document in incident channel

P2 - High (Degraded Service):

  1. Acknowledge within 15 minutes
  2. Review Bugsink for error patterns
  3. Check system resources (CPU, memory, disk)
  4. Identify root cause
  5. Plan remediation
  6. Create Gitea issue if not auto-created

P3 - Medium (Non-Critical):

  1. Acknowledge within 1 hour
  2. Review during business hours
  3. Create Gitea issue for tracking

Quick Diagnostic Commands

Note: The user executes these commands on the server. Claude Code provides commands but cannot run them directly.

# Service status checks
systemctl status pm2-gitea-runner --no-pager
systemctl status logstash --no-pager
systemctl status redis --no-pager
systemctl status postgresql --no-pager

# PM2 processes (run as gitea-runner)
su - gitea-runner -c "pm2 list"

# Disk space
df -h / /var

# Memory
free -h

# Recent errors
journalctl -p err -n 20 --no-pager

Runbook Quick Reference

| Symptom | First Action | If That Fails |
| --- | --- | --- |
| 503 errors | Restart PM2 | Check database, Redis |
| Slow responses | Check DB pool | Review slow query log |
| High error rate | Check Bugsink | Review recent deploys |
| Queue backlog | Restart worker | Scale workers |
| Out of memory | Restart process | Increase PM2 limit |
| Disk full | Clean old logs | Expand volume |
| Redis OOM | Flush cache keys | Increase maxmemory |

Post-Incident Review

After any P1/P2 incident:

  1. Write incident report within 24 hours
  2. Identify root cause
  3. Document timeline of events
  4. List action items to prevent recurrence
  5. Schedule review meeting if needed
  6. Update runbooks if new procedures discovered