Monitoring Guide

This guide covers all aspects of monitoring the Flyer Crawler application across development, test, and production environments.

Table of Contents

  1. Health Checks
  2. Bugsink Error Tracking
  3. Logstash Log Aggregation
  4. PM2 Process Monitoring
  5. Database Monitoring
  6. Redis Monitoring
  7. Production Alerts and On-Call

Health Checks

The application exposes health check endpoints at /api/health/* implementing ADR-020.

Endpoint Reference

| Endpoint | Purpose | Use Case |
| --- | --- | --- |
| /api/health/ping | Simple connectivity | Quick "is it running?" check |
| /api/health/live | Liveness probe | Container orchestration restart trigger |
| /api/health/ready | Readiness probe | Load balancer traffic routing |
| /api/health/startup | Startup probe | Initial container readiness |
| /api/health/db-schema | Schema verification | Deployment validation |
| /api/health/db-pool | Connection pool status | Performance diagnostics |
| /api/health/redis | Redis connectivity | Cache/queue health |
| /api/health/storage | File storage access | Upload capability |
| /api/health/time | Server time sync | Time-sensitive operations |

Liveness Probe (/api/health/live)

Returns 200 OK if the Node.js process is running. No external dependencies.

# Check liveness
curl -s https://flyer-crawler.projectium.com/api/health/live | jq .

# Expected response
{
  "success": true,
  "data": {
    "status": "ok",
    "timestamp": "2026-01-22T10:00:00.000Z"
  }
}

Usage: If this endpoint fails, restart the application immediately.
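If the restart is automated, a minimal watchdog sketch could look like the following (it assumes PM2 runs the app under the gitea-runner user and that restarting flyer-crawler-api is the desired recovery action; adjust the URL per environment):

# Watchdog sketch: restart the API if the liveness probe fails
if ! curl -sf --max-time 5 https://flyer-crawler.projectium.com/api/health/live > /dev/null; then
  su - gitea-runner -c "pm2 restart flyer-crawler-api"
fi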

Readiness Probe (/api/health/ready)

Comprehensive check of all critical dependencies: database, Redis, and storage.

# Check readiness
curl -s https://flyer-crawler.projectium.com/api/health/ready | jq .

# Expected healthy response (200)
{
  "success": true,
  "data": {
    "status": "healthy",
    "timestamp": "2026-01-22T10:00:00.000Z",
    "uptime": 3600.5,
    "services": {
      "database": {
        "status": "healthy",
        "latency": 5,
        "details": {
          "totalConnections": 10,
          "idleConnections": 8,
          "waitingConnections": 0
        }
      },
      "redis": {
        "status": "healthy",
        "latency": 2
      },
      "storage": {
        "status": "healthy",
        "latency": 1,
        "details": {
          "path": "/var/www/flyer-crawler.projectium.com/flyer-images"
        }
      }
    }
  }
}

Status Values:

| Status | Meaning | Action |
| --- | --- | --- |
| healthy | All critical services operational | None required |
| degraded | Non-critical issues (e.g., high connection wait) | Monitor closely |
| unhealthy | Critical service unavailable (returns 503) | Remove from load balancer |
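
An external monitor can drive this logic from the status field (the unhealthy case also returns HTTP 503, so checking the status code is equivalent). A minimal sketch:

# Sketch: exit non-zero when the instance should be removed from rotation
status=$(curl -s https://flyer-crawler.projectium.com/api/health/ready | jq -r '.data.status')
case "$status" in
  healthy)  exit 0 ;;
  degraded) echo "WARN: degraded - monitor closely" ;;
  *)        echo "CRIT: status=$status"; exit 2 ;;
esac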

Database Health Thresholds

| Metric | Healthy | Degraded | Unhealthy |
| --- | --- | --- | --- |
| Query response | SELECT 1 succeeds | N/A | Connection fails |
| Waiting connections | 0-3 | 4+ | N/A |

Verifying Services from CLI

Production:

# Quick health check
curl -s https://flyer-crawler.projectium.com/api/health/ready | jq '.data.status'

# Database pool status
curl -s https://flyer-crawler.projectium.com/api/health/db-pool | jq .

# Redis health
curl -s https://flyer-crawler.projectium.com/api/health/redis | jq .

Test Environment:

# Test environment runs on port 3002
curl -s https://flyer-crawler-test.projectium.com/api/health/ready | jq .

Dev Container:

# From inside the container
curl -s http://localhost:3001/api/health/ready | jq .

# From Windows host (via port mapping)
curl -s http://localhost:3001/api/health/ready | jq .

Admin System Check UI

The admin dashboard at /admin includes a System Check component that runs all health checks with a visual interface:

  1. Navigate to https://flyer-crawler.projectium.com/admin
  2. Login with admin credentials
  3. View the "System Check" section
  4. Click "Re-run Checks" to verify all services

Checks include:

  • Backend Server Connection
  • PM2 Process Status
  • Database Connection Pool
  • Redis Connection
  • Database Schema
  • Default Admin User
  • Assets Storage Directory
  • Gemini API Key

Bugsink Error Tracking

Bugsink is our self-hosted, Sentry-compatible error tracking system (ADR-015).

Access Points

| Environment | URL | Purpose |
| --- | --- | --- |
| Production | https://bugsink.projectium.com | Production and test errors |
| Dev Container | https://localhost:8443 | Local development errors |

Credentials

Production Bugsink:

  • Credentials stored in password manager
  • Admin account created during initial deployment

Dev Container Bugsink:

  • Email: admin@localhost
  • Password: admin

Projects

| Project ID | Name | Environment | Error Source |
| --- | --- | --- | --- |
| 1 | flyer-crawler-backend | Production | Backend Node.js errors |
| 2 | flyer-crawler-frontend | Production | Frontend JavaScript errors |
| 3 | flyer-crawler-backend-test | Test | Test environment backend |
| 4 | flyer-crawler-frontend-test | Test | Test environment frontend |
| 5 | flyer-crawler-infrastructure | Production | PostgreSQL, Redis, NGINX errors |
| 6 | flyer-crawler-test-infrastructure | Test | Test infra errors |

Dev Container Projects (localhost:8000):

  • Project 1: Backend (Dev)
  • Project 2: Frontend (Dev)

Accessing Errors via Web UI

  1. Navigate to the Bugsink URL
  2. Login with credentials
  3. Select project from the sidebar
  4. Click on an issue to view details

Issue Details Include:

  • Exception type and message
  • Full stack trace
  • Request context (URL, method, headers)
  • User context (if authenticated)
  • Occurrence statistics (first seen, last seen, count)
  • Release/version information

Accessing Errors via MCP

Claude Code and other AI tools can access Bugsink via MCP servers.

Available MCP Tools:

# List all projects
mcp__bugsink__list_projects

# List unresolved issues for a project
mcp__bugsink__list_issues --project_id 1 --status unresolved

# Get issue details
mcp__bugsink__get_issue --issue_id <uuid>

# Get stacktrace (pre-rendered Markdown)
mcp__bugsink__get_stacktrace --event_id <uuid>

# List events for an issue
mcp__bugsink__list_events --issue_id <uuid>

MCP Server Configuration:

Production (in ~/.claude/settings.json):

{
  "bugsink": {
    "command": "node",
    "args": ["d:\\gitea\\bugsink-mcp\\dist\\index.js"],
    "env": {
      "BUGSINK_URL": "https://bugsink.projectium.com",
      "BUGSINK_TOKEN": "<token>"
    }
  }
}

Dev Container (in .mcp.json):

{
  "localerrors": {
    "command": "node",
    "args": ["d:\\gitea\\bugsink-mcp\\dist\\index.js"],
    "env": {
      "BUGSINK_URL": "http://127.0.0.1:8000",
      "BUGSINK_TOKEN": "<token>"
    }
  }
}

Creating API Tokens

Bugsink 2.0.11 does not provide a UI for creating API tokens; create one via the Django management command.

Production (user executes on server):

cd /opt/bugsink && bugsink-manage create_auth_token

Dev Container:

MSYS_NO_PATHCONV=1 podman exec \
  -e DATABASE_URL=postgresql://bugsink:bugsink_dev_password@postgres:5432/bugsink \
  -e SECRET_KEY=dev-bugsink-secret-key-minimum-50-characters-for-security \
  flyer-crawler-dev \
  sh -c 'cd /opt/bugsink/conf && DJANGO_SETTINGS_MODULE=bugsink_conf PYTHONPATH=/opt/bugsink/conf:/opt/bugsink/lib/python3.10/site-packages /opt/bugsink/bin/python -m django create_auth_token'

The command outputs a 40-character hex token.

Interpreting Errors

Error Anatomy:

TypeError: Cannot read properties of undefined (reading 'map')
├── Exception Type: TypeError
├── Message: Cannot read properties of undefined (reading 'map')
├── Where: FlyerItemsList.tsx:45:23
├── When: 2026-01-22T10:30:00.000Z
├── Count: 12 occurrences
└── Context:
    ├── URL: GET /api/flyers/123/items
    ├── User: user@example.com
    └── Release: v0.12.5

Common Error Patterns:

| Pattern | Likely Cause | Investigation |
| --- | --- | --- |
| TypeError: ... undefined | Missing null check, API returned unexpected shape | Check API response, add defensive coding |
| DatabaseError: Connection timeout | Pool exhaustion, slow queries | Check /api/health/db-pool, review slow query log |
| RedisConnectionError | Redis unavailable | Check Redis service, network connectivity |
| ValidationError: ... | Invalid input, schema mismatch | Review request payload, update validation |
| NotFoundError: ... | Missing resource | Verify resource exists, check ID format |

Error Triage Workflow

  1. Review new issues daily in Bugsink
  2. Categorize by severity:
    • Critical: Data corruption, security, payment failures
    • High: Core feature broken for many users
    • Medium: Feature degraded, workaround available
    • Low: Minor UX issues, cosmetic bugs
  3. Check occurrence count - frequent errors need urgent attention
  4. Review stack trace - identify root cause
  5. Check recent deployments - did a release introduce this?
  6. Create Gitea issue if not auto-synced

Bugsink-to-Gitea Sync

The test environment automatically syncs Bugsink issues to Gitea (see docs/BUGSINK-SYNC.md).

Sync Workflow:

  1. Runs every 15 minutes on test server
  2. Fetches unresolved issues from all Bugsink projects
  3. Creates Gitea issues with appropriate labels
  4. Marks synced issues as resolved in Bugsink

Manual Sync:

# Trigger sync via API (test environment only)
curl -X POST https://flyer-crawler-test.projectium.com/api/admin/bugsink/sync \
  -H "Authorization: Bearer <admin_jwt>"

Logstash Log Aggregation

Logstash aggregates logs from multiple sources and forwards errors to Bugsink (ADR-050).

Architecture

Log Sources                    Logstash                  Outputs
┌──────────────┐              ┌─────────────┐           ┌─────────────┐
│ PostgreSQL   │──────────────│             │───────────│ Bugsink     │
│ PM2 Workers  │──────────────│   Filter    │───────────│ (errors)    │
│ Redis        │──────────────│   & Route   │───────────│             │
│ NGINX        │──────────────│             │───────────│ File Logs   │
└──────────────┘              └─────────────┘           │ (all logs)  │
                                                        └─────────────┘

Configuration Files

| Path | Purpose |
| --- | --- |
| /etc/logstash/conf.d/bugsink.conf | Main pipeline configuration |
| /etc/postgresql/14/main/conf.d/observability.conf | PostgreSQL logging settings |
| /var/log/logstash/ | Logstash file outputs |
| /var/lib/logstash/sincedb_* | File position tracking |

Log Sources

| Source | Path | Contents |
| --- | --- | --- |
| PostgreSQL | /var/log/postgresql/*.log | Function logs, slow queries, errors |
| PM2 Workers | /home/gitea-runner/.pm2/logs/flyer-crawler-*.log | Worker stdout/stderr |
| Redis | /var/log/redis/redis-server.log | Connection errors, memory warnings |
| NGINX | /var/log/nginx/access.log, error.log | HTTP requests, upstream errors |

Pipeline Status

Check Logstash Service (user executes on server):

# Service status
systemctl status logstash

# Recent logs
journalctl -u logstash -n 50 --no-pager

# Pipeline statistics
curl -s http://localhost:9600/_node/stats/pipelines?pretty | jq '.pipelines.main.events'

# Events processed today
curl -s http://localhost:9600/_node/stats/pipelines?pretty | jq '{
  in: .pipelines.main.events.in,
  out: .pipelines.main.events.out,
  filtered: .pipelines.main.events.filtered
}'

Check Filter Performance:

# Grok pattern success/failure rates
curl -s http://localhost:9600/_node/stats/pipelines?pretty | \
  jq '.pipelines.main.plugins.filters[] | select(.name == "grok") | {name, events_in: .events.in, events_out: .events.out, failures}'

Viewing Aggregated Logs

# PM2 worker logs (all workers combined)
tail -f /var/log/logstash/pm2-workers-$(date +%Y-%m-%d).log

# Redis operational logs
tail -f /var/log/logstash/redis-operational-$(date +%Y-%m-%d).log

# NGINX access logs (parsed)
tail -f /var/log/logstash/nginx-access-$(date +%Y-%m-%d).log

# PostgreSQL function logs
tail -f /var/log/postgresql/postgresql-$(date +%Y-%m-%d).log

Troubleshooting Logstash

| Issue | Diagnostic | Solution |
| --- | --- | --- |
| No events processed | systemctl status logstash | Start/restart service |
| Config syntax error | Test config command | Fix config file |
| Grok failures | Check stats endpoint | Update grok patterns |
| Wrong Bugsink project | Check environment tags | Verify tag routing |
| Permission denied | groups logstash | Add to postgres, adm groups |
| PM2 logs not captured | Check file paths | Verify log file existence |
| High disk usage | Check log rotation | Configure logrotate |
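
Several of these issues come down to file access. A quick sketch for verifying that the logstash user can read each source path listed in the Log Sources table above:

# Sketch: confirm the logstash user can read every configured log source
for f in /var/log/postgresql/*.log \
         /home/gitea-runner/.pm2/logs/flyer-crawler-*.log \
         /var/log/redis/redis-server.log \
         /var/log/nginx/access.log /var/log/nginx/error.log; do
  sudo -u logstash test -r "$f" && echo "OK      $f" || echo "DENIED  $f"
done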

Test Configuration:

/usr/share/logstash/bin/logstash --config.test_and_exit -f /etc/logstash/conf.d/bugsink.conf

Restart After Config Change:

systemctl restart logstash
journalctl -u logstash -f  # Watch for startup errors

PM2 Process Monitoring

PM2 manages the Node.js application processes in production.

Process Overview

Production Processes (ecosystem.config.cjs):

| Process Name | Script | Purpose | Instances |
| --- | --- | --- | --- |
| flyer-crawler-api | server.ts | Express API server | Cluster (max CPUs) |
| flyer-crawler-worker | worker.ts | BullMQ job processor | 1 |
| flyer-crawler-analytics-worker | worker.ts | Analytics jobs | 1 |

Test Processes (ecosystem-test.config.cjs):

| Process Name | Script | Port | Instances |
| --- | --- | --- | --- |
| flyer-crawler-api-test | server.ts | 3002 | 1 (fork mode) |
| flyer-crawler-worker-test | worker.ts | N/A | 1 |
| flyer-crawler-analytics-worker-test | worker.ts | N/A | 1 |

Basic Commands

Note: These commands are for the user to execute on the server. Claude Code provides commands but cannot run them directly.

# Switch to gitea-runner user (PM2 runs under this user)
su - gitea-runner

# List all processes
pm2 list

# Process details
pm2 show flyer-crawler-api

# Monitor in real-time
pm2 monit

# View logs
pm2 logs flyer-crawler-api
pm2 logs flyer-crawler-worker --lines 100

# View all logs
pm2 logs

# Restart processes
pm2 restart flyer-crawler-api
pm2 restart all

# Reload without downtime (cluster mode only)
pm2 reload flyer-crawler-api

# Stop processes
pm2 stop flyer-crawler-api

Health Indicators

Healthy Process:

┌─────────────────────┬────┬─────────┬─────────┬───────┬────────┬─────────┬──────────┐
│ Name                │ id │ mode    │ status  │ cpu   │ mem    │ uptime  │ restarts │
├─────────────────────┼────┼─────────┼─────────┼───────┼────────┼─────────┼──────────┤
│ flyer-crawler-api   │ 0  │ cluster │ online  │ 0.5%  │ 150MB  │ 5d      │ 0        │
│ flyer-crawler-api   │ 1  │ cluster │ online  │ 0.3%  │ 145MB  │ 5d      │ 0        │
│ flyer-crawler-worker│ 2  │ fork    │ online  │ 0.1%  │ 200MB  │ 5d      │ 0        │
└─────────────────────┴────┴─────────┴─────────┴───────┴────────┴─────────┴──────────┘

Warning Signs:

  • status: errored - Process crashed
  • High restarts count - Instability
  • High mem (>500MB for API, >1GB for workers) - Memory leak
  • Low uptime with high restarts - Repeated crashes
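
These signs can also be checked from a script. A minimal sketch using the JSON output of pm2 jlist (field names such as pm2_env.restart_time and monit.memory are assumptions based on current PM2 releases; the restart threshold of 10 is arbitrary):

# Sketch: flag errored processes and high restart counts from PM2's JSON output
su - gitea-runner -c "pm2 jlist" | jq -r '
  .[]
  | select(.pm2_env.status != "online" or .pm2_env.restart_time > 10)
  | "\(.name): status=\(.pm2_env.status) restarts=\(.pm2_env.restart_time) mem=\(.monit.memory / 1048576 | floor)MB"'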

Log File Locations

| Process | stdout | stderr |
| --- | --- | --- |
| flyer-crawler-api | /home/gitea-runner/.pm2/logs/flyer-crawler-api-out.log | ...-error.log |
| flyer-crawler-worker | /home/gitea-runner/.pm2/logs/flyer-crawler-worker-out.log | ...-error.log |

Memory Management

PM2 is configured to restart processes when they exceed memory limits:

| Process | Memory Limit | Action |
| --- | --- | --- |
| API | 500MB | Auto-restart |
| Worker | 1GB | Auto-restart |
| Analytics Worker | 1GB | Auto-restart |

Check Memory Usage:

pm2 show flyer-crawler-api | grep memory
pm2 show flyer-crawler-worker | grep memory

Restart Strategies

PM2 uses exponential backoff for restarts:

{
  max_restarts: 40,
  exp_backoff_restart_delay: 100,  // Start at 100ms, exponentially increase
  min_uptime: '10s',  // Must run 10s to be considered "started"
}

Force Restart After Repeated Failures:

pm2 delete flyer-crawler-api
pm2 start ecosystem.config.cjs --only flyer-crawler-api

Database Monitoring

Connection Pool Status

The application uses a PostgreSQL connection pool with these defaults:

| Setting | Value | Purpose |
| --- | --- | --- |
| max | 20 | Maximum concurrent connections |
| idleTimeoutMillis | 30000 | Close idle connections after 30s |
| connectionTimeoutMillis | 2000 | Fail if connection takes >2s |

Check Pool Status via API:

curl -s https://flyer-crawler.projectium.com/api/health/db-pool | jq .

# Response
{
  "success": true,
  "data": {
    "message": "Pool Status: 10 total, 8 idle, 0 waiting.",
    "totalCount": 10,
    "idleCount": 8,
    "waitingCount": 0
  }
}

Pool Health Thresholds:

| Metric | Healthy | Warning | Critical |
| --- | --- | --- | --- |
| Waiting Connections | 0-2 | 3-4 | 5+ |
| Total Connections | 1-15 | 16-19 | 20 (maxed) |
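
A minimal sketch that maps the waitingCount reported by /api/health/db-pool onto these thresholds:

# Sketch: classify pool health by waiting connections
waiting=$(curl -s https://flyer-crawler.projectium.com/api/health/db-pool | jq '.data.waitingCount')
if   [ "$waiting" -ge 5 ]; then echo "CRITICAL: $waiting waiting connections"
elif [ "$waiting" -ge 3 ]; then echo "WARNING: $waiting waiting connections"
else echo "OK: $waiting waiting connections"
fi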

Slow Query Logging

PostgreSQL is configured to log slow queries:

# /etc/postgresql/14/main/conf.d/observability.conf
log_min_duration_statement = 1000  # Log queries over 1 second

View Slow Queries:

ssh root@projectium.com
grep "duration:" /var/log/postgresql/postgresql-$(date +%Y-%m-%d).log | tail -20

Database Size Monitoring

# Connect to production database
psql -h localhost -U flyer_crawler_prod -d flyer-crawler-prod

# Database size
SELECT pg_size_pretty(pg_database_size('flyer-crawler-prod'));

# Table sizes
SELECT
  relname AS table,
  pg_size_pretty(pg_total_relation_size(relid)) AS total_size,
  pg_size_pretty(pg_relation_size(relid)) AS data_size,
  pg_size_pretty(pg_indexes_size(relid)) AS index_size
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 10;

# Check for bloat
SELECT schemaname, relname, n_dead_tup, n_live_tup,
       round(n_dead_tup * 100.0 / nullif(n_live_tup + n_dead_tup, 0), 2) as dead_pct
FROM pg_stat_user_tables
WHERE n_dead_tup > 1000
ORDER BY n_dead_tup DESC;

Disk Space Monitoring

# Check PostgreSQL data directory
du -sh /var/lib/postgresql/14/main/

# Check available disk space
df -h /var/lib/postgresql/

# Estimate growth rate
psql -c "SELECT date_trunc('day', created_at) as day, count(*)
         FROM flyer_items
         WHERE created_at > now() - interval '7 days'
         GROUP BY 1 ORDER BY 1;"

Database Health via MCP

# Query database directly
mcp__devdb__query --sql "SELECT count(*) FROM flyers WHERE created_at > now() - interval '1 day'"

# Check connection count
mcp__devdb__query --sql "SELECT count(*) FROM pg_stat_activity WHERE datname = 'flyer_crawler_dev'"

Redis Monitoring

Basic Health Check

# Via API endpoint
curl -s https://flyer-crawler.projectium.com/api/health/redis | jq .

# Direct Redis check (on server)
redis-cli ping  # Should return PONG

Memory Usage

redis-cli info memory | grep -E "used_memory_human|maxmemory_human|mem_fragmentation_ratio"

# Expected output
used_memory_human:50.00M
maxmemory_human:256.00M
mem_fragmentation_ratio:1.05

Memory Thresholds:

| Metric | Healthy | Warning | Critical |
| --- | --- | --- | --- |
| Used Memory | <70% of max | 70-85% | >85% |
| Fragmentation Ratio | 1.0-1.5 | 1.5-2.0 | >2.0 |
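
To express usage as a percentage of maxmemory, a small sketch (note that maxmemory reports 0 when no limit is configured, in which case the percentage is undefined):

# Sketch: used memory as a percentage of maxmemory
redis-cli info memory | awk -F: '
  /^used_memory:/ { used = $2 }
  /^maxmemory:/   { max = $2 }
  END { if (max > 0) printf "used: %.1f%% of maxmemory\n", used * 100 / max; else print "maxmemory not set" }'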

Cache Statistics

redis-cli info stats | grep -E "keyspace_hits|keyspace_misses|evicted_keys"

# Calculate hit rate
# Hit Rate = keyspace_hits / (keyspace_hits + keyspace_misses) * 100
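
A small sketch that performs this calculation directly from the stats output:

# Sketch: compute the cache hit rate from Redis stats
redis-cli info stats | awk -F: '
  /^keyspace_hits:/   { hits = $2 }
  /^keyspace_misses:/ { misses = $2 }
  END { if (hits + misses > 0) printf "hit rate: %.2f%%\n", hits * 100 / (hits + misses); else print "no keyspace activity yet" }'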

Cache Hit Rate Targets:

  • Excellent: >95%
  • Good: 85-95%
  • Needs attention: <85%

Queue Monitoring

BullMQ queues are stored in Redis:

# List all queues
redis-cli keys "bull:*:id"

# Check queue depths
redis-cli llen "bull:flyer-processing:wait"
redis-cli llen "bull:email-sending:wait"
redis-cli llen "bull:analytics-reporting:wait"

# Check failed jobs
redis-cli llen "bull:flyer-processing:failed"

Queue Depth Thresholds:

| Queue | Normal | Warning | Critical |
| --- | --- | --- | --- |
| flyer-processing | 0-10 | 11-50 | >50 |
| email-sending | 0-100 | 101-500 | >500 |
| analytics-reporting | 0-5 | 6-20 | >20 |
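
A minimal sketch that compares each queue's depth against the warning thresholds above (queue names and limits taken from the table):

# Sketch: warn when a queue exceeds its warning threshold
check_queue() {
  depth=$(redis-cli llen "bull:$1:wait")
  [ "$depth" -gt "$2" ] && echo "WARNING: $1 has $depth waiting jobs (normal is 0-$2)"
}
check_queue flyer-processing 10
check_queue email-sending 100
check_queue analytics-reporting 5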

Bull Board UI

Access the job queue dashboard:

  • Production: https://flyer-crawler.projectium.com/api/admin/jobs (requires admin auth)
  • Test: https://flyer-crawler-test.projectium.com/api/admin/jobs
  • Dev: http://localhost:3001/api/admin/jobs

Features:

  • View all queues and job counts
  • Inspect job data and errors
  • Retry failed jobs
  • Clean completed jobs

Redis Database Allocation

| Database | Purpose |
| --- | --- |
| 0 | BullMQ production queues |
| 1 | BullMQ test queues |
| 15 | Bugsink sync state |
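
The redis-cli examples above operate on database 0 by default. To inspect the test queues or sync state, select the database with -n, for example:

# Test queue depth (database 1)
redis-cli -n 1 llen "bull:flyer-processing:wait"

# Bugsink sync state keys (database 15)
redis-cli -n 15 keys "*"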

Production Alerts and On-Call

Critical Monitoring Targets

| Service | Check | Interval | Alert Threshold |
| --- | --- | --- | --- |
| API Server | /api/health/ready | 1 min | 2 consecutive failures |
| Database Pool | waiting count | 1 min | >5 waiting |
| Redis | Memory usage | 5 min | >85% of maxmemory |
| Disk Space | /var/log | 15 min | <10GB free |
| Worker | Queue depth | 5 min | >50 jobs waiting |
| Error Rate | Bugsink issue count | 15 min | >10 new issues/hour |
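
Where no external monitoring tool is in place yet, a minimal cron-able sketch of the "2 consecutive failures" rule for the API check (the state file path and the echo-based alert are placeholders):

# Sketch: alert only after two consecutive readiness failures
STATE=/var/tmp/flyer-crawler-ready.fails
if curl -sf --max-time 10 https://flyer-crawler.projectium.com/api/health/ready > /dev/null; then
  rm -f "$STATE"
else
  fails=$(( $(cat "$STATE" 2>/dev/null || echo 0) + 1 ))
  echo "$fails" > "$STATE"
  [ "$fails" -ge 2 ] && echo "ALERT: /api/health/ready failing ($fails consecutive checks)"
fi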

Alert Channels

Configure alerts in your monitoring tool (UptimeRobot, Datadog, etc.):

  1. Slack channel: #flyer-crawler-alerts
  2. Email: On-call rotation email
  3. PagerDuty: Critical issues only

On-Call Response Procedures

P1 - Critical (Site Down):

  1. Acknowledge alert within 5 minutes
  2. Check /api/health/ready - identify failing service
  3. Check PM2 status: pm2 list
  4. Check recent deploys: git log -5 --oneline
  5. If database: check pool, restart if needed
  6. If Redis: check memory, flush if critical
  7. If application: restart PM2 processes
  8. Document in incident channel

P2 - High (Degraded Service):

  1. Acknowledge within 15 minutes
  2. Review Bugsink for error patterns
  3. Check system resources (CPU, memory, disk)
  4. Identify root cause
  5. Plan remediation
  6. Create Gitea issue if not auto-created

P3 - Medium (Non-Critical):

  1. Acknowledge within 1 hour
  2. Review during business hours
  3. Create Gitea issue for tracking

Quick Diagnostic Commands

Note: The user executes these commands on the server. Claude Code provides commands but cannot run them directly.

# Service status checks
systemctl status pm2-gitea-runner --no-pager
systemctl status logstash --no-pager
systemctl status redis --no-pager
systemctl status postgresql --no-pager

# PM2 processes (run as gitea-runner)
su - gitea-runner -c "pm2 list"

# Disk space
df -h / /var

# Memory
free -h

# Recent errors
journalctl -p err -n 20 --no-pager

Runbook Quick Reference

| Symptom | First Action | If That Fails |
| --- | --- | --- |
| 503 errors | Restart PM2 | Check database, Redis |
| Slow responses | Check DB pool | Review slow query log |
| High error rate | Check Bugsink | Review recent deploys |
| Queue backlog | Restart worker | Scale workers |
| Out of memory | Restart process | Increase PM2 limit |
| Disk full | Clean old logs | Expand volume |
| Redis OOM | Flush cache keys | Increase maxmemory |

Post-Incident Review

After any P1/P2 incident:

  1. Write incident report within 24 hours
  2. Identify root cause
  3. Document timeline of events
  4. List action items to prevent recurrence
  5. Schedule review meeting if needed
  6. Update runbooks if new procedures discovered