# Monitoring Guide
This guide covers all aspects of monitoring the Flyer Crawler application across development, test, and production environments.
## Table of Contents

- [Health Checks](#health-checks)
- [Bugsink Error Tracking](#bugsink-error-tracking)
- [Logstash Log Aggregation](#logstash-log-aggregation)
- [PM2 Process Monitoring](#pm2-process-monitoring)
- [Database Monitoring](#database-monitoring)
- [Redis Monitoring](#redis-monitoring)
- [Production Alerts and On-Call](#production-alerts-and-on-call)
## Health Checks
The application exposes health check endpoints at `/api/health/*`, implementing ADR-020.
### Endpoint Reference
| Endpoint | Purpose | Use Case |
|---|---|---|
| `/api/health/ping` | Simple connectivity | Quick "is it running?" check |
| `/api/health/live` | Liveness probe | Container orchestration restart trigger |
| `/api/health/ready` | Readiness probe | Load balancer traffic routing |
| `/api/health/startup` | Startup probe | Initial container readiness |
| `/api/health/db-schema` | Schema verification | Deployment validation |
| `/api/health/db-pool` | Connection pool status | Performance diagnostics |
| `/api/health/redis` | Redis connectivity | Cache/queue health |
| `/api/health/storage` | File storage access | Upload capability |
| `/api/health/time` | Server time sync | Time-sensitive operations |
### Liveness Probe (`/api/health/live`)
Returns 200 OK if the Node.js process is running. No external dependencies.
```bash
# Check liveness
curl -s https://flyer-crawler.projectium.com/api/health/live | jq .

# Expected response
{
  "success": true,
  "data": {
    "status": "ok",
    "timestamp": "2026-01-22T10:00:00.000Z"
  }
}
```
**Usage**: If this endpoint fails, restart the application immediately.
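
A minimal watchdog sketch for automating that rule, not part of the deployed configuration: it could run from cron under the `gitea-runner` user, and the two-retry policy, curl timeout, log path, and automatic `pm2 restart` are assumptions to adapt before use.

```bash
#!/usr/bin/env bash
# liveness-watchdog.sh - hypothetical cron helper; assumes pm2 is on PATH for this user.
set -euo pipefail

URL="https://flyer-crawler.projectium.com/api/health/live"

# Treat the probe as failed only after two consecutive misses (assumed policy).
if ! curl -sf --max-time 5 "$URL" > /dev/null; then
  sleep 10
  if ! curl -sf --max-time 5 "$URL" > /dev/null; then
    echo "$(date -Is) liveness failed twice, restarting API" >> /var/log/flyer-crawler-watchdog.log
    pm2 restart flyer-crawler-api
  fi
fi
```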
### Readiness Probe (`/api/health/ready`)
Comprehensive check of all critical dependencies: database, Redis, and storage.
```bash
# Check readiness
curl -s https://flyer-crawler.projectium.com/api/health/ready | jq .

# Expected healthy response (200)
{
  "success": true,
  "data": {
    "status": "healthy",
    "timestamp": "2026-01-22T10:00:00.000Z",
    "uptime": 3600.5,
    "services": {
      "database": {
        "status": "healthy",
        "latency": 5,
        "details": {
          "totalConnections": 10,
          "idleConnections": 8,
          "waitingConnections": 0
        }
      },
      "redis": {
        "status": "healthy",
        "latency": 2
      },
      "storage": {
        "status": "healthy",
        "latency": 1,
        "details": {
          "path": "/var/www/flyer-crawler.projectium.com/flyer-images"
        }
      }
    }
  }
}
```
**Status Values:**

| Status | Meaning | Action |
|---|---|---|
| `healthy` | All critical services operational | None required |
| `degraded` | Non-critical issues (e.g., high connection wait) | Monitor closely |
| `unhealthy` | Critical service unavailable (returns 503) | Remove from load balancer |
### Database Health Thresholds
| Metric | Healthy | Degraded | Unhealthy |
|---|---|---|---|
| Query response | `SELECT 1` succeeds | N/A | Connection fails |
| Waiting connections | 0-3 | 4+ | N/A |
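
For cron-driven or external monitoring, a small wrapper can translate the readiness status into an exit code. This is a sketch, not shipped tooling; the mapping of `healthy`/`degraded`/`unhealthy` to exit codes 0/1/2 is an assumed convention.

```bash
#!/usr/bin/env bash
# readiness-check.sh - hypothetical helper; exits 0 healthy, 1 degraded, 2 unhealthy/unreachable.
set -uo pipefail

STATUS=$(curl -sf --max-time 10 \
  https://flyer-crawler.projectium.com/api/health/ready | jq -r '.data.status')

case "$STATUS" in
  healthy)   exit 0 ;;
  degraded)  echo "WARN: readiness degraded" >&2; exit 1 ;;
  *)         echo "CRIT: readiness unhealthy or endpoint unreachable" >&2; exit 2 ;;
esac
```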
### Verifying Services from CLI
**Production:**

```bash
# Quick health check
curl -s https://flyer-crawler.projectium.com/api/health/ready | jq '.data.status'

# Database pool status
curl -s https://flyer-crawler.projectium.com/api/health/db-pool | jq .

# Redis health
curl -s https://flyer-crawler.projectium.com/api/health/redis | jq .
```
**Test Environment:**

```bash
# Test environment runs on port 3002
curl -s https://flyer-crawler-test.projectium.com/api/health/ready | jq .
```
**Dev Container:**

```bash
# From inside the container
curl -s http://localhost:3001/api/health/ready | jq .

# From Windows host (via port mapping)
curl -s http://localhost:3001/api/health/ready | jq .
```
### Admin System Check UI
The admin dashboard at /admin includes a System Check component that runs all health checks with a visual interface:
1. Navigate to `https://flyer-crawler.projectium.com/admin`
2. Log in with admin credentials
3. View the "System Check" section
4. Click "Re-run Checks" to verify all services
Checks include:
- Backend Server Connection
- PM2 Process Status
- Database Connection Pool
- Redis Connection
- Database Schema
- Default Admin User
- Assets Storage Directory
- Gemini API Key
## Bugsink Error Tracking
Bugsink is our self-hosted, Sentry-compatible error tracking system (ADR-015).
### Access Points
| Environment | URL | Purpose |
|---|---|---|
| Production | https://bugsink.projectium.com | Production and test errors |
| Dev Container | https://localhost:8443 | Local development errors |
### Credentials
**Production Bugsink:**

- Credentials stored in the password manager
- Admin account created during initial deployment

**Dev Container Bugsink:**

- Email: `admin@localhost`
- Password: `admin`
### Projects
| Project ID | Name | Environment | Error Source |
|---|---|---|---|
| 1 | flyer-crawler-backend | Production | Backend Node.js errors |
| 2 | flyer-crawler-frontend | Production | Frontend JavaScript errors |
| 3 | flyer-crawler-backend-test | Test | Test environment backend |
| 4 | flyer-crawler-frontend-test | Test | Test environment frontend |
| 5 | flyer-crawler-infrastructure | Production | PostgreSQL, Redis, NGINX errors |
| 6 | flyer-crawler-test-infrastructure | Test | Test infra errors |
Dev Container Projects (localhost:8000):
- Project 1: Backend (Dev)
- Project 2: Frontend (Dev)
### Accessing Errors via Web UI
1. Navigate to the Bugsink URL
2. Log in with credentials
3. Select a project from the sidebar
4. Click on an issue to view details
Issue Details Include:
- Exception type and message
- Full stack trace
- Request context (URL, method, headers)
- User context (if authenticated)
- Occurrence statistics (first seen, last seen, count)
- Release/version information
### Accessing Errors via MCP
Claude Code and other AI tools can access Bugsink via MCP servers.
**Available MCP Tools:**

```bash
# List all projects
mcp__bugsink__list_projects

# List unresolved issues for a project
mcp__bugsink__list_issues --project_id 1 --status unresolved

# Get issue details
mcp__bugsink__get_issue --issue_id <uuid>

# Get stacktrace (pre-rendered Markdown)
mcp__bugsink__get_stacktrace --event_id <uuid>

# List events for an issue
mcp__bugsink__list_events --issue_id <uuid>
```
**MCP Server Configuration:**

Production (in `~/.claude/settings.json`):

```json
{
  "bugsink": {
    "command": "node",
    "args": ["d:\\gitea\\bugsink-mcp\\dist\\index.js"],
    "env": {
      "BUGSINK_URL": "https://bugsink.projectium.com",
      "BUGSINK_TOKEN": "<token>"
    }
  }
}
```
Dev Container (in `.mcp.json`):

```json
{
  "localerrors": {
    "command": "node",
    "args": ["d:\\gitea\\bugsink-mcp\\dist\\index.js"],
    "env": {
      "BUGSINK_URL": "http://127.0.0.1:8000",
      "BUGSINK_TOKEN": "<token>"
    }
  }
}
```
### Creating API Tokens
Bugsink 2.0.11 does not provide a UI for creating API tokens; create one via a Django management command.
**Production** (user executes on server):

```bash
cd /opt/bugsink && bugsink-manage create_auth_token
```
**Dev Container:**

```bash
MSYS_NO_PATHCONV=1 podman exec -e DATABASE_URL=postgresql://bugsink:bugsink_dev_password@postgres:5432/bugsink -e SECRET_KEY=dev-bugsink-secret-key-minimum-50-characters-for-security flyer-crawler-dev sh -c 'cd /opt/bugsink/conf && DJANGO_SETTINGS_MODULE=bugsink_conf PYTHONPATH=/opt/bugsink/conf:/opt/bugsink/lib/python3.10/site-packages /opt/bugsink/bin/python -m django create_auth_token'
```
The command outputs a 40-character hex token.
### Interpreting Errors
**Error Anatomy:**

```
TypeError: Cannot read properties of undefined (reading 'map')
├── Exception Type: TypeError
├── Message: Cannot read properties of undefined (reading 'map')
├── Where: FlyerItemsList.tsx:45:23
├── When: 2026-01-22T10:30:00.000Z
├── Count: 12 occurrences
└── Context:
    ├── URL: GET /api/flyers/123/items
    ├── User: user@example.com
    └── Release: v0.12.5
```
**Common Error Patterns:**

| Pattern | Likely Cause | Investigation |
|---|---|---|
| `TypeError: ... undefined` | Missing null check, API returned unexpected shape | Check API response, add defensive coding |
| `DatabaseError: Connection timeout` | Pool exhaustion, slow queries | Check `/api/health/db-pool`, review slow query log |
| `RedisConnectionError` | Redis unavailable | Check Redis service, network connectivity |
| `ValidationError: ...` | Invalid input, schema mismatch | Review request payload, update validation |
| `NotFoundError: ...` | Missing resource | Verify resource exists, check ID format |
### Error Triage Workflow
1. Review new issues daily in Bugsink
2. Categorize by severity:
   - **Critical**: Data corruption, security, payment failures
   - **High**: Core feature broken for many users
   - **Medium**: Feature degraded, workaround available
   - **Low**: Minor UX issues, cosmetic bugs
3. Check occurrence count - frequent errors need urgent attention
4. Review stack trace - identify root cause
5. Check recent deployments - did a release introduce this?
6. Create Gitea issue if not auto-synced
### Bugsink-to-Gitea Sync
The test environment automatically syncs Bugsink issues to Gitea (see `docs/BUGSINK-SYNC.md`).

**Sync Workflow:**

1. Runs every 15 minutes on the test server
2. Fetches unresolved issues from all Bugsink projects
3. Creates Gitea issues with appropriate labels
4. Marks synced issues as resolved in Bugsink
**Manual Sync:**

```bash
# Trigger sync via API (test environment only)
curl -X POST https://flyer-crawler-test.projectium.com/api/admin/bugsink/sync \
  -H "Authorization: Bearer <admin_jwt>"
```
## Logstash Log Aggregation
Logstash aggregates logs from multiple sources and forwards errors to Bugsink (ADR-050).
### Architecture
```
Log Sources            Logstash            Outputs
┌──────────────┐      ┌─────────────┐      ┌─────────────┐
│ PostgreSQL   │──────│             │──────│ Bugsink     │
│ PM2 Workers  │──────│   Filter    │──────│ (errors)    │
│ Redis        │──────│   & Route   │──────│             │
│ NGINX        │──────│             │──────│ File Logs   │
└──────────────┘      └─────────────┘      │ (all logs)  │
                                           └─────────────┘
```
### Configuration Files
| Path | Purpose |
|---|---|
| `/etc/logstash/conf.d/bugsink.conf` | Main pipeline configuration |
| `/etc/postgresql/14/main/conf.d/observability.conf` | PostgreSQL logging settings |
| `/var/log/logstash/` | Logstash file outputs |
| `/var/lib/logstash/sincedb_*` | File position tracking |
### Log Sources
| Source | Path | Contents |
|---|---|---|
| PostgreSQL | `/var/log/postgresql/*.log` | Function logs, slow queries, errors |
| PM2 Workers | `/home/gitea-runner/.pm2/logs/flyer-crawler-*.log` | Worker stdout/stderr |
| Redis | `/var/log/redis/redis-server.log` | Connection errors, memory warnings |
| NGINX | `/var/log/nginx/access.log`, `error.log` | HTTP requests, upstream errors |
### Pipeline Status
**Check Logstash Service** (user executes on server):

```bash
# Service status
systemctl status logstash

# Recent logs
journalctl -u logstash -n 50 --no-pager

# Pipeline statistics
curl -s http://localhost:9600/_node/stats/pipelines?pretty | jq '.pipelines.main.events'

# Events processed today
curl -s http://localhost:9600/_node/stats/pipelines?pretty | jq '{
  in: .pipelines.main.events.in,
  out: .pipelines.main.events.out,
  filtered: .pipelines.main.events.filtered
}'
```
**Check Filter Performance:**

```bash
# Grok pattern success/failure rates
curl -s http://localhost:9600/_node/stats/pipelines?pretty | \
  jq '.pipelines.main.plugins.filters[] | select(.name == "grok") | {name, events_in: .events.in, events_out: .events.out, failures}'
```
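
A hedged sketch for spot-checking grok health from cron: it sums the counters from the stats endpoint above, but the 5% failure threshold and the decision to aggregate across all grok filters are illustrative choices, not part of the shipped pipeline.

```bash
#!/usr/bin/env bash
# grok-failure-check.sh - hypothetical helper; warns when grok failures exceed ~5% of events.
set -euo pipefail

STATS=$(curl -s "http://localhost:9600/_node/stats/pipelines?pretty")

# Aggregate events in and match failures across all grok filters in the main pipeline.
IN=$(echo "$STATS" | jq '[.pipelines.main.plugins.filters[] | select(.name == "grok") | .events.in] | add // 0')
FAIL=$(echo "$STATS" | jq '[.pipelines.main.plugins.filters[] | select(.name == "grok") | .failures] | add // 0')

if [ "$IN" -gt 0 ] && [ $((FAIL * 100 / IN)) -ge 5 ]; then
  echo "WARN: grok failures ${FAIL}/${IN} events" >&2
  exit 1
fi
```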
### Viewing Aggregated Logs

```bash
# PM2 worker logs (all workers combined)
tail -f /var/log/logstash/pm2-workers-$(date +%Y-%m-%d).log

# Redis operational logs
tail -f /var/log/logstash/redis-operational-$(date +%Y-%m-%d).log

# NGINX access logs (parsed)
tail -f /var/log/logstash/nginx-access-$(date +%Y-%m-%d).log

# PostgreSQL function logs
tail -f /var/log/postgresql/postgresql-$(date +%Y-%m-%d).log
```
### Troubleshooting Logstash
| Issue | Diagnostic | Solution |
|---|---|---|
| No events processed | `systemctl status logstash` | Start/restart service |
| Config syntax error | Test config command | Fix config file |
| Grok failures | Check stats endpoint | Update grok patterns |
| Wrong Bugsink project | Check environment tags | Verify tag routing |
| Permission denied | `groups logstash` | Add to postgres, adm groups |
| PM2 logs not captured | Check file paths | Verify log file existence |
| High disk usage | Check log rotation | Configure logrotate |
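
For the "configure logrotate" remedy, a policy along these lines can cap disk usage from the daily file outputs. This is a sketch under assumptions (14-day retention, daily rotation, the `/etc/logrotate.d/logstash-outputs` filename); check it against whatever rotation is already in place.

```bash
# Hypothetical logrotate policy for Logstash file outputs (adjust retention to taste).
sudo tee /etc/logrotate.d/logstash-outputs > /dev/null <<'EOF'
/var/log/logstash/*.log {
    daily
    rotate 14
    compress
    delaycompress
    missingok
    notifempty
}
EOF
```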
**Test Configuration:**

```bash
/usr/share/logstash/bin/logstash --config.test_and_exit -f /etc/logstash/conf.d/bugsink.conf
```
**Restart After Config Change:**

```bash
systemctl restart logstash
journalctl -u logstash -f  # Watch for startup errors
```
## PM2 Process Monitoring
PM2 manages the Node.js application processes in production.
### Process Overview
**Production Processes** (`ecosystem.config.cjs`):

| Process Name | Script | Purpose | Instances |
|---|---|---|---|
| `flyer-crawler-api` | `server.ts` | Express API server | Cluster (max CPUs) |
| `flyer-crawler-worker` | `worker.ts` | BullMQ job processor | 1 |
| `flyer-crawler-analytics-worker` | `worker.ts` | Analytics jobs | 1 |
**Test Processes** (`ecosystem-test.config.cjs`):

| Process Name | Script | Port | Instances |
|---|---|---|---|
| `flyer-crawler-api-test` | `server.ts` | 3002 | 1 (fork mode) |
| `flyer-crawler-worker-test` | `worker.ts` | N/A | 1 |
| `flyer-crawler-analytics-worker-test` | `worker.ts` | N/A | 1 |
### Basic Commands
> **Note**: These commands are for the user to execute on the server. Claude Code provides commands but cannot run them directly.
```bash
# Switch to gitea-runner user (PM2 runs under this user)
su - gitea-runner

# List all processes
pm2 list

# Process details
pm2 show flyer-crawler-api

# Monitor in real-time
pm2 monit

# View logs
pm2 logs flyer-crawler-api
pm2 logs flyer-crawler-worker --lines 100

# View all logs
pm2 logs

# Restart processes
pm2 restart flyer-crawler-api
pm2 restart all

# Reload without downtime (cluster mode only)
pm2 reload flyer-crawler-api

# Stop processes
pm2 stop flyer-crawler-api
```
### Health Indicators
**Healthy Process:**

```
┌──────────────────────┬────┬─────────┬─────────┬───────┬────────┬─────────┬──────────┐
│ Name                 │ id │ mode    │ status  │ cpu   │ mem    │ uptime  │ restarts │
├──────────────────────┼────┼─────────┼─────────┼───────┼────────┼─────────┼──────────┤
│ flyer-crawler-api    │ 0  │ cluster │ online  │ 0.5%  │ 150MB  │ 5d      │ 0        │
│ flyer-crawler-api    │ 1  │ cluster │ online  │ 0.3%  │ 145MB  │ 5d      │ 0        │
│ flyer-crawler-worker │ 2  │ fork    │ online  │ 0.1%  │ 200MB  │ 5d      │ 0        │
└──────────────────────┴────┴─────────┴─────────┴───────┴────────┴─────────┴──────────┘
```
**Warning Signs:**

- `status: errored` - Process crashed
- High `restarts` count - Instability
- High `mem` (>500MB for API, >1GB for workers) - Memory leak
- Low `uptime` with high restarts - Repeated crashes
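
A sketch for catching these signs non-interactively (for example from cron as `gitea-runner`): `pm2 jlist` and the fields it exposes are standard PM2, but the 500 MB and 10-restart thresholds are assumptions, not values taken from the ecosystem config.

```bash
#!/usr/bin/env bash
# pm2-health-check.sh - hypothetical helper; flags errored, restart-looping, or bloated processes.
set -euo pipefail

pm2 jlist | jq -r '
  .[] |
  select(
    .pm2_env.status != "online"
    or .pm2_env.restart_time > 10
    or .monit.memory > 500 * 1024 * 1024
  ) |
  "\(.name): status=\(.pm2_env.status) restarts=\(.pm2_env.restart_time) mem=\(.monit.memory / 1048576 | floor)MB"
'
```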
### Log File Locations
| Process | stdout | stderr |
|---|---|---|
| `flyer-crawler-api` | `/home/gitea-runner/.pm2/logs/flyer-crawler-api-out.log` | `...-error.log` |
| `flyer-crawler-worker` | `/home/gitea-runner/.pm2/logs/flyer-crawler-worker-out.log` | `...-error.log` |
### Memory Management
PM2 is configured to restart processes when they exceed memory limits:
| Process | Memory Limit | Action |
|---|---|---|
| API | 500MB | Auto-restart |
| Worker | 1GB | Auto-restart |
| Analytics Worker | 1GB | Auto-restart |
**Check Memory Usage:**

```bash
pm2 show flyer-crawler-api | grep memory
pm2 show flyer-crawler-worker | grep memory
```
### Restart Strategies
PM2 uses exponential backoff for restarts:

```js
{
  max_restarts: 40,
  exp_backoff_restart_delay: 100, // Start at 100ms, exponentially increase
  min_uptime: '10s', // Must run 10s to be considered "started"
}
```
**Force Restart After Repeated Failures:**

```bash
pm2 delete flyer-crawler-api
pm2 start ecosystem.config.cjs --only flyer-crawler-api
```
## Database Monitoring
### Connection Pool Status
The application uses a PostgreSQL connection pool with these defaults:
| Setting | Value | Purpose |
|---|---|---|
| `max` | 20 | Maximum concurrent connections |
| `idleTimeoutMillis` | 30000 | Close idle connections after 30s |
| `connectionTimeoutMillis` | 2000 | Fail if connection takes >2s |
**Check Pool Status via API:**

```bash
curl -s https://flyer-crawler.projectium.com/api/health/db-pool | jq .

# Response
{
  "success": true,
  "data": {
    "message": "Pool Status: 10 total, 8 idle, 0 waiting.",
    "totalCount": 10,
    "idleCount": 8,
    "waitingCount": 0
  }
}
```
Pool Health Thresholds:
| Metric | Healthy | Warning | Critical |
|---|---|---|---|
| Waiting Connections | 0-2 | 3-4 | 5+ |
| Total Connections | 1-15 | 16-19 | 20 (maxed) |
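
A sketch for alerting on pool pressure from an external monitor: the field names come from the `/api/health/db-pool` response above, the thresholds mirror the table, and the exit-code convention is assumed.

```bash
#!/usr/bin/env bash
# db-pool-check.sh - hypothetical helper; warns at 3+ waiting connections, criticals at 5+.
set -euo pipefail

WAITING=$(curl -sf --max-time 10 \
  https://flyer-crawler.projectium.com/api/health/db-pool | jq '.data.waitingCount')

if [ "$WAITING" -ge 5 ]; then
  echo "CRIT: ${WAITING} connections waiting on the pool" >&2; exit 2
elif [ "$WAITING" -ge 3 ]; then
  echo "WARN: ${WAITING} connections waiting on the pool" >&2; exit 1
fi
```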
### Slow Query Logging
PostgreSQL is configured to log slow queries:
```ini
# /etc/postgresql/14/main/conf.d/observability.conf
log_min_duration_statement = 1000  # Log queries over 1 second
```
**View Slow Queries:**

```bash
ssh root@projectium.com
grep "duration:" /var/log/postgresql/postgresql-$(date +%Y-%m-%d).log | tail -20
```
### Database Size Monitoring
```bash
# Connect to production database
psql -h localhost -U flyer_crawler_prod -d flyer-crawler-prod
```

```sql
-- Database size
SELECT pg_size_pretty(pg_database_size('flyer-crawler-prod'));

-- Table sizes
SELECT
  relname AS "table",
  pg_size_pretty(pg_total_relation_size(relid)) AS total_size,
  pg_size_pretty(pg_relation_size(relid)) AS data_size,
  pg_size_pretty(pg_indexes_size(relid)) AS index_size
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 10;

-- Check for bloat
SELECT schemaname, relname, n_dead_tup, n_live_tup,
       round(n_dead_tup * 100.0 / nullif(n_live_tup + n_dead_tup, 0), 2) AS dead_pct
FROM pg_stat_user_tables
WHERE n_dead_tup > 1000
ORDER BY n_dead_tup DESC;
```
### Disk Space Monitoring
```bash
# Check PostgreSQL data directory
du -sh /var/lib/postgresql/14/main/

# Check available disk space
df -h /var/lib/postgresql/

# Estimate growth rate
psql -c "SELECT date_trunc('day', created_at) as day, count(*)
         FROM flyer_items
         WHERE created_at > now() - interval '7 days'
         GROUP BY 1 ORDER BY 1;"
```
### Database Health via MCP
```bash
# Query database directly
mcp__devdb__query --sql "SELECT count(*) FROM flyers WHERE created_at > now() - interval '1 day'"

# Check connection count
mcp__devdb__query --sql "SELECT count(*) FROM pg_stat_activity WHERE datname = 'flyer_crawler_dev'"
```
## Redis Monitoring
### Basic Health Check
```bash
# Via API endpoint
curl -s https://flyer-crawler.projectium.com/api/health/redis | jq .

# Direct Redis check (on server)
redis-cli ping  # Should return PONG
```
### Memory Usage
```bash
redis-cli info memory | grep -E "used_memory_human|maxmemory_human|mem_fragmentation_ratio"

# Expected output
used_memory_human:50.00M
maxmemory_human:256.00M
mem_fragmentation_ratio:1.05
```
Memory Thresholds:
| Metric | Healthy | Warning | Critical |
|---|---|---|---|
| Used Memory | <70% of max | 70-85% | >85% |
| Fragmentation Ratio | 1.0-1.5 | 1.5-2.0 | >2.0 |
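
A sketch that turns the raw `used_memory`/`maxmemory` counters into the percentage used by the thresholds above; the 85% cutoff matches the table, while the script name and exit code are assumed.

```bash
#!/usr/bin/env bash
# redis-memory-check.sh - hypothetical helper; exits 2 when used memory exceeds 85% of maxmemory.
set -euo pipefail

USED=$(redis-cli info memory | awk -F: '/^used_memory:/ {print $2}' | tr -d '\r')
MAX=$(redis-cli info memory | awk -F: '/^maxmemory:/ {print $2}' | tr -d '\r')

if [ "$MAX" -gt 0 ]; then
  PCT=$((USED * 100 / MAX))
  echo "Redis memory: ${PCT}% of maxmemory"
  if [ "$PCT" -gt 85 ]; then
    echo "CRIT: Redis memory above 85% of maxmemory" >&2
    exit 2
  fi
fi
```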
### Cache Statistics
```bash
redis-cli info stats | grep -E "keyspace_hits|keyspace_misses|evicted_keys"

# Calculate hit rate
# Hit Rate = keyspace_hits / (keyspace_hits + keyspace_misses) * 100
```
Cache Hit Rate Targets:
- Excellent: >95%
- Good: 85-95%
- Needs attention: <85%
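
A small sketch that applies the hit-rate formula above directly from `INFO stats`; the counter names are standard Redis, the rest is illustrative.

```bash
# Compute the cache hit rate from Redis keyspace counters (sketch).
HITS=$(redis-cli info stats | awk -F: '/^keyspace_hits:/ {print $2}' | tr -d '\r')
MISSES=$(redis-cli info stats | awk -F: '/^keyspace_misses:/ {print $2}' | tr -d '\r')

# Guard against division by zero on a cold cache.
awk -v h="$HITS" -v m="$MISSES" 'BEGIN {
  total = h + m
  if (total > 0) printf "Hit rate: %.1f%%\n", (h / total) * 100
  else print "No keyspace lookups yet"
}'
```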
### Queue Monitoring
BullMQ queues are stored in Redis:
```bash
# List all queues
redis-cli keys "bull:*:id"

# Check queue depths
redis-cli llen "bull:flyer-processing:wait"
redis-cli llen "bull:email-sending:wait"
redis-cli llen "bull:analytics-reporting:wait"

# Check failed jobs (BullMQ stores failed jobs in a sorted set, so use ZCARD)
redis-cli zcard "bull:flyer-processing:failed"
```
Queue Depth Thresholds:
| Queue | Normal | Warning | Critical |
|---|---|---|---|
| flyer-processing | 0-10 | 11-50 | >50 |
| email-sending | 0-100 | 101-500 | >500 |
| analytics-reporting | 0-5 | 6-20 | >20 |
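
A sketch for checking all three queues against the thresholds in one pass: the queue keys come from the commands above and the warning/critical pairs mirror the table, but the reporting format and script name are assumed.

```bash
#!/usr/bin/env bash
# queue-depth-check.sh - hypothetical helper; prints OK/WARN/CRIT per queue.
set -euo pipefail

# queue:warn:crit triples taken from the thresholds table above.
for spec in flyer-processing:10:50 email-sending:100:500 analytics-reporting:5:20; do
  IFS=: read -r queue warn crit <<< "$spec"
  depth=$(redis-cli llen "bull:${queue}:wait")
  if [ "$depth" -gt "$crit" ]; then
    echo "CRIT: ${queue} has ${depth} waiting jobs"
  elif [ "$depth" -gt "$warn" ]; then
    echo "WARN: ${queue} has ${depth} waiting jobs"
  else
    echo "OK: ${queue} has ${depth} waiting jobs"
  fi
done
```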
### Bull Board UI
Access the job queue dashboard:
- Production: `https://flyer-crawler.projectium.com/api/admin/jobs` (requires admin auth)
- Test: `https://flyer-crawler-test.projectium.com/api/admin/jobs`
- Dev: `http://localhost:3001/api/admin/jobs`
Features:
- View all queues and job counts
- Inspect job data and errors
- Retry failed jobs
- Clean completed jobs
### Redis Database Allocation
| Database | Purpose |
|---|---|
| 0 | BullMQ production queues |
| 1 | BullMQ test queues |
| 15 | Bugsink sync state |
## Production Alerts and On-Call
### Critical Monitoring Targets
| Service | Check | Interval | Alert Threshold |
|---|---|---|---|
| API Server | `/api/health/ready` | 1 min | 2 consecutive failures |
| Database | Pool waiting count | 1 min | >5 waiting |
| Redis | Memory usage | 5 min | >85% of maxmemory |
| Disk Space | `/var/log` | 15 min | <10GB free |
| Worker | Queue depth | 5 min | >50 jobs waiting |
| Error Rate | Bugsink issue count | 15 min | >10 new issues/hour |
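
Most of these targets are covered by the sketches earlier in this guide; for the disk-space target, a minimal check might look like the following (the `/var/log` path and 10GB threshold come from the table, everything else is assumed).

```bash
#!/usr/bin/env bash
# disk-space-check.sh - hypothetical helper; alerts when /var/log has under 10GB free.
set -euo pipefail

# Free space in GB on the filesystem holding /var/log.
FREE_GB=$(df -BG --output=avail /var/log | tail -1 | tr -dc '0-9')

if [ "$FREE_GB" -lt 10 ]; then
  echo "CRIT: only ${FREE_GB}GB free on /var/log" >&2
  exit 2
fi
```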
### Alert Channels

Configure alerts in your monitoring tool (UptimeRobot, Datadog, etc.):

- Slack channel: `#flyer-crawler-alerts`
- Email: On-call rotation email
- PagerDuty: Critical issues only
### On-Call Response Procedures
**P1 - Critical (Site Down):**

1. Acknowledge alert within 5 minutes
2. Check `/api/health/ready` - identify failing service
3. Check PM2 status: `pm2 list`
4. Check recent deploys: `git log -5 --oneline`
5. If database: check pool, restart if needed
6. If Redis: check memory, flush if critical
7. If application: restart PM2 processes
8. Document in incident channel
**P2 - High (Degraded Service):**

1. Acknowledge within 15 minutes
2. Review Bugsink for error patterns
3. Check system resources (CPU, memory, disk)
4. Identify root cause
5. Plan remediation
6. Create Gitea issue if not auto-created
**P3 - Medium (Non-Critical):**

1. Acknowledge within 1 hour
2. Review during business hours
3. Create Gitea issue for tracking
### Quick Diagnostic Commands
> **Note**: User executes these commands on the server. Claude Code provides commands but cannot run them directly.
```bash
# Service status checks
systemctl status pm2-gitea-runner --no-pager
systemctl status logstash --no-pager
systemctl status redis --no-pager
systemctl status postgresql --no-pager

# PM2 processes (run as gitea-runner)
su - gitea-runner -c "pm2 list"

# Disk space
df -h / /var

# Memory
free -h

# Recent errors
journalctl -p err -n 20 --no-pager
```
### Runbook Quick Reference
| Symptom | First Action | If That Fails |
|---|---|---|
| 503 errors | Restart PM2 | Check database, Redis |
| Slow responses | Check DB pool | Review slow query log |
| High error rate | Check Bugsink | Review recent deploys |
| Queue backlog | Restart worker | Scale workers |
| Out of memory | Restart process | Increase PM2 limit |
| Disk full | Clean old logs | Expand volume |
| Redis OOM | Flush cache keys | Increase maxmemory |
### Post-Incident Review
After any P1/P2 incident:
- Write incident report within 24 hours
- Identify root cause
- Document timeline of events
- List action items to prevent recurrence
- Schedule review meeting if needed
- Update runbooks if new procedures are discovered