# Monitoring Guide

This guide covers all aspects of monitoring the Flyer Crawler application across development, test, and production environments.

## Table of Contents

1. [Health Checks](#health-checks)
2. [Bugsink Error Tracking](#bugsink-error-tracking)
3. [Logstash Log Aggregation](#logstash-log-aggregation)
4. [PM2 Process Monitoring](#pm2-process-monitoring)
5. [Database Monitoring](#database-monitoring)
6. [Redis Monitoring](#redis-monitoring)
7. [Production Alerts and On-Call](#production-alerts-and-on-call)

---

## Health Checks

The application exposes health check endpoints at `/api/health/*`, implementing ADR-020.

### Endpoint Reference

| Endpoint                | Purpose                | Use Case                                |
| ----------------------- | ---------------------- | --------------------------------------- |
| `/api/health/ping`      | Simple connectivity    | Quick "is it running?" check            |
| `/api/health/live`      | Liveness probe         | Container orchestration restart trigger |
| `/api/health/ready`     | Readiness probe        | Load balancer traffic routing           |
| `/api/health/startup`   | Startup probe          | Initial container readiness             |
| `/api/health/db-schema` | Schema verification    | Deployment validation                   |
| `/api/health/db-pool`   | Connection pool status | Performance diagnostics                 |
| `/api/health/redis`     | Redis connectivity     | Cache/queue health                      |
| `/api/health/storage`   | File storage access    | Upload capability                       |
| `/api/health/time`      | Server time sync       | Time-sensitive operations               |

### Liveness Probe (`/api/health/live`)

Returns 200 OK if the Node.js process is running. It checks no external dependencies.

```bash
# Check liveness
curl -s https://flyer-crawler.projectium.com/api/health/live | jq .

# Expected response
{
  "success": true,
  "data": {
    "status": "ok",
    "timestamp": "2026-01-22T10:00:00.000Z"
  }
}
```

**Usage**: If this endpoint fails, restart the application immediately.

### Readiness Probe (`/api/health/ready`)

Comprehensive check of all critical dependencies: database, Redis, and storage.

```bash
# Check readiness
curl -s https://flyer-crawler.projectium.com/api/health/ready | jq .

# Expected healthy response (200)
{
  "success": true,
  "data": {
    "status": "healthy",
    "timestamp": "2026-01-22T10:00:00.000Z",
    "uptime": 3600.5,
    "services": {
      "database": {
        "status": "healthy",
        "latency": 5,
        "details": {
          "totalConnections": 10,
          "idleConnections": 8,
          "waitingConnections": 0
        }
      },
      "redis": {
        "status": "healthy",
        "latency": 2
      },
      "storage": {
        "status": "healthy",
        "latency": 1,
        "details": {
          "path": "/var/www/flyer-crawler.projectium.com/flyer-images"
        }
      }
    }
  }
}
```

**Status Values**:

| Status      | Meaning                                          | Action                    |
| ----------- | ------------------------------------------------ | ------------------------- |
| `healthy`   | All critical services operational                | None required             |
| `degraded`  | Non-critical issues (e.g., high connection wait) | Monitor closely           |
| `unhealthy` | Critical service unavailable (returns 503)       | Remove from load balancer |

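For cron jobs or external monitors, the status string maps naturally onto exit codes. A minimal sketch, assuming `curl` and `jq` are available; the `check_ready` wrapper and its default base URL are illustrative:

```shell
#!/bin/sh
# Map a readiness status string to an exit code:
# 0 = healthy, 1 = degraded, 2 = unhealthy or unknown.
status_to_exit() {
  case "$1" in
    healthy)  return 0 ;;
    degraded) return 1 ;;
    *)        return 2 ;;
  esac
}

# Illustrative wrapper: fetch the readiness endpoint and classify it.
check_ready() {
  base="${1:-https://flyer-crawler.projectium.com}"
  status=$(curl -s "$base/api/health/ready" | jq -r '.data.status // "unknown"')
  status_to_exit "$status"
}
```

An external monitor can then alert on any non-zero exit from `check_ready`.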
### Database Health Thresholds

| Metric              | Healthy             | Degraded | Unhealthy        |
| ------------------- | ------------------- | -------- | ---------------- |
| Query response      | `SELECT 1` succeeds | N/A      | Connection fails |
| Waiting connections | 0-3                 | 4+       | N/A              |

### Verifying Services from CLI

**Production**:

```bash
# Quick health check
curl -s https://flyer-crawler.projectium.com/api/health/ready | jq '.data.status'

# Database pool status
curl -s https://flyer-crawler.projectium.com/api/health/db-pool | jq .

# Redis health
curl -s https://flyer-crawler.projectium.com/api/health/redis | jq .
```

**Test Environment**:

```bash
# Test environment runs on port 3002
curl -s https://flyer-crawler-test.projectium.com/api/health/ready | jq .
```

**Dev Container**:

```bash
# From inside the container
curl -s http://localhost:3001/api/health/ready | jq .

# From the Windows host (via port mapping)
curl -s http://localhost:3001/api/health/ready | jq .
```

### Admin System Check UI

The admin dashboard at `/admin` includes a **System Check** component that runs all health checks with a visual interface:

1. Navigate to `https://flyer-crawler.projectium.com/admin`
2. Log in with admin credentials
3. View the "System Check" section
4. Click "Re-run Checks" to verify all services

Checks include:

- Backend Server Connection
- PM2 Process Status
- Database Connection Pool
- Redis Connection
- Database Schema
- Default Admin User
- Assets Storage Directory
- Gemini API Key

---

## Bugsink Error Tracking

Bugsink is our self-hosted, Sentry-compatible error tracking system (ADR-015).

### Access Points

| Environment       | URL                              | Purpose                    |
| ----------------- | -------------------------------- | -------------------------- |
| **Production**    | `https://bugsink.projectium.com` | Production and test errors |
| **Dev Container** | `https://localhost:8443`         | Local development errors   |

### Credentials

**Production Bugsink**:

- Credentials stored in password manager
- Admin account created during initial deployment

**Dev Container Bugsink**:

- Email: `admin@localhost`
- Password: `admin`

### Projects

| Project ID | Name                              | Environment | Error Source                    |
| ---------- | --------------------------------- | ----------- | ------------------------------- |
| 1          | flyer-crawler-backend             | Production  | Backend Node.js errors          |
| 2          | flyer-crawler-frontend            | Production  | Frontend JavaScript errors      |
| 3          | flyer-crawler-backend-test        | Test        | Test environment backend        |
| 4          | flyer-crawler-frontend-test       | Test        | Test environment frontend       |
| 5          | flyer-crawler-infrastructure      | Production  | PostgreSQL, Redis, NGINX errors |
| 6          | flyer-crawler-test-infrastructure | Test        | Test infra errors               |

**Dev Container Projects** (localhost:8000):

- Project 1: Backend (Dev)
- Project 2: Frontend (Dev)

### Accessing Errors via Web UI

1. Navigate to the Bugsink URL
2. Log in with credentials
3. Select a project from the sidebar
4. Click on an issue to view details

**Issue Details Include**:

- Exception type and message
- Full stack trace
- Request context (URL, method, headers)
- User context (if authenticated)
- Occurrence statistics (first seen, last seen, count)
- Release/version information

### Accessing Errors via MCP

Claude Code and other AI tools can access Bugsink via MCP servers.

**Available MCP Tools**:

```bash
# List all projects
mcp__bugsink__list_projects

# List unresolved issues for a project
mcp__bugsink__list_issues --project_id 1 --status unresolved

# Get issue details
mcp__bugsink__get_issue --issue_id <uuid>

# Get stacktrace (pre-rendered Markdown)
mcp__bugsink__get_stacktrace --event_id <uuid>

# List events for an issue
mcp__bugsink__list_events --issue_id <uuid>
```

**MCP Server Configuration**:

Production (in `~/.claude/settings.json`):

```json
{
  "bugsink": {
    "command": "node",
    "args": ["d:\\gitea\\bugsink-mcp\\dist\\index.js"],
    "env": {
      "BUGSINK_URL": "https://bugsink.projectium.com",
      "BUGSINK_TOKEN": "<token>"
    }
  }
}
```

Dev Container (in `.mcp.json`):

```json
{
  "localerrors": {
    "command": "node",
    "args": ["d:\\gitea\\bugsink-mcp\\dist\\index.js"],
    "env": {
      "BUGSINK_URL": "http://127.0.0.1:8000",
      "BUGSINK_TOKEN": "<token>"
    }
  }
}
```

### Creating API Tokens

Bugsink 2.0.11 has no UI for managing API tokens; create one with a Django management command.

**Production**:

```bash
ssh root@projectium.com "cd /opt/bugsink && bugsink-manage create_auth_token"
```

**Dev Container**:

```bash
MSYS_NO_PATHCONV=1 podman exec -e DATABASE_URL=postgresql://bugsink:bugsink_dev_password@postgres:5432/bugsink -e SECRET_KEY=dev-bugsink-secret-key-minimum-50-characters-for-security flyer-crawler-dev sh -c 'cd /opt/bugsink/conf && DJANGO_SETTINGS_MODULE=bugsink_conf PYTHONPATH=/opt/bugsink/conf:/opt/bugsink/lib/python3.10/site-packages /opt/bugsink/bin/python -m django create_auth_token'
```

The command outputs a 40-character hex token.

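Tokens are easy to truncate when copying from a terminal; a quick format check before saving one (illustrative helper, not part of Bugsink):

```shell
#!/bin/sh
# Verify a Bugsink API token looks like a 40-character lowercase hex string.
is_valid_token() {
  printf '%s' "$1" | grep -Eq '^[0-9a-f]{40}$'
}

# Usage: is_valid_token "$BUGSINK_TOKEN" || echo "token looks malformed"
```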
### Interpreting Errors

**Error Anatomy**:

```
TypeError: Cannot read properties of undefined (reading 'map')
├── Exception Type: TypeError
├── Message: Cannot read properties of undefined (reading 'map')
├── Where: FlyerItemsList.tsx:45:23
├── When: 2026-01-22T10:30:00.000Z
├── Count: 12 occurrences
└── Context:
    ├── URL: GET /api/flyers/123/items
    ├── User: user@example.com
    └── Release: v0.12.5
```

**Common Error Patterns**:

| Pattern                             | Likely Cause                                      | Investigation                                      |
| ----------------------------------- | ------------------------------------------------- | -------------------------------------------------- |
| `TypeError: ... undefined`          | Missing null check, API returned unexpected shape | Check API response, add defensive coding           |
| `DatabaseError: Connection timeout` | Pool exhaustion, slow queries                     | Check `/api/health/db-pool`, review slow query log |
| `RedisConnectionError`              | Redis unavailable                                 | Check Redis service, network connectivity          |
| `ValidationError: ...`              | Invalid input, schema mismatch                    | Review request payload, update validation          |
| `NotFoundError: ...`                | Missing resource                                  | Verify resource exists, check ID format            |

### Error Triage Workflow

1. **Review new issues daily** in Bugsink
2. **Categorize by severity**:
   - **Critical**: Data corruption, security, payment failures
   - **High**: Core feature broken for many users
   - **Medium**: Feature degraded, workaround available
   - **Low**: Minor UX issues, cosmetic bugs
3. **Check occurrence count** - frequent errors need urgent attention
4. **Review stack trace** - identify root cause
5. **Check recent deployments** - did a release introduce this?
6. **Create Gitea issue** if not auto-synced

### Bugsink-to-Gitea Sync

The test environment automatically syncs Bugsink issues to Gitea (see `docs/BUGSINK-SYNC.md`).

**Sync Workflow**:

1. Runs every 15 minutes on the test server
2. Fetches unresolved issues from all Bugsink projects
3. Creates Gitea issues with appropriate labels
4. Marks synced issues as resolved in Bugsink

**Manual Sync**:

```bash
# Trigger sync via API (test environment only)
curl -X POST https://flyer-crawler-test.projectium.com/api/admin/bugsink/sync \
  -H "Authorization: Bearer <admin_jwt>"
```

---

## Logstash Log Aggregation

Logstash aggregates logs from multiple sources and forwards errors to Bugsink (ADR-050).

### Architecture

```
Log Sources              Logstash              Outputs
┌──────────────┐      ┌─────────────┐      ┌─────────────┐
│ PostgreSQL   │──────│             │──────│ Bugsink     │
│ PM2 Workers  │──────│  Filter     │──────│ (errors)    │
│ Redis        │──────│  & Route    │──────│             │
│ NGINX        │──────│             │──────│ File Logs   │
└──────────────┘      └─────────────┘      │ (all logs)  │
                                           └─────────────┘
```

### Configuration Files

| Path                                                | Purpose                     |
| --------------------------------------------------- | --------------------------- |
| `/etc/logstash/conf.d/bugsink.conf`                 | Main pipeline configuration |
| `/etc/postgresql/14/main/conf.d/observability.conf` | PostgreSQL logging settings |
| `/var/log/logstash/`                                | Logstash file outputs       |
| `/var/lib/logstash/sincedb_*`                       | File position tracking      |

### Log Sources

| Source      | Path                                               | Contents                            |
| ----------- | -------------------------------------------------- | ----------------------------------- |
| PostgreSQL  | `/var/log/postgresql/*.log`                        | Function logs, slow queries, errors |
| PM2 Workers | `/home/gitea-runner/.pm2/logs/flyer-crawler-*.log` | Worker stdout/stderr                |
| Redis       | `/var/log/redis/redis-server.log`                  | Connection errors, memory warnings  |
| NGINX       | `/var/log/nginx/access.log`, `error.log`           | HTTP requests, upstream errors      |

### Pipeline Status

**Check Logstash Service**:

```bash
ssh root@projectium.com

# Service status
systemctl status logstash

# Recent logs
journalctl -u logstash -n 50 --no-pager

# Pipeline statistics
curl -s http://localhost:9600/_node/stats/pipelines?pretty | jq '.pipelines.main.events'

# Events processed today
curl -s http://localhost:9600/_node/stats/pipelines?pretty | jq '{
  in: .pipelines.main.events.in,
  out: .pipelines.main.events.out,
  filtered: .pipelines.main.events.filtered
}'
```

**Check Filter Performance**:

```bash
# Grok pattern success/failure rates
curl -s http://localhost:9600/_node/stats/pipelines?pretty | \
  jq '.pipelines.main.plugins.filters[] | select(.name == "grok") | {name, events_in: .events.in, events_out: .events.out, failures}'
```

### Viewing Aggregated Logs

```bash
# PM2 worker logs (all workers combined)
tail -f /var/log/logstash/pm2-workers-$(date +%Y-%m-%d).log

# Redis operational logs
tail -f /var/log/logstash/redis-operational-$(date +%Y-%m-%d).log

# NGINX access logs (parsed)
tail -f /var/log/logstash/nginx-access-$(date +%Y-%m-%d).log

# PostgreSQL function logs
tail -f /var/log/postgresql/postgresql-$(date +%Y-%m-%d).log
```

### Troubleshooting Logstash

| Issue                 | Diagnostic                  | Solution                        |
| --------------------- | --------------------------- | ------------------------------- |
| No events processed   | `systemctl status logstash` | Start/restart service           |
| Config syntax error   | Test config command         | Fix config file                 |
| Grok failures         | Check stats endpoint        | Update grok patterns            |
| Wrong Bugsink project | Check environment tags      | Verify tag routing              |
| Permission denied     | `groups logstash`           | Add to `postgres`, `adm` groups |
| PM2 logs not captured | Check file paths            | Verify log file existence       |
| High disk usage       | Check log rotation          | Configure logrotate             |

**Test Configuration**:

```bash
/usr/share/logstash/bin/logstash --config.test_and_exit -f /etc/logstash/conf.d/bugsink.conf
```

**Restart After Config Change**:

```bash
systemctl restart logstash
journalctl -u logstash -f  # Watch for startup errors
```

---

## PM2 Process Monitoring

PM2 manages the Node.js application processes in production.

### Process Overview

**Production Processes** (`ecosystem.config.cjs`):

| Process Name                     | Script      | Purpose              | Instances          |
| -------------------------------- | ----------- | -------------------- | ------------------ |
| `flyer-crawler-api`              | `server.ts` | Express API server   | Cluster (max CPUs) |
| `flyer-crawler-worker`           | `worker.ts` | BullMQ job processor | 1                  |
| `flyer-crawler-analytics-worker` | `worker.ts` | Analytics jobs       | 1                  |

**Test Processes** (`ecosystem-test.config.cjs`):

| Process Name                          | Script      | Port | Instances     |
| ------------------------------------- | ----------- | ---- | ------------- |
| `flyer-crawler-api-test`              | `server.ts` | 3002 | 1 (fork mode) |
| `flyer-crawler-worker-test`           | `worker.ts` | N/A  | 1             |
| `flyer-crawler-analytics-worker-test` | `worker.ts` | N/A  | 1             |

### Basic Commands

```bash
ssh root@projectium.com
su - gitea-runner  # PM2 runs under this user

# List all processes
pm2 list

# Process details
pm2 show flyer-crawler-api

# Monitor in real-time
pm2 monit

# View logs
pm2 logs flyer-crawler-api
pm2 logs flyer-crawler-worker --lines 100

# View all logs
pm2 logs

# Restart processes
pm2 restart flyer-crawler-api
pm2 restart all

# Reload without downtime (cluster mode only)
pm2 reload flyer-crawler-api

# Stop processes
pm2 stop flyer-crawler-api
```

### Health Indicators

**Healthy Process**:

```
┌──────────────────────┬────┬─────────┬─────────┬───────┬────────┬─────────┬──────────┐
│ Name                 │ id │ mode    │ status  │ cpu   │ mem    │ uptime  │ restarts │
├──────────────────────┼────┼─────────┼─────────┼───────┼────────┼─────────┼──────────┤
│ flyer-crawler-api    │ 0  │ cluster │ online  │ 0.5%  │ 150MB  │ 5d      │ 0        │
│ flyer-crawler-api    │ 1  │ cluster │ online  │ 0.3%  │ 145MB  │ 5d      │ 0        │
│ flyer-crawler-worker │ 2  │ fork    │ online  │ 0.1%  │ 200MB  │ 5d      │ 0        │
└──────────────────────┴────┴─────────┴─────────┴───────┴────────┴─────────┴──────────┘
```

**Warning Signs**:

- `status: errored` - Process crashed
- High `restarts` count - Instability
- High `mem` (>500MB for API, >1GB for workers) - Memory leak
- Low `uptime` with high restarts - Repeated crashes

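These warning signs can be checked programmatically from `pm2 jlist`, which emits the process table as JSON. A sketch using `jq` (the `pm2_env.restart_time` and `name` fields are standard `pm2 jlist` output; the 10-restart cutoff is an illustrative choice):

```shell
#!/bin/sh
# Print "name restart-count" for every process whose restart count
# exceeds $1, reading pm2 jlist JSON on stdin.
flag_restarts() {
  jq -r --argjson max "$1" \
    '.[] | select(.pm2_env.restart_time > $max) | "\(.name) \(.pm2_env.restart_time)"'
}

# Usage: pm2 jlist | flag_restarts 10
```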
### Log File Locations

| Process                | stdout                                                      | stderr          |
| ---------------------- | ----------------------------------------------------------- | --------------- |
| `flyer-crawler-api`    | `/home/gitea-runner/.pm2/logs/flyer-crawler-api-out.log`    | `...-error.log` |
| `flyer-crawler-worker` | `/home/gitea-runner/.pm2/logs/flyer-crawler-worker-out.log` | `...-error.log` |

### Memory Management

PM2 is configured to restart processes when they exceed memory limits:

| Process          | Memory Limit | Action       |
| ---------------- | ------------ | ------------ |
| API              | 500MB        | Auto-restart |
| Worker           | 1GB          | Auto-restart |
| Analytics Worker | 1GB          | Auto-restart |

**Check Memory Usage**:

```bash
pm2 show flyer-crawler-api | grep memory
pm2 show flyer-crawler-worker | grep memory
```

### Restart Strategies

PM2 uses exponential backoff for restarts:

```javascript
{
  max_restarts: 40,
  exp_backoff_restart_delay: 100, // Start at 100ms, exponentially increase
  min_uptime: '10s', // Must run 10s to be considered "started"
}
```

**Force Restart After Repeated Failures**:

```bash
pm2 delete flyer-crawler-api
pm2 start ecosystem.config.cjs --only flyer-crawler-api
```

---

## Database Monitoring

### Connection Pool Status

The application uses a PostgreSQL connection pool with these defaults:

| Setting                   | Value | Purpose                          |
| ------------------------- | ----- | -------------------------------- |
| `max`                     | 20    | Maximum concurrent connections   |
| `idleTimeoutMillis`       | 30000 | Close idle connections after 30s |
| `connectionTimeoutMillis` | 2000  | Fail if connection takes >2s     |

**Check Pool Status via API**:

```bash
curl -s https://flyer-crawler.projectium.com/api/health/db-pool | jq .

# Response
{
  "success": true,
  "data": {
    "message": "Pool Status: 10 total, 8 idle, 0 waiting.",
    "totalCount": 10,
    "idleCount": 8,
    "waitingCount": 0
  }
}
```

**Pool Health Thresholds**:

| Metric              | Healthy | Warning | Critical   |
| ------------------- | ------- | ------- | ---------- |
| Waiting Connections | 0-2     | 3-4     | 5+         |
| Total Connections   | 1-15    | 16-19   | 20 (maxed) |

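The waiting-connection thresholds above fold into a small check suitable for a cron job. A sketch; the limits mirror the table, and the commented `curl` usage is illustrative:

```shell
#!/bin/sh
# Classify the pool's waiting-connection count per the thresholds above.
pool_waiting_level() {
  w="$1"
  if [ "$w" -le 2 ]; then echo healthy
  elif [ "$w" -le 4 ]; then echo warning
  else echo critical
  fi
}

# Illustrative usage against the live endpoint:
#   w=$(curl -s https://flyer-crawler.projectium.com/api/health/db-pool \
#         | jq -r '.data.waitingCount')
#   pool_waiting_level "$w"
```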
### Slow Query Logging

PostgreSQL is configured to log slow queries:

```ini
# /etc/postgresql/14/main/conf.d/observability.conf
log_min_duration_statement = 1000  # Log queries over 1 second
```

**View Slow Queries**:

```bash
ssh root@projectium.com
grep "duration:" /var/log/postgresql/postgresql-$(date +%Y-%m-%d).log | tail -20
```

### Database Size Monitoring

```bash
# Connect to the production database
psql -h localhost -U flyer_crawler_prod -d flyer-crawler-prod
```

Then, at the `psql` prompt:

```sql
-- Database size
SELECT pg_size_pretty(pg_database_size('flyer-crawler-prod'));

-- Table sizes
SELECT
  relname AS table,
  pg_size_pretty(pg_total_relation_size(relid)) AS total_size,
  pg_size_pretty(pg_relation_size(relid)) AS data_size,
  pg_size_pretty(pg_indexes_size(relid)) AS index_size
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 10;

-- Check for bloat
SELECT schemaname, relname, n_dead_tup, n_live_tup,
       round(n_dead_tup * 100.0 / nullif(n_live_tup + n_dead_tup, 0), 2) AS dead_pct
FROM pg_stat_user_tables
WHERE n_dead_tup > 1000
ORDER BY n_dead_tup DESC;
```

### Disk Space Monitoring

```bash
# Check PostgreSQL data directory
du -sh /var/lib/postgresql/14/main/

# Check available disk space
df -h /var/lib/postgresql/

# Estimate growth rate
psql -c "SELECT date_trunc('day', created_at) as day, count(*)
         FROM flyer_items
         WHERE created_at > now() - interval '7 days'
         GROUP BY 1 ORDER BY 1;"
```

### Database Health via MCP

```bash
# Query database directly
mcp__devdb__query --sql "SELECT count(*) FROM flyers WHERE created_at > now() - interval '1 day'"

# Check connection count
mcp__devdb__query --sql "SELECT count(*) FROM pg_stat_activity WHERE datname = 'flyer_crawler_dev'"
```

---

## Redis Monitoring

### Basic Health Check

```bash
# Via API endpoint
curl -s https://flyer-crawler.projectium.com/api/health/redis | jq .

# Direct Redis check (on server)
redis-cli ping  # Should return PONG
```

### Memory Usage

```bash
redis-cli info memory | grep -E "used_memory_human|maxmemory_human|mem_fragmentation_ratio"

# Expected output
used_memory_human:50.00M
maxmemory_human:256.00M
mem_fragmentation_ratio:1.05
```

**Memory Thresholds**:

| Metric              | Healthy     | Warning | Critical |
| ------------------- | ----------- | ------- | -------- |
| Used Memory         | <70% of max | 70-85%  | >85%     |
| Fragmentation Ratio | 1.0-1.5     | 1.5-2.0 | >2.0     |

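To evaluate the used-memory threshold numerically, compute the percentage from the raw byte counters (`used_memory` and `maxmemory` are standard `INFO memory` keys). A sketch:

```shell
#!/bin/sh
# Compute used memory as a percent of maxmemory from "redis-cli info memory"
# output on stdin. Prints "unlimited" when maxmemory is 0 (no limit set).
mem_used_pct() {
  tr -d '\r' | awk -F: '
    $1 == "used_memory" { used = $2 }
    $1 == "maxmemory"   { max = $2 }
    END {
      if (max > 0) printf "%.0f\n", used * 100 / max
      else print "unlimited"
    }'
}

# Usage: redis-cli info memory | mem_used_pct
```

The `tr -d '\r'` strips the carriage returns Redis uses in `INFO` replies, which would otherwise corrupt the field comparison.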
### Cache Statistics

```bash
redis-cli info stats | grep -E "keyspace_hits|keyspace_misses|evicted_keys"

# Calculate hit rate
# Hit Rate = keyspace_hits / (keyspace_hits + keyspace_misses) * 100
```

**Cache Hit Rate Targets**:

- Excellent: >95%
- Good: 85-95%
- Needs attention: <85%

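The hit-rate formula above can be computed directly from `redis-cli` output. A sketch using `awk` (the field names are the standard `INFO stats` keys):

```shell
#!/bin/sh
# Compute the cache hit rate (percent) from "redis-cli info stats" output
# piped on stdin. Prints 0.0 when there is no traffic yet.
hit_rate() {
  tr -d '\r' | awk -F: '
    $1 == "keyspace_hits"   { hits = $2 }
    $1 == "keyspace_misses" { misses = $2 }
    END {
      total = hits + misses
      printf "%.1f\n", (total > 0) ? hits * 100 / total : 0
    }'
}

# Usage: redis-cli info stats | hit_rate
```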
### Queue Monitoring

BullMQ queues are stored in Redis:

```bash
# List all queues (KEYS blocks Redis while scanning; prefer SCAN on busy instances)
redis-cli keys "bull:*:id"

# Check queue depths
redis-cli llen "bull:flyer-processing:wait"
redis-cli llen "bull:email-sending:wait"
redis-cli llen "bull:analytics-reporting:wait"

# Check failed jobs
redis-cli llen "bull:flyer-processing:failed"
```

**Queue Depth Thresholds**:

| Queue               | Normal | Warning | Critical |
| ------------------- | ------ | ------- | -------- |
| flyer-processing    | 0-10   | 11-50   | >50      |
| email-sending       | 0-100  | 101-500 | >500     |
| analytics-reporting | 0-5    | 6-20    | >20      |

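Those thresholds are straightforward to script. A sketch that classifies a queue depth against per-queue limits (the limits in the commented usage mirror the table; the alerting hook is left out):

```shell
#!/bin/sh
# Classify a queue depth against warning/critical limits.
# Usage: queue_level <depth> <warn_limit> <crit_limit>
queue_level() {
  if [ "$1" -gt "$3" ]; then echo critical
  elif [ "$1" -gt "$2" ]; then echo warning
  else echo normal
  fi
}

# Illustrative usage for the flyer-processing queue:
#   depth=$(redis-cli llen "bull:flyer-processing:wait")
#   queue_level "$depth" 10 50
```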
### Bull Board UI

Access the job queue dashboard:

- **Production**: `https://flyer-crawler.projectium.com/api/admin/jobs` (requires admin auth)
- **Test**: `https://flyer-crawler-test.projectium.com/api/admin/jobs`
- **Dev**: `http://localhost:3001/api/admin/jobs`

Features:

- View all queues and job counts
- Inspect job data and errors
- Retry failed jobs
- Clean completed jobs

### Redis Database Allocation

| Database | Purpose                  |
| -------- | ------------------------ |
| 0        | BullMQ production queues |
| 1        | BullMQ test queues       |
| 15       | Bugsink sync state       |

---

## Production Alerts and On-Call

### Critical Monitoring Targets

| Service    | Check               | Interval | Alert Threshold        |
| ---------- | ------------------- | -------- | ---------------------- |
| API Server | `/api/health/ready` | 1 min    | 2 consecutive failures |
| Database   | Pool waiting count  | 1 min    | >5 waiting             |
| Redis      | Memory usage        | 5 min    | >85% of maxmemory      |
| Disk Space | `/var/log`          | 15 min   | <10GB free             |
| Worker     | Queue depth         | 5 min    | >50 jobs waiting       |
| Error Rate | Bugsink issue count | 15 min   | >10 new issues/hour    |

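As an example of wiring up one of these targets, a sketch for the disk-space check (the 10 GB limit comes from the table; `df -Pk` gives POSIX-stable, kilobyte-denominated output):

```shell
#!/bin/sh
# Return 0 if the filesystem holding $1 has at least $2 kilobytes free.
has_free_kb() {
  avail=$(df -Pk "$1" | awk 'NR==2 { print $4 }')
  [ "$avail" -ge "$2" ]
}

# Illustrative usage: alert when /var/log has under 10 GB free.
#   has_free_kb /var/log $((10 * 1024 * 1024)) || echo "ALERT: /var/log low on disk"
```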
### Alert Channels

Configure alerts in your monitoring tool (UptimeRobot, Datadog, etc.):

1. **Slack channel**: `#flyer-crawler-alerts`
2. **Email**: On-call rotation email
3. **PagerDuty**: Critical issues only

### On-Call Response Procedures

**P1 - Critical (Site Down)**:

1. Acknowledge alert within 5 minutes
2. Check `/api/health/ready` - identify failing service
3. Check PM2 status: `pm2 list`
4. Check recent deploys: `git log -5 --oneline`
5. If database: check pool, restart if needed
6. If Redis: check memory, flush if critical
7. If application: restart PM2 processes
8. Document in incident channel

**P2 - High (Degraded Service)**:

1. Acknowledge within 15 minutes
2. Review Bugsink for error patterns
3. Check system resources (CPU, memory, disk)
4. Identify root cause
5. Plan remediation
6. Create Gitea issue if not auto-created

**P3 - Medium (Non-Critical)**:

1. Acknowledge within 1 hour
2. Review during business hours
3. Create Gitea issue for tracking

### Quick Diagnostic Commands

```bash
# Full system health check
ssh root@projectium.com << 'EOF'
echo "=== Service Status ==="
systemctl status pm2-gitea-runner --no-pager
systemctl status logstash --no-pager
systemctl status redis --no-pager
systemctl status postgresql --no-pager

echo "=== PM2 Processes ==="
su - gitea-runner -c "pm2 list"

echo "=== Disk Space ==="
df -h / /var

echo "=== Memory ==="
free -h

echo "=== Recent Errors ==="
journalctl -p err -n 20 --no-pager
EOF
```

### Runbook Quick Reference

| Symptom         | First Action     | If That Fails         |
| --------------- | ---------------- | --------------------- |
| 503 errors      | Restart PM2      | Check database, Redis |
| Slow responses  | Check DB pool    | Review slow query log |
| High error rate | Check Bugsink    | Review recent deploys |
| Queue backlog   | Restart worker   | Scale workers         |
| Out of memory   | Restart process  | Increase PM2 limit    |
| Disk full       | Clean old logs   | Expand volume         |
| Redis OOM       | Flush cache keys | Increase maxmemory    |

### Post-Incident Review

After any P1/P2 incident:

1. Write an incident report within 24 hours
2. Identify the root cause
3. Document the timeline of events
4. List action items to prevent recurrence
5. Schedule a review meeting if needed
6. Update runbooks if new procedures were discovered

---

## Related Documentation

- [ADR-015: Application Performance Monitoring](../adr/0015-application-performance-monitoring-and-error-tracking.md)
- [ADR-020: Health Checks](../adr/0020-health-checks-and-liveness-readiness-probes.md)
- [ADR-050: PostgreSQL Function Observability](../adr/0050-postgresql-function-observability.md)
- [ADR-053: Worker Health Checks](../adr/0053-worker-health-checks.md)
- [DEV-CONTAINER-BUGSINK.md](../DEV-CONTAINER-BUGSINK.md)
- [BUGSINK-SYNC.md](../BUGSINK-SYNC.md)
- [LOGSTASH-QUICK-REF.md](LOGSTASH-QUICK-REF.md)
- [LOGSTASH-TROUBLESHOOTING.md](LOGSTASH-TROUBLESHOOTING.md)
- [LOGSTASH_DEPLOYMENT_CHECKLIST.md](../LOGSTASH_DEPLOYMENT_CHECKLIST.md)