Enhance logging and error handling in PostgreSQL functions; update API endpoints in E2E tests; add Logstash troubleshooting documentation
All checks were successful
Deploy to Test Environment / deploy-to-test (push) Successful in 18m25s
- Added tiered logging and error handling in various PostgreSQL functions to improve observability and error tracking.
- Updated E2E tests to reflect changes in API endpoints for fetching best watched prices.
- Introduced a comprehensive troubleshooting runbook for Logstash to assist in diagnosing common issues in the PostgreSQL observability pipeline.
@@ -1244,6 +1244,620 @@ If you only need application error tracking, the Sentry SDK integration is suffi

---

## PostgreSQL Function Observability (ADR-050)

PostgreSQL function observability provides structured logging and error tracking for database functions, preventing silent failures. This setup forwards database errors to Bugsink for centralized monitoring.

See [ADR-050](adr/0050-postgresql-function-observability.md) for the full architecture decision.

### Prerequisites

- PostgreSQL 14+ installed and running
- Logstash installed and configured (see [Logstash section](#logstash-log-aggregation) above)
- Bugsink running at `https://bugsink.projectium.com`

### Step 1: Configure PostgreSQL Logging

Create the observability configuration file:

```bash
sudo nano /etc/postgresql/14/main/conf.d/observability.conf
```

Add the following content:

```ini
# PostgreSQL Logging Configuration for Database Function Observability (ADR-050)

# Enable logging to files for Logstash pickup
logging_collector = on
log_destination = 'stderr'
log_directory = '/var/log/postgresql'
log_filename = 'postgresql-%Y-%m-%d.log'
# Group-readable log files so the logstash user (added to the postgres group in Step 2)
# can read them; the PostgreSQL default is 0600
log_file_mode = 0640
log_rotation_age = 1d
log_rotation_size = 100MB
log_truncate_on_rotation = on

# Log level - capture NOTICE and above (includes fn_log WARNING/ERROR)
log_min_messages = notice
client_min_messages = notice

# Include useful context in log prefix
log_line_prefix = '%t [%p] %u@%d '

# Capture slow queries from functions (1 second threshold)
log_min_duration_statement = 1000

# Log statement types (off for production)
log_statement = 'none'

# Connection logging (off for production to reduce noise)
log_connections = off
log_disconnections = off
```

Set up the log directory:

```bash
# Create log directory
sudo mkdir -p /var/log/postgresql

# Set ownership to postgres user
sudo chown postgres:postgres /var/log/postgresql
sudo chmod 750 /var/log/postgresql
```

Restart PostgreSQL:

```bash
sudo systemctl restart postgresql
```

Verify logging is working:

```bash
# Check that log files are being created
ls -la /var/log/postgresql/

# Should see files like: postgresql-2026-01-20.log
```

### Step 2: Configure Logstash for PostgreSQL Logs

The Logstash configuration is located at `/etc/logstash/conf.d/bugsink.conf`.

**Key features:**

- Parses PostgreSQL log format with grok patterns
- Extracts JSON from `fn_log()` function calls
- Tags WARNING/ERROR level logs
- Routes production database errors to Bugsink project 1
- Routes test database errors to Bugsink project 3
- Transforms events to Sentry-compatible format

**Configuration file:** `/etc/logstash/conf.d/bugsink.conf`

See the [Logstash Configuration Reference](#logstash-configuration-reference) below for the complete configuration.

**Grant Logstash access to PostgreSQL logs:**

```bash
# Add logstash user to postgres group
sudo usermod -aG postgres logstash

# Verify group membership
groups logstash

# Restart Logstash to apply changes
sudo systemctl restart logstash
```

### Step 3: Test the Pipeline

Test structured logging from PostgreSQL:

```bash
# Production database (routes to Bugsink project 1)
sudo -u postgres psql -d flyer-crawler-prod -c "SELECT fn_log('WARNING', 'test_observability', 'Testing PostgreSQL observability pipeline', '{\"environment\": \"production\"}'::jsonb);"

# Test database (routes to Bugsink project 3)
sudo -u postgres psql -d flyer-crawler-test -c "SELECT fn_log('WARNING', 'test_observability', 'Testing PostgreSQL observability pipeline', '{\"environment\": \"test\"}'::jsonb);"
```

Check Bugsink UI:

- Production errors: <https://bugsink.projectium.com> → Project 1 (flyer-crawler-backend)
- Test errors: <https://bugsink.projectium.com> → Project 3 (flyer-crawler-backend-test)

### Step 4: Verify Database Functions

The following critical functions use `fn_log()` for observability:

| Function                   | What it logs                             |
| -------------------------- | ---------------------------------------- |
| `award_achievement()`      | Missing achievements, duplicate awards   |
| `fork_recipe()`            | Missing original recipes                 |
| `handle_new_user()`        | User creation events                     |
| `approve_correction()`     | Permission denied, corrections not found |
| `complete_shopping_list()` | Permission checks, list not found        |

Test error logging with a database function:

```bash
# Try to award a non-existent achievement (should fail and log to Bugsink)
sudo -u postgres psql -d flyer-crawler-test -c "SELECT award_achievement('00000000-0000-0000-0000-000000000000'::uuid, 'NonexistentBadge');"

# Check Bugsink project 3 - should see an ERROR with full context
```

### Logstash Configuration Reference

Complete configuration for PostgreSQL observability (`/etc/logstash/conf.d/bugsink.conf`):

```conf
input {
  # PostgreSQL function logs (ADR-050)
  # Both production and test databases write to the same log files
  file {
    path => "/var/log/postgresql/*.log"
    type => "postgres"
    tags => ["postgres", "database"]
    start_position => "beginning"
    sincedb_path => "/var/lib/logstash/sincedb_postgres"
  }
}

filter {
  # PostgreSQL function log parsing (ADR-050)
  if [type] == "postgres" {

    # Extract timestamp, timezone, process ID, user, database, level, and message
    grok {
      match => { "message" => "%{TIMESTAMP_ISO8601:pg_timestamp} [+-]%{INT:pg_timezone} \[%{POSINT:pg_pid}\] %{DATA:pg_user}@%{DATA:pg_database} %{WORD:pg_level}: %{GREEDYDATA:pg_message}" }
    }

    # Try to parse pg_message as JSON (from fn_log())
    if [pg_message] =~ /^\{/ {
      json {
        source => "pg_message"
        target => "fn_log"
        skip_on_invalid_json => true
      }

      # Mark as error if level is WARNING or ERROR
      if [fn_log][level] in ["WARNING", "ERROR"] {
        mutate { add_tag => ["error", "db_function"] }
      }
    }

    # Also catch native PostgreSQL errors
    if [pg_level] in ["ERROR", "FATAL"] {
      mutate { add_tag => ["error", "postgres_native"] }
    }

    # Detect environment from database name
    if [pg_database] == "flyer-crawler-prod" {
      mutate {
        add_tag => ["production"]
      }
    } else if [pg_database] == "flyer-crawler-test" {
      mutate {
        add_tag => ["test"]
      }
    }

    # Generate event_id for Sentry
    if "error" in [tags] {
      uuid {
        target => "[@metadata][event_id]"
        overwrite => true
      }
    }
  }
}

output {
  # Production database errors -> project 1 (flyer-crawler-backend)
  if "error" in [tags] and "production" in [tags] {
    http {
      url => "https://bugsink.projectium.com/api/1/store/"
      http_method => "post"
      format => "json"
      headers => {
        "X-Sentry-Auth" => "Sentry sentry_version=7, sentry_client=logstash/1.0, sentry_key=911aef02b9a548fa8fabb8a3c81abfe5"
        "Content-Type" => "application/json"
      }
      mapping => {
        "event_id" => "%{[@metadata][event_id]}"
        "timestamp" => "%{@timestamp}"
        "platform" => "other"
        "level" => "error"
        "logger" => "postgresql"
        "message" => "%{[fn_log][message]}"
        "environment" => "production"
        "extra" => {
          "pg_user" => "%{[pg_user]}"
          "pg_database" => "%{[pg_database]}"
          "pg_function" => "%{[fn_log][function]}"
          "pg_level" => "%{[pg_level]}"
          "context" => "%{[fn_log][context]}"
        }
      }
    }
  }

  # Test database errors -> project 3 (flyer-crawler-backend-test)
  if "error" in [tags] and "test" in [tags] {
    http {
      url => "https://bugsink.projectium.com/api/3/store/"
      http_method => "post"
      format => "json"
      headers => {
        "X-Sentry-Auth" => "Sentry sentry_version=7, sentry_client=logstash/1.0, sentry_key=cdb99c314589431e83d4cc38a809449b"
        "Content-Type" => "application/json"
      }
      mapping => {
        "event_id" => "%{[@metadata][event_id]}"
        "timestamp" => "%{@timestamp}"
        "platform" => "other"
        "level" => "error"
        "logger" => "postgresql"
        "message" => "%{[fn_log][message]}"
        "environment" => "test"
        "extra" => {
          "pg_user" => "%{[pg_user]}"
          "pg_database" => "%{[pg_database]}"
          "pg_function" => "%{[fn_log][function]}"
          "pg_level" => "%{[pg_level]}"
          "context" => "%{[fn_log][context]}"
        }
      }
    }
  }
}
```

### Extended Logstash Configuration (PM2, Redis, NGINX)

The complete production Logstash configuration includes additional log sources beyond PostgreSQL:

**Input Sources:**

```conf
input {
  # PostgreSQL function logs (shown above)

  # PM2 Worker stdout logs (production)
  file {
    path => "/home/gitea-runner/.pm2/logs/flyer-crawler-worker-*.log"
    type => "pm2_stdout"
    tags => ["infra", "pm2", "worker", "production"]
    start_position => "end"
    sincedb_path => "/var/lib/logstash/sincedb_pm2_worker_prod"
    exclude => "*-test-*.log"
  }

  # PM2 Analytics Worker stdout (production)
  file {
    path => "/home/gitea-runner/.pm2/logs/flyer-crawler-analytics-worker-*.log"
    type => "pm2_stdout"
    tags => ["infra", "pm2", "analytics", "production"]
    start_position => "end"
    sincedb_path => "/var/lib/logstash/sincedb_pm2_analytics_prod"
    exclude => "*-test-*.log"
  }

  # PM2 Worker stdout (test environment)
  file {
    path => "/home/gitea-runner/.pm2/logs/flyer-crawler-worker-test-*.log"
    type => "pm2_stdout"
    tags => ["infra", "pm2", "worker", "test"]
    start_position => "end"
    sincedb_path => "/var/lib/logstash/sincedb_pm2_worker_test"
  }

  # PM2 Analytics Worker stdout (test environment)
  file {
    path => "/home/gitea-runner/.pm2/logs/flyer-crawler-analytics-worker-test-*.log"
    type => "pm2_stdout"
    tags => ["infra", "pm2", "analytics", "test"]
    start_position => "end"
    sincedb_path => "/var/lib/logstash/sincedb_pm2_analytics_test"
  }

  # Redis logs (already configured)
  file {
    path => "/var/log/redis/redis-server.log"
    type => "redis"
    tags => ["infra", "redis"]
    start_position => "end"
    sincedb_path => "/var/lib/logstash/sincedb_redis"
  }

  # NGINX access logs
  file {
    path => "/var/log/nginx/access.log"
    type => "nginx_access"
    tags => ["infra", "nginx", "access"]
    start_position => "end"
    sincedb_path => "/var/lib/logstash/sincedb_nginx_access"
  }

  # NGINX error logs
  file {
    path => "/var/log/nginx/error.log"
    type => "nginx_error"
    tags => ["infra", "nginx", "error"]
    start_position => "end"
    sincedb_path => "/var/lib/logstash/sincedb_nginx_error"
  }
}
```

**Filter Rules:**

```conf
filter {
  # PostgreSQL filters (shown above)

  # PM2 Worker log parsing
  if [type] == "pm2_stdout" {
    # Try to parse as JSON first (if worker uses Pino)
    json {
      source => "message"
      target => "pm2_json"
      skip_on_invalid_json => true
    }

    # If JSON parsing succeeded, extract level and tag errors
    if [pm2_json][level] {
      if [pm2_json][level] >= 50 {
        mutate { add_tag => ["error"] }
      }
    }
    # If not JSON, check for error keywords in plain text
    else if [message] =~ /(Error|ERROR|Exception|EXCEPTION|Fatal|FATAL|failed|FAILED)/ {
      mutate { add_tag => ["error"] }
    }

    # Generate event_id for errors
    if "error" in [tags] {
      uuid {
        target => "[@metadata][event_id]"
        overwrite => true
      }
    }
  }

  # Redis log parsing
  if [type] == "redis" {
    grok {
      match => { "message" => "%{POSINT:pid}:%{WORD:role} %{MONTHDAY} %{MONTH} %{TIME} %{WORD:loglevel} %{GREEDYDATA:redis_message}" }
    }

    # Tag errors (WARNING/ERROR) for Bugsink forwarding
    if [loglevel] in ["WARNING", "ERROR"] {
      mutate { add_tag => ["error"] }
      uuid {
        target => "[@metadata][event_id]"
        overwrite => true
      }
    }
    # Tag INFO-level operational events (startup, config, persistence)
    else if [loglevel] == "INFO" {
      mutate { add_tag => ["redis_operational"] }
    }
  }

  # NGINX access log parsing
  if [type] == "nginx_access" {
    grok {
      match => { "message" => "%{COMBINEDAPACHELOG}" }
    }

    # Parse response time if available (requires NGINX log format with request_time)
    if [message] =~ /request_time:(\d+\.\d+)/ {
      grok {
        match => { "message" => "request_time:(?<request_time_seconds>\d+\.\d+)" }
      }
    }

    # Categorize by status code
    if [response] =~ /^5\d{2}$/ {
      mutate { add_tag => ["error", "http_5xx"] }
      uuid {
        target => "[@metadata][event_id]"
        overwrite => true
      }
    }
    else if [response] =~ /^4\d{2}$/ {
      mutate { add_tag => ["client_error", "http_4xx"] }
    }
    else if [response] =~ /^2\d{2}$/ {
      mutate { add_tag => ["success", "http_2xx"] }
    }
    else if [response] =~ /^3\d{2}$/ {
      mutate { add_tag => ["redirect", "http_3xx"] }
    }

    # Tag slow requests (>1 second response time)
    if [request_time_seconds] and [request_time_seconds] > 1.0 {
      mutate { add_tag => ["slow_request"] }
    }

    # Always tag for monitoring
    mutate { add_tag => ["access_log"] }
  }

  # NGINX error log parsing
  if [type] == "nginx_error" {
    mutate { add_tag => ["error"] }
    uuid {
      target => "[@metadata][event_id]"
      overwrite => true
    }
  }
}
```
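
The status-code branches in the NGINX access-log filter can be sanity-checked outside Logstash. The sketch below mirrors that tagging logic in plain shell so a status code can be mapped to the tags the filter is expected to add. `classify_status` is a hypothetical helper name used only for illustration; it is not part of the pipeline.

```bash
# Hypothetical helper mirroring the NGINX status-code tagging in the filter above.
# Prints the tags the filter would add for a given HTTP status code,
# or an empty line when no branch matches (for example 1xx codes).
classify_status() {
  case "$1" in
    5[0-9][0-9]) echo "error http_5xx" ;;
    4[0-9][0-9]) echo "client_error http_4xx" ;;
    2[0-9][0-9]) echo "success http_2xx" ;;
    3[0-9][0-9]) echo "redirect http_3xx" ;;
    *)           echo "" ;;
  esac
}
```

For example, `classify_status 502` prints the tags of the 5xx branch, which is the only branch that also generates an `event_id` for forwarding.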

**Output Rules:**

```conf
output {
  # Production errors -> Bugsink infrastructure project (5)
  # Matches events tagged "error", "infra", and "production". With the inputs shown above,
  # only the PM2 worker inputs carry both "infra" and "production": the Redis and NGINX
  # inputs have no environment tag, and PostgreSQL events have no "infra" tag.
  if "error" in [tags] and "infra" in [tags] and "production" in [tags] {
    http {
      url => "https://bugsink.projectium.com/api/5/store/"
      http_method => "post"
      format => "json"
      headers => {
        "X-Sentry-Auth" => "Sentry sentry_version=7, sentry_client=logstash/1.0, sentry_key=b083076f94fb461b889d5dffcbef43bf"
        "Content-Type" => "application/json"
      }
      mapping => {
        "event_id" => "%{[@metadata][event_id]}"
        "timestamp" => "%{@timestamp}"
        "platform" => "other"
        "level" => "error"
        "logger" => "%{type}"
        "message" => "%{message}"
        "environment" => "production"
      }
    }
  }

  # Test errors -> Bugsink test infrastructure project (6)
  if "error" in [tags] and "infra" in [tags] and "test" in [tags] {
    http {
      url => "https://bugsink.projectium.com/api/6/store/"
      http_method => "post"
      format => "json"
      headers => {
        "X-Sentry-Auth" => "Sentry sentry_version=7, sentry_client=logstash/1.0, sentry_key=25020dd6c2b74ad78463ec90e90fadab"
        "Content-Type" => "application/json"
      }
      mapping => {
        "event_id" => "%{[@metadata][event_id]}"
        "timestamp" => "%{@timestamp}"
        "platform" => "other"
        "level" => "error"
        "logger" => "%{type}"
        "message" => "%{message}"
        "environment" => "test"
      }
    }
  }

  # PM2 worker operational logs (non-errors) -> file
  if [type] == "pm2_stdout" and "error" not in [tags] {
    file {
      path => "/var/log/logstash/pm2-workers-%{+YYYY-MM-dd}.log"
      codec => json_lines
    }
  }

  # Redis INFO logs (operational events) -> file
  if "redis_operational" in [tags] {
    file {
      path => "/var/log/logstash/redis-operational-%{+YYYY-MM-dd}.log"
      codec => json_lines
    }
  }

  # NGINX access logs (all requests) -> file
  if "access_log" in [tags] {
    file {
      path => "/var/log/logstash/nginx-access-%{+YYYY-MM-dd}.log"
      codec => json_lines
    }
  }
}
```

**Setup Instructions:**

1. Create log output directory:

   ```bash
   sudo mkdir -p /var/log/logstash
   sudo chown logstash:logstash /var/log/logstash
   ```

2. Configure logrotate for Logstash file outputs:

   ```bash
   sudo tee /etc/logrotate.d/logstash <<EOF
   /var/log/logstash/*.log {
       daily
       rotate 30
       compress
       delaycompress
       missingok
       notifempty
       create 0644 logstash logstash
   }
   EOF
   ```

3. Verify Logstash can read the PM2, Redis, and NGINX logs:

   ```bash
   # Add logstash to required groups
   sudo usermod -a -G postgres logstash
   sudo usermod -a -G adm logstash

   # Test permissions
   sudo -u logstash cat /home/gitea-runner/.pm2/logs/flyer-crawler-worker-*.log | head -5
   sudo -u logstash cat /var/log/redis/redis-server.log | head -5
   sudo -u logstash cat /var/log/nginx/access.log | head -5
   ```

4. Restart Logstash:

   ```bash
   sudo systemctl restart logstash
   ```

**Verification:**

```bash
# Check Logstash is processing new log sources
curl -s http://localhost:9600/_node/stats/pipelines?pretty | jq '.pipelines.main.events'

# Check file outputs (each tail -f blocks, so run them in separate terminals)
ls -lh /var/log/logstash/
tail -f /var/log/logstash/pm2-workers-$(date +%Y-%m-%d).log
tail -f /var/log/logstash/redis-operational-$(date +%Y-%m-%d).log
tail -f /var/log/logstash/nginx-access-$(date +%Y-%m-%d).log
```

### Troubleshooting

| Issue                          | Solution                                                                                            |
| ------------------------------ | --------------------------------------------------------------------------------------------------- |
| No logs appearing in Bugsink   | Check Logstash status: `sudo journalctl -u logstash -f`                                             |
| Permission denied errors       | Verify logstash is in postgres group: `groups logstash`                                             |
| Grok parse failures            | Check Logstash stats: `curl -s http://localhost:9600/_node/stats/pipelines?pretty \| grep failures` |
| Wrong Bugsink project          | Verify database name detection in filter (flyer-crawler-prod vs flyer-crawler-test)                 |
| PostgreSQL logs not created    | Check `logging_collector = on` and restart PostgreSQL                                               |
| Events not formatted correctly | Check mapping in output section matches Sentry event schema                                         |
| Test config before restarting  | Run: `/usr/share/logstash/bin/logstash --config.test_and_exit -f /etc/logstash/conf.d/bugsink.conf` |

### Maintenance Commands

| Task                          | Command                                                                                        |
| ----------------------------- | ---------------------------------------------------------------------------------------------- |
| View Logstash status          | `sudo systemctl status logstash`                                                               |
| View Logstash logs            | `sudo journalctl -u logstash -f`                                                               |
| View PostgreSQL logs          | `tail -f /var/log/postgresql/postgresql-$(date +%Y-%m-%d).log`                                 |
| Test Logstash config          | `/usr/share/logstash/bin/logstash --config.test_and_exit -f /etc/logstash/conf.d/bugsink.conf` |
| Restart Logstash              | `sudo systemctl restart logstash`                                                              |
| Check Logstash pipeline stats | `curl -s http://localhost:9600/_node/stats/pipelines?pretty`                                   |
| Clear sincedb (re-read logs)  | `sudo rm /var/lib/logstash/sincedb_postgres && sudo systemctl restart logstash`                |

---

## SSL/TLS with Let's Encrypt

### Install Certbot

460
docs/LOGSTASH-TROUBLESHOOTING.md
Normal file
@@ -0,0 +1,460 @@
# Logstash Troubleshooting Runbook

This runbook provides step-by-step diagnostics and solutions for common Logstash issues in the PostgreSQL observability pipeline (ADR-050).

## Quick Reference

| Symptom                  | Most Likely Cause            | Quick Check                           |
| ------------------------ | ---------------------------- | ------------------------------------- |
| No errors in Bugsink     | Logstash not running         | `systemctl status logstash`           |
| Events not processed     | Grok pattern mismatch        | Check filter failures in stats        |
| Wrong Bugsink project    | Environment detection failed | Verify `pg_database` field extraction |
| 403 authentication error | Missing/wrong DSN key        | Check `X-Sentry-Auth` header          |
| 500 error from Bugsink   | Invalid event format         | Verify `event_id` and required fields |

---

## Diagnostic Steps

### 1. Verify Logstash is Running

```bash
# Check service status
systemctl status logstash

# If stopped, start it
sudo systemctl start logstash

# View recent logs
journalctl -u logstash -n 50 --no-pager
```

**Expected output:**

- Status: `active (running)`
- No error messages in recent logs

---

### 2. Check Configuration Syntax

```bash
# Test configuration file
/usr/share/logstash/bin/logstash --config.test_and_exit -f /etc/logstash/conf.d/bugsink.conf
```

**Expected output:**

```
Configuration OK
```

**If syntax errors:**

1. Review the error message for the line number
2. Check for missing braces, quotes, or commas
3. Verify plugin names are correct (e.g., `json`, `grok`, `uuid`, `http`)

---

### 3. Verify PostgreSQL Logs Are Being Read

```bash
# Check if log file exists and has content
ls -lh /var/log/postgresql/postgresql-$(date +%Y-%m-%d).log

# Check Logstash can read the file
sudo -u logstash cat /var/log/postgresql/postgresql-$(date +%Y-%m-%d).log | head -10
```

**Expected output:**

- Log file exists and is not empty
- Logstash user can read the file without permission errors

**If permission denied:**

```bash
# Check Logstash is in postgres group
groups logstash

# Should show: logstash : logstash adm postgres

# If not, add to group
sudo usermod -a -G postgres logstash
sudo systemctl restart logstash
```

---

### 4. Check Logstash Pipeline Stats

```bash
# Get pipeline statistics
curl -XGET 'localhost:9600/_node/stats/pipelines?pretty' | jq '.pipelines.main.plugins.filters'
```

**Key metrics to check:**

1. **Grok filter events:**
   - `"events.in"` - Total events received
   - `"events.out"` - Events successfully parsed
   - `"failures"` - Events that failed to parse

   **If failures > 0:** Grok pattern doesn't match log format. Check PostgreSQL log format.

2. **JSON filter events:**
   - `"events.in"` - Events received by JSON parser
   - `"events.out"` - Successfully parsed JSON

   **If events.in = 0:** Regex check `pg_message =~ /^\{/` is not matching. Verify fn_log() output format.

3. **UUID filter events:**
   - Should match number of errors being forwarded
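
A raw failure count is easier to judge as a share of the events the filter received. The following is a minimal sketch, assuming the two numbers have been read by hand from the stats output above; `grok_failure_pct` is a hypothetical helper name and uses integer arithmetic only, so the result is rounded down.

```bash
# Hypothetical helper: percentage of events a grok filter failed to parse.
# Arguments: the "failures" count and the "events.in" count from the stats output.
grok_failure_pct() {
  failures="$1"
  events_in="$2"
  # Avoid dividing by zero when the filter has not seen any events yet.
  if [ "$events_in" -eq 0 ]; then
    echo 0
    return
  fi
  echo $(( failures * 100 / events_in ))
}
```

A value near 100 suggests the pattern matches almost nothing, which usually points at `log_line_prefix` rather than at individual malformed lines.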

---

### 5. Test Grok Pattern Manually

```bash
# Get a sample log line
tail -1 /var/log/postgresql/postgresql-$(date +%Y-%m-%d).log

# Example expected format:
# 2026-01-20 10:30:00 +05 [12345] flyer_crawler_prod@flyer-crawler-prod WARNING: {"level":"WARNING","source":"postgresql",...}
```

**Pattern breakdown:**

```
%{TIMESTAMP_ISO8601:pg_timestamp}     # 2026-01-20 10:30:00
[+-]%{INT:pg_timezone}                # +05
\[%{POSINT:pg_pid}\]                  # [12345]
%{DATA:pg_user}@%{DATA:pg_database}   # flyer_crawler_prod@flyer-crawler-prod
%{WORD:pg_level}:                     # WARNING:
%{GREEDYDATA:pg_message}              # (rest of line)
```

**If pattern doesn't match:**

1. Check PostgreSQL `log_line_prefix` setting in `/etc/postgresql/14/main/conf.d/observability.conf`
2. Should be: `log_line_prefix = '%t [%p] %u@%d '`
3. Restart PostgreSQL if changed: `sudo systemctl restart postgresql`
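
A sample line can also be checked locally before involving Logstash. The sketch below approximates the grok pattern with an extended regular expression for `grep -E`. It is an approximation written for this runbook, not the exact grok semantics, and `PG_LINE_RE` and `pg_line_matches` are hypothetical names.

```bash
# Rough ERE approximation of the grok pattern broken down above:
# timestamp, signed numeric timezone, [pid], user@database, LEVEL:, message.
PG_LINE_RE='^[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2} [+-][0-9]+ \[[0-9]+\] [^@ ]+@[^ ]+ [A-Za-z0-9_]+: .*$'

# Returns 0 when the given line has the expected shape, non-zero otherwise.
pg_line_matches() {
  printf '%s\n' "$1" | grep -Eq "$PG_LINE_RE"
}
```

For example, `pg_line_matches "$(tail -1 /var/log/postgresql/postgresql-$(date +%Y-%m-%d).log)" && echo match`. A line whose timezone is printed as an abbreviation such as `UTC` fails this check, and it would likewise fail the `[+-]%{INT:pg_timezone}` part of the grok pattern, so the timezone field is worth inspecting first when every line fails.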

---

### 6. Verify Environment Detection

```bash
# Check recent PostgreSQL logs for database field
tail -20 /var/log/postgresql/postgresql-$(date +%Y-%m-%d).log | grep -E "flyer-crawler-(prod|test)"
```

**Expected:**

- Production database: `flyer_crawler_prod@flyer-crawler-prod`
- Test database: `flyer_crawler_test@flyer-crawler-test`

**If database name doesn't match:**

- Check database connection string in application
- Verify `DB_DATABASE_PROD` and `DB_DATABASE_TEST` Gitea secrets

---

### 7. Test Bugsink API Connection

```bash
# Test production endpoint
curl -X POST https://bugsink.projectium.com/api/1/store/ \
  -H "X-Sentry-Auth: Sentry sentry_version=7, sentry_client=test/1.0, sentry_key=911aef02b9a548fa8fabb8a3c81abfe5" \
  -H "Content-Type: application/json" \
  -d '{
    "event_id": "12345678901234567890123456789012",
    "timestamp": "2026-01-20T10:30:00Z",
    "platform": "other",
    "level": "error",
    "logger": "test",
    "message": "Test error from troubleshooting"
  }'
```

**Expected response:**

- HTTP 200 OK
- Response body: `{"id": "..."}`

**If 403 Forbidden:**

- DSN key is wrong in `/etc/logstash/conf.d/bugsink.conf`
- Get correct key from Bugsink UI: Settings → Projects → DSN

**If 500 Internal Server Error:**

- Missing required fields (event_id, timestamp, level)
- Check `mapping` section in Logstash config
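
Repeated tests are easier when each request carries a fresh `event_id` and the current time instead of the fixed values in the example above. The following is a minimal sketch under stated assumptions: `new_event_id` and `build_test_event` are hypothetical helper names, the ID is 16 random bytes printed as 32 hex characters, and the message is inserted without JSON escaping, so it must not contain double quotes or backslashes.

```bash
# Generate a 32-character lowercase hex event_id from 16 random bytes.
new_event_id() {
  od -An -N16 -tx1 /dev/urandom | tr -d ' \n'
}

# Build a minimal event body with the same fields as the curl example above.
# Assumption: the message contains no double quotes or backslashes.
build_test_event() {
  printf '{"event_id":"%s","timestamp":"%s","platform":"other","level":"error","logger":"test","message":"%s"}\n' \
    "$(new_event_id)" "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$1"
}
```

The output of `build_test_event "Test error from troubleshooting"` can be piped to the same `curl` command with `-d @-` in place of the inline body.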

---

### 8. Monitor Logstash Output in Real-Time

```bash
# Watch Logstash processing logs
journalctl -u logstash -f
```

**What to look for:**

- `"response code => 200"` - Successful forwarding to Bugsink
- `"response code => 403"` - Authentication failure
- `"response code => 500"` - Invalid event format
- Grok parse failures

---

## Common Issues and Solutions

### Issue 1: Grok Pattern Parse Failures

**Symptoms:**

- Logstash stats show increasing `"failures"` count
- No events reaching Bugsink

**Diagnosis:**

```bash
curl -XGET 'localhost:9600/_node/stats/pipelines?pretty' | jq '.pipelines.main.plugins.filters[] | select(.name == "grok") | .failures'
```

**Solution:**

1. Check PostgreSQL log format matches expected pattern
2. Verify `log_line_prefix` in PostgreSQL config
3. Test with sample log line using Grok Debugger (Kibana Dev Tools)

---

### Issue 2: JSON Filter Not Parsing fn_log() Output

**Symptoms:**

- Grok parses successfully but JSON filter shows 0 events
- `[fn_log]` fields missing in Logstash output

**Diagnosis:**

```bash
# Check if pg_message field contains JSON
tail -20 /var/log/postgresql/postgresql-$(date +%Y-%m-%d).log | grep "WARNING:" | grep "{"
```

**Solution:**

1. Verify `fn_log()` function exists in database:

   ```sql
   \df fn_log
   ```

2. Test `fn_log()` output format:

   ```sql
   SELECT fn_log('WARNING', 'test', 'Test message', '{"key":"value"}'::jsonb);
   ```

3. Check logs show JSON output starting with `{`

---

### Issue 3: Events Going to Wrong Bugsink Project

**Symptoms:**

- Production errors appear in test project (or vice versa)

**Diagnosis:**

```bash
# Check database name detection in recent logs
tail -50 /var/log/postgresql/postgresql-$(date +%Y-%m-%d).log | grep -E "(flyer-crawler-prod|flyer-crawler-test)"
```

**Solution:**

1. Verify database names in filter section match actual database names
2. Check `pg_database` field is correctly extracted by grok pattern:

   ```conf
   # Temporarily add this inside the output { } block of the Logstash config
   stdout { codec => rubydebug { metadata => true } }
   ```

3. Verify environment tagging in filter:
   - `pg_database == "flyer-crawler-prod"` → adds "production" tag → routes to project 1
   - `pg_database == "flyer-crawler-test"` → adds "test" tag → routes to project 3
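
The routing rule above is an exact string comparison on the database name, which is easy to get wrong when the database user and the database have similar names. The sketch below restates the rule as a small shell function for quick checks against names seen in the log. `bugsink_project_for_db` is a hypothetical helper, not something the pipeline provides.

```bash
# Hypothetical helper mirroring the routing described above.
# Prints the Bugsink project ID for a database name and returns 0,
# or prints nothing and returns 1 when no routing rule matches.
bugsink_project_for_db() {
  case "$1" in
    flyer-crawler-prod) echo 1 ;;
    flyer-crawler-test) echo 3 ;;
    *) return 1 ;;
  esac
}
```

Note that `flyer_crawler_prod` (underscores, the user name in the example log lines) matches neither rule. An event whose `pg_database` field holds that value would receive no environment tag and would not be forwarded by either output.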

---

### Issue 4: 403 Authentication Errors from Bugsink

**Symptoms:**

- Logstash logs show `response code => 403`
- Events not appearing in Bugsink

**Diagnosis:**

```bash
# Check Logstash output logs for authentication errors
journalctl -u logstash -n 100 | grep "403"
```

**Solution:**

1. Verify DSN key in `/etc/logstash/conf.d/bugsink.conf` matches Bugsink project
2. Get correct DSN from Bugsink UI:
   - Navigate to Settings → Projects → Click project
   - Copy "DSN" value
   - Extract key: `http://KEY@host/PROJECT_ID` → use KEY
3. Update `X-Sentry-Auth` header in Logstash config:

   ```conf
   "X-Sentry-Auth" => "Sentry sentry_version=7, sentry_client=logstash/1.0, sentry_key=YOUR_KEY_HERE"
   ```

4. Restart Logstash: `sudo systemctl restart logstash`
|
||||
|
||||
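
The key extraction in step 2 can be sketched in shell. This assumes the DSN has the `scheme://KEY@host/PROJECT_ID` shape shown above; `dsn_key` is a hypothetical helper and the DSN in the example is a placeholder, not a real key.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of Logstash or Bugsink): print the KEY part
# of a DSN shaped like scheme://KEY@host/PROJECT_ID.
dsn_key() {
  local dsn="$1"
  local rest="${dsn#*://}"   # drop the scheme, e.g. "http://"
  # A DSN without "@" carries no key; report failure instead of guessing.
  case "$rest" in
    *@*) printf '%s\n' "${rest%%@*}" ;;
    *)   return 1 ;;
  esac
}

# Example with a placeholder DSN (not a real key):
dsn_key 'http://0123456789abcdef@bugsink.example.com/1'
```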
---

### Issue 5: 500 Errors from Bugsink

**Symptoms:**

- Logstash logs show `response code => 500`
- Bugsink logs show validation errors

**Diagnosis:**

```bash
# Check Bugsink logs for details
docker logs bugsink-web 2>&1 | tail -50
```

**Common causes:**

1. Missing `event_id` field
2. Invalid timestamp format
3. Missing required Sentry fields

**Solution:**

1. Verify `uuid` filter is generating `event_id`:
   ```conf
   uuid {
     target => "[@metadata][event_id]"
     overwrite => true
   }
   ```
2. Check `mapping` section includes all required fields:
   - `event_id` (UUID)
   - `timestamp` (ISO 8601)
   - `platform` (string)
   - `level` (error/warning/info)
   - `logger` (string)
   - `message` (string)

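
If you have a copy of an event payload (for example from the temporary `rubydebug` output), a quick textual check can list which of the fields above are absent. This is a rough sketch: `missing_fields` is a hypothetical helper that only looks for `"field":` in the text and does not parse JSON, so treat its result as a hint rather than validation.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print each required field name that does not
# appear as a key in the given JSON text. Crude textual match only.
missing_fields() {
  local payload="$1" field
  for field in event_id timestamp platform level logger message; do
    # Look for "field" followed by optional whitespace and a colon
    if ! printf '%s' "$payload" | grep -q "\"$field\"[[:space:]]*:"; then
      printf '%s\n' "$field"
    fi
  done
}

# Example with an illustrative payload that lacks "logger":
missing_fields '{"event_id":"x","timestamp":"t","platform":"other","level":"error","message":"m"}'
```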
---

### Issue 6: High Memory Usage by Logstash

**Symptoms:**

- Server running out of memory
- Logstash OOM killed

**Diagnosis:**

```bash
# Check Logstash memory usage
ps aux | grep logstash
systemctl status logstash
```

**Solution:**

1. Limit Logstash heap size in `/etc/logstash/jvm.options`:
   ```
   -Xms1g
   -Xmx1g
   ```
2. Restart Logstash: `systemctl restart logstash`
3. Monitor with: `top -p "$(pgrep -d, -f logstash)"` (the `-d,` joins multiple PIDs with commas, which is the list format `top -p` expects)

---

### Issue 7: Log File Rotation Issues

**Symptoms:**

- Logstash stops processing after log file rotates
- Sincedb file pointing to old inode

**Diagnosis:**

```bash
# Check sincedb file
cat /var/lib/logstash/sincedb_postgres

# Check current log file inode
ls -li /var/log/postgresql/postgresql-$(date +%Y-%m-%d).log
```

**Solution:**

1. Logstash should automatically detect rotation
2. If stuck, delete sincedb file (will reprocess recent logs):
   ```bash
   systemctl stop logstash
   rm /var/lib/logstash/sincedb_postgres
   systemctl start logstash
   ```

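
The two diagnosis commands can be combined into one check. This sketch assumes each sincedb line begins with the tracked file's inode number followed by a space; `sincedb_tracks` is a hypothetical helper, not a Logstash command.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed (exit 0) when the sincedb file has an entry
# for the inode of the given log file, fail (exit 1) otherwise.
sincedb_tracks() {
  local log_file="$1" sincedb="$2" inode
  # ls -i prints the inode number first; awk keeps just that column
  inode=$(ls -i -- "$log_file" 2>/dev/null | awk '{print $1}')
  # No inode means the log file could not be read
  [ -n "$inode" ] || return 2
  # Assumption: sincedb lines start with the inode number and a space
  grep -q "^${inode} " "$sincedb"
}

# Example call with the paths used in this runbook:
#   sincedb_tracks "/var/log/postgresql/postgresql-$(date +%Y-%m-%d).log" \
#     /var/lib/logstash/sincedb_postgres && echo "tracked" || echo "not tracked"
```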
---

## Verification Checklist

After making any changes, verify the pipeline is working:

- [ ] Logstash is running: `systemctl status logstash`
- [ ] Configuration is valid: `/usr/share/logstash/bin/logstash --config.test_and_exit -f /etc/logstash/conf.d/bugsink.conf`
- [ ] No grok failures: `curl -s 'localhost:9600/_node/stats/pipelines?pretty' | jq '.pipelines.main.plugins.filters[] | select(.name == "grok") | .failures'`
- [ ] Events being processed: `curl -s 'localhost:9600/_node/stats/pipelines?pretty' | jq '.pipelines.main.events'`
- [ ] Test error appears in Bugsink: Trigger a database function error and check Bugsink UI

---

## Test Database Function Error

To generate a test error for verification:

```bash
# Connect to the production database and trigger an error (achievement not found)
sudo -u postgres psql -d flyer-crawler-prod \
  -c "SELECT award_achievement('00000000-0000-0000-0000-000000000001'::uuid, 'Nonexistent Badge');"
```

**Expected flow:**

1. PostgreSQL logs the error to `/var/log/postgresql/postgresql-YYYY-MM-DD.log`
2. Logstash reads and parses the log (within ~30 seconds)
3. Error appears in Bugsink project 1 (production)

**If error doesn't appear:**

- Check each diagnostic step above
- Review Logstash logs: `journalctl -u logstash -f`

---

## Related Documentation

- **Setup Guide**: [docs/BARE-METAL-SETUP.md](BARE-METAL-SETUP.md) - PostgreSQL Function Observability section
- **Architecture**: [docs/adr/0050-postgresql-function-observability.md](adr/0050-postgresql-function-observability.md)
- **Configuration Reference**: [CLAUDE.md](../CLAUDE.md) - Logstash Configuration section
- **Bugsink MCP Server**: [CLAUDE.md](../CLAUDE.md) - Sentry/Bugsink MCP Server Setup section