Logstash Troubleshooting Runbook

This runbook provides step-by-step diagnostics and solutions for common Logstash issues in the PostgreSQL observability pipeline (ADR-050).

Quick Reference

| Symptom                   | Most Likely Cause            | Quick Check                           |
| ------------------------- | ---------------------------- | ------------------------------------- |
| No errors in Bugsink      | Logstash not running         | systemctl status logstash             |
| Events not processed      | Grok pattern mismatch        | Check filter failures in stats        |
| Wrong Bugsink project     | Environment detection failed | Verify pg_database field extraction   |
| 403 authentication error  | Missing/wrong DSN key        | Check X-Sentry-Auth header            |
| 500 error from Bugsink    | Invalid event format         | Verify event_id and required fields   |
| varchar(7) constraint     | Unresolved %{sentry_level}   | Add Ruby filter for level validation  |

Diagnostic Steps

1. Verify Logstash is Running

# Check service status
systemctl status logstash

# If stopped, start it
systemctl start logstash

# View recent logs
journalctl -u logstash -n 50 --no-pager

Expected output:

  • Status: active (running)
  • No error messages in recent logs

2. Check Configuration Syntax

# Test configuration file
/usr/share/logstash/bin/logstash --config.test_and_exit -f /etc/logstash/conf.d/bugsink.conf

Expected output:

Configuration OK

If syntax errors:

  1. Review error message for line number
  2. Check for missing braces, quotes, or commas
  3. Verify plugin names are correct (e.g., json, grok, uuid, http)

3. Verify PostgreSQL Logs Are Being Read

# Check if log file exists and has content
ls -lh /var/log/postgresql/postgresql-$(date +%Y-%m-%d).log

# Check Logstash can read the file
sudo -u logstash cat /var/log/postgresql/postgresql-$(date +%Y-%m-%d).log | head -10

Expected output:

  • Log file exists and is not empty
  • Logstash user can read the file without permission errors

If permission denied:

# Check Logstash is in postgres group
groups logstash

# Should show: logstash : logstash adm postgres

# If not, add to group
usermod -a -G postgres logstash
systemctl restart logstash

4. Check Logstash Pipeline Stats

# Get pipeline statistics
curl -XGET 'localhost:9600/_node/stats/pipelines?pretty' | jq '.pipelines.main.plugins.filters'

Key metrics to check:

  1. Grok filter events:

    • "events.in" - Total events received
    • "events.out" - Events successfully parsed
    • "failures" - Events that failed to parse

    If failures > 0: the grok pattern doesn't match the log line format. Check the PostgreSQL log_line_prefix setting (see step 5).

  2. JSON filter events:

    • "events.in" - Events received by JSON parser
    • "events.out" - Successfully parsed JSON

    If events.in = 0: Regex check pg_message =~ /^\{/ is not matching. Verify fn_log() output format.

  3. UUID filter events:

    • Should match number of errors being forwarded
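
To pull all three filter counters in a single call, a one-liner like the following can help (a sketch; it assumes the default main pipeline and the monitoring API on localhost:9600, as used above):

# Summarize events in/out and failures for every filter in the main pipeline
curl -s 'localhost:9600/_node/stats/pipelines?pretty' | jq '
  .pipelines.main.plugins.filters[]
  | {name, events_in: .events["in"], events_out: .events.out, failures: (.failures // 0)}'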

5. Test Grok Pattern Manually

# Get a sample log line
tail -1 /var/log/postgresql/postgresql-$(date +%Y-%m-%d).log

# Example expected format:
# 2026-01-20 10:30:00 +05 [12345] flyer_crawler_prod@flyer-crawler-prod WARNING:  {"level":"WARNING","source":"postgresql",...}

Pattern breakdown:

%{TIMESTAMP_ISO8601:pg_timestamp}   # 2026-01-20 10:30:00
[+-]%{INT:pg_timezone}               # +05
\[%{POSINT:pg_pid}\]                 # [12345]
%{DATA:pg_user}@%{DATA:pg_database}  # flyer_crawler_prod@flyer-crawler-prod
%{WORD:pg_level}:                    # WARNING:
%{GREEDYDATA:pg_message}             # (rest of line)

If pattern doesn't match:

  1. Check PostgreSQL log_line_prefix setting in /etc/postgresql/14/main/conf.d/observability.conf
  2. Should be: log_line_prefix = '%t [%p] %u@%d '
  3. Restart PostgreSQL if changed: systemctl restart postgresql
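
To exercise the pattern end to end without touching the live pipeline, a throwaway Logstash instance can be fed the sample line from above. This is a sketch: the combined pattern is assembled from the breakdown in this step, so the whitespace between tokens may need adjusting to match your bugsink.conf, and --path.data points at a scratch directory so the test does not collide with the running service.

# Parse one sample line through a temporary stdin -> grok -> stdout pipeline
echo '2026-01-20 10:30:00 +05 [12345] flyer_crawler_prod@flyer-crawler-prod WARNING:  {"level":"WARNING","source":"postgresql"}' | \
  /usr/share/logstash/bin/logstash --path.data /tmp/logstash-grok-test -e '
    input { stdin {} }
    filter {
      grok {
        match => { "message" => "%{TIMESTAMP_ISO8601:pg_timestamp} [+-]%{INT:pg_timezone} \[%{POSINT:pg_pid}\] %{DATA:pg_user}@%{DATA:pg_database} %{WORD:pg_level}:\s+%{GREEDYDATA:pg_message}" }
      }
    }
    output { stdout { codec => rubydebug } }'

If the parse succeeds, the rubydebug output shows the extracted pg_* fields; a failed parse is tagged _grokparsefailure.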

6. Verify Environment Detection

# Check recent PostgreSQL logs for database field
tail -20 /var/log/postgresql/postgresql-$(date +%Y-%m-%d).log | grep -E "flyer-crawler-(prod|test)"

Expected:

  • Production database: flyer_crawler_prod@flyer-crawler-prod
  • Test database: flyer_crawler_test@flyer-crawler-test

If database name doesn't match:

  • Check database connection string in application
  • Verify DB_DATABASE_PROD and DB_DATABASE_TEST Gitea secrets
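
To see which database names actually exist on the host (and therefore which values the grok-extracted pg_database field can take), the catalog can be queried directly (a sketch; assumes local access as the postgres superuser, as used elsewhere in this runbook):

# List databases and confirm the prod/test names the filter expects
sudo -u postgres psql -Atc "SELECT datname FROM pg_database WHERE datname LIKE 'flyer-crawler-%';"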

7. Test Bugsink API Connection

# Test production endpoint
curl -X POST https://bugsink.projectium.com/api/1/store/ \
  -H "X-Sentry-Auth: Sentry sentry_version=7, sentry_client=test/1.0, sentry_key=911aef02b9a548fa8fabb8a3c81abfe5" \
  -H "Content-Type: application/json" \
  -d '{
    "event_id": "12345678901234567890123456789012",
    "timestamp": "2026-01-20T10:30:00Z",
    "platform": "other",
    "level": "error",
    "logger": "test",
    "message": "Test error from troubleshooting"
  }'

Expected response:

  • HTTP 200 OK
  • Response body: {"id": "..."}

If 403 Forbidden:

  • DSN key is wrong in /etc/logstash/conf.d/bugsink.conf
  • Get correct key from Bugsink UI: Settings → Projects → DSN

If 500 Internal Server Error:

  • Missing required fields (event_id, timestamp, level)
  • Check mapping section in Logstash config

8. Monitor Logstash Output in Real-Time

# Watch Logstash processing logs
journalctl -u logstash -f

What to look for:

  • "response code => 200" - Successful forwarding to Bugsink
  • "response code => 403" - Authentication failure
  • "response code => 500" - Invalid event format
  • Grok parse failures
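
A filtered view makes those lines easier to spot (a sketch; _grokparsefailure is the tag the grok filter adds to events it could not parse):

# Follow only the lines that matter: Bugsink response codes and grok failures
journalctl -u logstash -f | grep -E 'response code|_grokparsefailure'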

Common Issues and Solutions

Issue 1: Grok Pattern Parse Failures

Symptoms:

  • Logstash stats show increasing "failures" count
  • No events reaching Bugsink

Diagnosis:

curl -XGET 'localhost:9600/_node/stats/pipelines?pretty' | jq '.pipelines.main.plugins.filters[] | select(.name == "grok") | .failures'

Solution:

  1. Check PostgreSQL log format matches expected pattern
  2. Verify log_line_prefix in PostgreSQL config
  3. Test with sample log line using Grok Debugger (Kibana Dev Tools)

Issue 2: JSON Filter Not Parsing fn_log() Output

Symptoms:

  • Grok parses successfully but JSON filter shows 0 events
  • [fn_log] fields missing in Logstash output

Diagnosis:

# Check if pg_message field contains JSON
tail -20 /var/log/postgresql/postgresql-$(date +%Y-%m-%d).log | grep "WARNING:" | grep "{"

Solution:

  1. Verify fn_log() function exists in database:
    \df fn_log
    
  2. Test fn_log() output format:
    SELECT fn_log('WARNING', 'test', 'Test message', '{"key":"value"}'::jsonb);
    
  3. Check logs show JSON output starting with {
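
The two psql checks above can also be run non-interactively from the shell and the log tailed to confirm the JSON payload was written. This is a sketch: it assumes the fn_log() argument order shown above and that the message text appears in the logged JSON line.

# Call fn_log() from the shell and confirm the JSON line lands in the PostgreSQL log
sudo -u postgres psql -d flyer-crawler-prod \
  -c "SELECT fn_log('WARNING', 'test', 'Troubleshooting test message', '{\"key\":\"value\"}'::jsonb);"
tail -20 /var/log/postgresql/postgresql-$(date +%Y-%m-%d).log | grep 'Troubleshooting test message'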

Issue 3: Events Going to Wrong Bugsink Project

Symptoms:

  • Production errors appear in test project (or vice versa)

Diagnosis:

# Check database name detection in recent logs
tail -50 /var/log/postgresql/postgresql-$(date +%Y-%m-%d).log | grep -E "(flyer-crawler-prod|flyer-crawler-test)"

Solution:

  1. Verify database names in filter section match actual database names
  2. Check pg_database field is correctly extracted by grok pattern:
    # Enable debug output in Logstash config temporarily
    stdout { codec => rubydebug { metadata => true } }
    
  3. Verify environment tagging in filter:
    • pg_database == "flyer-crawler-prod" → adds "production" tag → routes to project 1
    • pg_database == "flyer-crawler-test" → adds "test" tag → routes to project 3
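
It can also help to confirm which project each output branch actually posts to (a sketch; assumes the output URLs follow the /api/<PROJECT_ID>/store/ form shown in step 7, so the project ID in each match shows where that branch routes):

# Show the Bugsink store endpoints configured in each output branch
grep -n 'store/' /etc/logstash/conf.d/bugsink.conf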

Issue 4: 403 Authentication Errors from Bugsink

Symptoms:

  • Logstash logs show response code => 403
  • Events not appearing in Bugsink

Diagnosis:

# Check Logstash output logs for authentication errors
journalctl -u logstash -n 100 | grep "403"

Solution:

  1. Verify DSN key in /etc/logstash/conf.d/bugsink.conf matches Bugsink project
  2. Get correct DSN from Bugsink UI:
    • Navigate to Settings → Projects → Click project
    • Copy "DSN" value
    • Extract key: http://KEY@host/PROJECT_ID → use KEY
  3. Update X-Sentry-Auth header in Logstash config:
    "X-Sentry-Auth" => "Sentry sentry_version=7, sentry_client=logstash/1.0, sentry_key=YOUR_KEY_HERE"
    
  4. Restart Logstash: systemctl restart logstash
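
To confirm which key actually ended up in the config after the update, the value can be shown in masked form (a sketch; assumes the key is lowercase hex like the example in step 7):

# Show the configured sentry_key, masking all but the first 6 characters
grep -o 'sentry_key=[a-f0-9]*' /etc/logstash/conf.d/bugsink.conf | sed 's/\(sentry_key=......\).*/\1.../'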

Issue 5: 500 Errors from Bugsink

Symptoms:

  • Logstash logs show response code => 500
  • Bugsink logs show validation errors

Diagnosis:

# Check Bugsink logs for details
docker logs bugsink-web 2>&1 | tail -50

Common causes:

  1. Missing event_id field
  2. Invalid timestamp format
  3. Missing required Sentry fields

Solution:

  1. Verify uuid filter is generating event_id:
    uuid {
      target => "[@metadata][event_id]"
      overwrite => true
    }
    
  2. Check mapping section includes all required fields:
    • event_id (UUID)
    • timestamp (ISO 8601)
    • platform (string)
    • level (error/warning/info)
    • logger (string)
    • message (string)

Issue 6: High Memory Usage by Logstash

Symptoms:

  • Server running out of memory
  • Logstash OOM killed

Diagnosis:

# Check Logstash memory usage
ps aux | grep logstash
systemctl status logstash

Solution:

  1. Limit Logstash heap size in /etc/logstash/jvm.options:
    -Xms1g
    -Xmx1g
    
  2. Restart Logstash: systemctl restart logstash
  3. Monitor with: top -p $(pgrep -f logstash)
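
After the restart, the monitoring API can confirm the new heap limit took effect (a sketch; assumes the API listens on the default localhost:9600, as in step 4):

# Confirm the heap limit Logstash is actually running with
curl -s 'localhost:9600/_node/stats/jvm?pretty' | jq '.jvm.mem | {heap_used_percent, heap_max_in_bytes}'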

Issue 7: Level Field Constraint Violation (varchar(7))

Symptoms:

  • Bugsink returns HTTP 500 errors
  • PostgreSQL errors: value too long for type character varying(7)
  • Events fail to insert with literal %{sentry_level} string (15 characters)

Root Cause:

When Logstash cannot determine the log level (no error patterns matched), the sentry_level field remains as the unresolved placeholder %{sentry_level}. Bugsink's PostgreSQL schema has a varchar(7) constraint on the level field.

Valid Sentry levels (all <= 7 characters): fatal, error, warning, info, debug

Diagnosis:

# Check for HTTP 500 responses in Logstash logs
podman exec flyer-crawler-dev cat /var/log/logstash/logstash.log | grep "500"

# Check Bugsink for constraint violation errors
# Via MCP:
mcp__localerrors__list_issues({ project_id: 1, status: 'unresolved' })

Solution:

Add a Ruby filter block in docker/logstash/bugsink.conf to validate and normalize the sentry_level field before sending to Bugsink:

# Add this AFTER all mutate filters that set sentry_level
# and BEFORE the output section

ruby {
    code => '
        level = event.get("sentry_level")
        # Check if level is invalid (nil, empty, contains placeholder, or too long)
        if level.nil? || level.to_s.empty? || level.to_s.include?("%{") || level.to_s.length > 7
            # Default to "error" for error-tagged events, "info" otherwise
            if event.get("tags")&.include?("error")
                event.set("sentry_level", "error")
            else
                event.set("sentry_level", "info")
            end
        else
            # Normalize to lowercase and validate
            normalized = level.to_s.downcase
            valid_levels = ["fatal", "error", "warning", "info", "debug"]
            unless valid_levels.include?(normalized)
                normalized = "error"
            end
            event.set("sentry_level", normalized)
        end
    '
}

Key validations performed:

  1. Checks for nil or empty values
  2. Detects unresolved placeholders (%{...})
  3. Enforces 7-character maximum length
  4. Normalizes to lowercase
  5. Validates against allowed Sentry levels
  6. Defaults to "error" for error-tagged events, "info" otherwise

Verification:

# Restart Logstash
podman exec flyer-crawler-dev systemctl restart logstash

# Generate a test log that triggers the filter
podman exec flyer-crawler-dev pm2 restart flyer-crawler-api-dev

# Check no new HTTP 500 errors
podman exec flyer-crawler-dev cat /var/log/logstash/logstash.log | tail -50 | grep -E "(500|error)"

Issue 8: Log File Rotation Issues

Symptoms:

  • Logstash stops processing after log file rotates
  • Sincedb file pointing to old inode

Diagnosis:

# Check sincedb file
cat /var/lib/logstash/sincedb_postgres

# Check current log file inode
ls -li /var/log/postgresql/postgresql-$(date +%Y-%m-%d).log
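
The first column of a sincedb entry is the tracked inode, so the two can be compared directly (a sketch; paths follow the ones used above):

# Compare the inode Logstash is tracking against the inode of today's log file
awk '{print "sincedb inode: " $1}' /var/lib/logstash/sincedb_postgres
stat -c 'current inode: %i' /var/log/postgresql/postgresql-$(date +%Y-%m-%d).log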

Solution:

  1. Logstash should automatically detect rotation
  2. If stuck, delete sincedb file (will reprocess recent logs):
    systemctl stop logstash
    rm /var/lib/logstash/sincedb_postgres
    systemctl start logstash
    

Verification Checklist

After making any changes, verify the pipeline is working:

  • Logstash is running: systemctl status logstash
  • Configuration is valid: /usr/share/logstash/bin/logstash --config.test_and_exit -f /etc/logstash/conf.d/bugsink.conf
  • No grok failures: curl localhost:9600/_node/stats/pipelines?pretty | jq '.pipelines.main.plugins.filters[] | select(.name == "grok") | .failures'
  • Events being processed: curl localhost:9600/_node/stats/pipelines?pretty | jq '.pipelines.main.events'
  • Test error appears in Bugsink: Trigger a database function error and check Bugsink UI
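
The checklist can be run as a single pass when convenient (a minimal sketch that strings together the commands above; set -e stops it at the first failing step):

#!/usr/bin/env bash
# One-pass health check: service, config syntax, grok failures, event counters
set -e
systemctl is-active logstash
/usr/share/logstash/bin/logstash --config.test_and_exit -f /etc/logstash/conf.d/bugsink.conf
curl -s 'localhost:9600/_node/stats/pipelines?pretty' | \
  jq '.pipelines.main.plugins.filters[] | select(.name == "grok") | (.failures // 0)'
curl -s 'localhost:9600/_node/stats/pipelines?pretty' | jq '.pipelines.main.events'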

Test Database Function Error

To generate a test error for verification:

# Connect to production database
sudo -u postgres psql -d flyer-crawler-prod

# Trigger an error (achievement not found)
SELECT award_achievement('00000000-0000-0000-0000-000000000001'::uuid, 'Nonexistent Badge');
\q

Expected flow:

  1. PostgreSQL logs the error to /var/log/postgresql/postgresql-YYYY-MM-DD.log
  2. Logstash reads and parses the log (within ~30 seconds)
  3. Error appears in Bugsink project 1 (production)
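
To watch this flow live while running the test query, the two ends of the pipeline can be tailed in separate terminals (a sketch; the grep narrows the journal to the lines called out in step 8):

# In one terminal: watch the error being written to the PostgreSQL log
tail -f /var/log/postgresql/postgresql-$(date +%Y-%m-%d).log
# In a second terminal: watch Logstash pick it up and post to Bugsink
journalctl -u logstash -f | grep -E 'response code|_grokparsefailure'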

If error doesn't appear:

  • Check each diagnostic step above
  • Review Logstash logs: journalctl -u logstash -f