# Logstash Troubleshooting Runbook

This runbook provides step-by-step diagnostics and solutions for common Logstash issues in the PostgreSQL observability pipeline (ADR-050).

## Quick Reference

| Symptom                   | Most Likely Cause             | Quick Check                            |
| ------------------------- | ----------------------------- | -------------------------------------- |
| No errors in Bugsink      | Logstash not running          | `systemctl status logstash`            |
| Events not processed      | Grok pattern mismatch         | Check filter failures in stats         |
| Wrong Bugsink project     | Environment detection failed  | Verify `pg_database` field extraction  |
| 403 authentication error  | Missing/wrong DSN key         | Check `X-Sentry-Auth` header           |
| 500 error from Bugsink    | Invalid event format          | Verify `event_id` and required fields  |
| varchar(7) constraint     | Unresolved `%{sentry_level}`  | Add Ruby filter for level validation   |

---

## Diagnostic Steps

### 1. Verify Logstash is Running

```bash
# Check service status
systemctl status logstash

# If stopped, start it
systemctl start logstash

# View recent logs
journalctl -u logstash -n 50 --no-pager
```

**Expected output:**

- Status: `active (running)`
- No error messages in recent logs

---

### 2. Check Configuration Syntax

```bash
# Test configuration file
/usr/share/logstash/bin/logstash --config.test_and_exit -f /etc/logstash/conf.d/bugsink.conf
```

**Expected output:**

```
Configuration OK
```

**If syntax errors:**

1. Review the error message for a line number
2. Check for missing braces, quotes, or commas
3. Verify plugin names are correct (e.g., `json`, `grok`, `uuid`, `http`)

---

### 3. Verify PostgreSQL Logs Are Being Read

```bash
# Check if log file exists and has content
ls -lh /var/log/postgresql/postgresql-$(date +%Y-%m-%d).log

# Check Logstash can read the file
sudo -u logstash cat /var/log/postgresql/postgresql-$(date +%Y-%m-%d).log | head -10
```

**Expected output:**

- Log file exists and is not empty
- Logstash user can read the file without permission errors

**If permission denied:**

```bash
# Check Logstash is in postgres group
groups logstash
# Should show: logstash : logstash adm postgres

# If not, add to group
usermod -a -G postgres logstash
systemctl restart logstash
```

---

### 4. Check Logstash Pipeline Stats

```bash
# Get pipeline statistics
curl -XGET 'localhost:9600/_node/stats/pipelines?pretty' | jq '.pipelines.main.plugins.filters'
```

**Key metrics to check:**

1. **Grok filter events:**
   - `"events.in"` - Total events received
   - `"events.out"` - Events successfully parsed
   - `"failures"` - Events that failed to parse

   **If failures > 0:** The grok pattern doesn't match the log format. Check the PostgreSQL log format.

2. **JSON filter events:**
   - `"events.in"` - Events received by JSON parser
   - `"events.out"` - Successfully parsed JSON

   **If events.in = 0:** The regex check `pg_message =~ /^\{/` is not matching. Verify the fn_log() output format.

3. **UUID filter events:**
   - Should match the number of errors being forwarded
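A single stats snapshot cannot tell an old mismatch from an ongoing one. A minimal sketch for polling the grok failure counter over time, assuming the default `main` pipeline and that `jq` is installed:

```bash
# Poll the grok failure counter every 5 seconds (Ctrl-C to stop).
# A steadily increasing value means log lines are failing to parse
# right now; a static non-zero value points at an earlier mismatch.
while true; do
  curl -s 'localhost:9600/_node/stats/pipelines' \
    | jq '.pipelines.main.plugins.filters[] | select(.name == "grok") | .failures'
  sleep 5
done
```

---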
### 5. Test Grok Pattern Manually

```bash
# Get a sample log line
tail -1 /var/log/postgresql/postgresql-$(date +%Y-%m-%d).log

# Example expected format:
# 2026-01-20 10:30:00 +05 [12345] flyer_crawler_prod@flyer-crawler-prod WARNING: {"level":"WARNING","source":"postgresql",...}
```

**Pattern breakdown:**

```
%{TIMESTAMP_ISO8601:pg_timestamp}    # 2026-01-20 10:30:00
[+-]%{INT:pg_timezone}               # +05
\[%{POSINT:pg_pid}\]                 # [12345]
%{DATA:pg_user}@%{DATA:pg_database}  # flyer_crawler_prod@flyer-crawler-prod
%{WORD:pg_level}:                    # WARNING:
%{GREEDYDATA:pg_message}             # (rest of line)
```

**If the pattern doesn't match:**

1. Check the PostgreSQL `log_line_prefix` setting in `/etc/postgresql/14/main/conf.d/observability.conf`
2. It should be: `log_line_prefix = '%t [%p] %u@%d '`
3. Restart PostgreSQL if changed: `systemctl restart postgresql`

---

### 6. Verify Environment Detection

```bash
# Check recent PostgreSQL logs for database field
tail -20 /var/log/postgresql/postgresql-$(date +%Y-%m-%d).log | grep -E "flyer-crawler-(prod|test)"
```

**Expected:**

- Production database: `flyer_crawler_prod@flyer-crawler-prod`
- Test database: `flyer_crawler_test@flyer-crawler-test`

**If the database name doesn't match:**

- Check the database connection string in the application
- Verify the `DB_DATABASE_PROD` and `DB_DATABASE_TEST` Gitea secrets

---

### 7. Test Bugsink API Connection

```bash
# Test production endpoint
curl -X POST https://bugsink.projectium.com/api/1/store/ \
  -H "X-Sentry-Auth: Sentry sentry_version=7, sentry_client=test/1.0, sentry_key=911aef02b9a548fa8fabb8a3c81abfe5" \
  -H "Content-Type: application/json" \
  -d '{
    "event_id": "12345678901234567890123456789012",
    "timestamp": "2026-01-20T10:30:00Z",
    "platform": "other",
    "level": "error",
    "logger": "test",
    "message": "Test error from troubleshooting"
  }'
```

**Expected response:**

- HTTP 200 OK
- Response body: `{"id": "..."}`

**If 403 Forbidden:**

- The DSN key is wrong in `/etc/logstash/conf.d/bugsink.conf`
- Get the correct key from the Bugsink UI: Settings → Projects → DSN

**If 500 Internal Server Error:**

- Missing required fields (event_id, timestamp, level)
- Check the `mapping` section in the Logstash config

---

### 8. Monitor Logstash Output in Real-Time

```bash
# Watch Logstash processing logs
journalctl -u logstash -f
```

**What to look for:**

- `"response code => 200"` - Successful forwarding to Bugsink
- `"response code => 403"` - Authentication failure
- `"response code => 500"` - Invalid event format
- Grok parse failures

---

## Common Issues and Solutions

### Issue 1: Grok Pattern Parse Failures

**Symptoms:**

- Logstash stats show an increasing `"failures"` count
- No events reaching Bugsink

**Diagnosis:**

```bash
curl -XGET 'localhost:9600/_node/stats/pipelines?pretty' | jq '.pipelines.main.plugins.filters[] | select(.name == "grok") | .failures'
```

**Solution:**

1. Check that the PostgreSQL log format matches the expected pattern
2. Verify `log_line_prefix` in the PostgreSQL config
3. Test with a sample log line using the Grok Debugger (Kibana Dev Tools)

---

### Issue 2: JSON Filter Not Parsing fn_log() Output

**Symptoms:**

- Grok parses successfully but the JSON filter shows 0 events
- `[fn_log]` fields missing in Logstash output

**Diagnosis:**

```bash
# Check if pg_message field contains JSON
tail -20 /var/log/postgresql/postgresql-$(date +%Y-%m-%d).log | grep "WARNING:" | grep "{"
```

**Solution:**

1. Verify the `fn_log()` function exists in the database:

   ```sql
   \df fn_log
   ```

2. Test the `fn_log()` output format:

   ```sql
   SELECT fn_log('WARNING', 'test', 'Test message', '{"key":"value"}'::jsonb);
   ```

3. Check that the logs show JSON output starting with `{`
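To rule out the payload itself before blaming the filter, you can validate the JSON directly from the shell. A minimal sketch, assuming the payload starts at the first `{` on the line (the same assumption the pipeline's `pg_message =~ /^\{/` check makes):

```bash
# Take the newest WARNING line, strip the log prefix up to the first
# "{", and let jq validate what remains. jq -e exits non-zero when the
# payload is malformed or empty -- the same events the json filter skips.
tail -100 /var/log/postgresql/postgresql-$(date +%Y-%m-%d).log \
  | grep "WARNING:" | tail -1 \
  | sed 's/^[^{]*//' \
  | jq -e . >/dev/null && echo "payload is valid JSON" || echo "payload is NOT valid JSON"
```

---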
### Issue 3: Events Going to Wrong Bugsink Project

**Symptoms:**

- Production errors appear in the test project (or vice versa)

**Diagnosis:**

```bash
# Check database name detection in recent logs
tail -50 /var/log/postgresql/postgresql-$(date +%Y-%m-%d).log | grep -E "(flyer-crawler-prod|flyer-crawler-test)"
```

**Solution:**

1. Verify the database names in the filter section match the actual database names
2. Check the `pg_database` field is correctly extracted by the grok pattern:

   ```conf
   # Enable debug output in Logstash config temporarily
   stdout { codec => rubydebug { metadata => true } }
   ```

3. Verify the environment tagging in the filter:
   - `pg_database == "flyer-crawler-prod"` → adds "production" tag → routes to project 1
   - `pg_database == "flyer-crawler-test"` → adds "test" tag → routes to project 3

---

### Issue 4: 403 Authentication Errors from Bugsink

**Symptoms:**

- Logstash logs show `response code => 403`
- Events not appearing in Bugsink

**Diagnosis:**

```bash
# Check Logstash output logs for authentication errors
journalctl -u logstash -n 100 | grep "403"
```

**Solution:**

1. Verify the DSN key in `/etc/logstash/conf.d/bugsink.conf` matches the Bugsink project
2. Get the correct DSN from the Bugsink UI:
   - Navigate to Settings → Projects → click the project
   - Copy the "DSN" value
   - Extract the key: `http://KEY@host/PROJECT_ID` → use KEY
3. Update the `X-Sentry-Auth` header in the Logstash config:

   ```conf
   "X-Sentry-Auth" => "Sentry sentry_version=7, sentry_client=logstash/1.0, sentry_key=YOUR_KEY_HERE"
   ```

4. Restart Logstash: `systemctl restart logstash`

---

### Issue 5: 500 Errors from Bugsink

**Symptoms:**

- Logstash logs show `response code => 500`
- Bugsink logs show validation errors

**Diagnosis:**

```bash
# Check Bugsink logs for details
docker logs bugsink-web 2>&1 | tail -50
```

**Common causes:**

1. Missing `event_id` field
2. Invalid timestamp format
3. Missing required Sentry fields

**Solution:**

1. Verify the `uuid` filter is generating `event_id`:

   ```conf
   uuid {
     target    => "[@metadata][event_id]"
     overwrite => true
   }
   ```

2. Check the `mapping` section includes all required fields:
   - `event_id` (UUID)
   - `timestamp` (ISO 8601)
   - `platform` (string)
   - `level` (error/warning/info)
   - `logger` (string)
   - `message` (string)

---

### Issue 6: High Memory Usage by Logstash

**Symptoms:**

- Server running out of memory
- Logstash OOM-killed

**Diagnosis:**

```bash
# Check Logstash memory usage
ps aux | grep logstash
systemctl status logstash
```

**Solution:**

1. Limit the Logstash heap size in `/etc/logstash/jvm.options`:

   ```
   -Xms1g
   -Xmx1g
   ```

2. Restart Logstash: `systemctl restart logstash`
3. Monitor with: `top -p $(pgrep -f logstash)`
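The monitoring API also reports heap usage directly, which is easier to read than `ps` output. A minimal check, assuming the API is on its default port 9600:

```bash
# Report heap usage as a percentage of the configured -Xmx ceiling.
# A value pinned above ~90% suggests the heap limit is too small or
# events are backing up in the pipeline.
curl -s 'localhost:9600/_node/stats/jvm' \
  | jq '{heap_used_percent: .jvm.mem.heap_used_percent,
         heap_max_mb: (.jvm.mem.heap_max_in_bytes / 1048576 | floor)}'
```

---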
### Issue 7: Level Field Constraint Violation (varchar(7))

**Symptoms:**

- Bugsink returns HTTP 500 errors
- PostgreSQL errors: `value too long for type character varying(7)`
- Events fail to insert with the literal `%{sentry_level}` string (15 characters)

**Root Cause:**

When Logstash cannot determine the log level (no error patterns matched), the `sentry_level` field remains the unresolved placeholder `%{sentry_level}`. Bugsink's PostgreSQL schema has a `varchar(7)` constraint on the level field. Valid Sentry levels (all <= 7 characters): `fatal`, `error`, `warning`, `info`, `debug`

**Diagnosis:**

```bash
# Check for HTTP 500 responses in Logstash logs
podman exec flyer-crawler-dev cat /var/log/logstash/logstash.log | grep "500"

# Check Bugsink for constraint violation errors
# Via MCP: mcp__localerrors__list_issues({ project_id: 1, status: 'unresolved' })
```

**Solution:**

Add a Ruby filter block in `docker/logstash/bugsink.conf` to validate and normalize the `sentry_level` field before sending to Bugsink:

```ruby
# Add this AFTER all mutate filters that set sentry_level
# and BEFORE the output section
ruby {
  code => '
    level = event.get("sentry_level")

    # Check if level is invalid (nil, empty, contains placeholder, or too long)
    if level.nil? || level.to_s.empty? || level.to_s.include?("%{") || level.to_s.length > 7
      # Default to "error" for error-tagged events, "info" otherwise
      if event.get("tags")&.include?("error")
        event.set("sentry_level", "error")
      else
        event.set("sentry_level", "info")
      end
    else
      # Normalize to lowercase and validate
      normalized = level.to_s.downcase
      valid_levels = ["fatal", "error", "warning", "info", "debug"]
      unless valid_levels.include?(normalized)
        normalized = "error"
      end
      event.set("sentry_level", normalized)
    end
  '
}
```

**Key validations performed:**

1. Checks for nil or empty values
2. Detects unresolved placeholders (`%{...}`)
3. Enforces the 7-character maximum length
4. Normalizes to lowercase
5. Validates against the allowed Sentry levels
6. Defaults to "error" for error-tagged events, "info" otherwise

**Verification:**

```bash
# Restart Logstash
podman exec flyer-crawler-dev systemctl restart logstash

# Generate a test log that triggers the filter
podman exec flyer-crawler-dev pm2 restart flyer-crawler-api-dev

# Check no new HTTP 500 errors
podman exec flyer-crawler-dev cat /var/log/logstash/logstash.log | tail -50 | grep -E "(500|error)"
```

---

### Issue 8: Log File Rotation Issues

**Symptoms:**

- Logstash stops processing after the log file rotates
- Sincedb file pointing to an old inode

**Diagnosis:**

```bash
# Check sincedb file
cat /var/lib/logstash/sincedb_postgres

# Check current log file inode
ls -li /var/log/postgresql/postgresql-$(date +%Y-%m-%d).log
```

**Solution:**

1. Logstash should detect rotation automatically
2. If stuck, delete the sincedb file (recent logs will be reprocessed):

   ```bash
   systemctl stop logstash
   rm /var/lib/logstash/sincedb_postgres
   systemctl start logstash
   ```

---

## Verification Checklist

After making any changes, verify the pipeline is working (a script bundling the first four checks follows this list):

- [ ] Logstash is running: `systemctl status logstash`
- [ ] Configuration is valid: `/usr/share/logstash/bin/logstash --config.test_and_exit -f /etc/logstash/conf.d/bugsink.conf`
- [ ] No grok failures: `curl 'localhost:9600/_node/stats/pipelines?pretty' | jq '.pipelines.main.plugins.filters[] | select(.name == "grok") | .failures'`
- [ ] Events being processed: `curl 'localhost:9600/_node/stats/pipelines?pretty' | jq '.pipelines.main.events'`
- [ ] Test error appears in Bugsink: trigger a database function error and check the Bugsink UI
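A minimal sketch of that script, under a hypothetical name (`verify-pipeline.sh`); the final checklist item still requires a manual look in the Bugsink UI:

```bash
#!/usr/bin/env bash
# verify-pipeline.sh (hypothetical name) -- run the first four checks in
# one pass. set -e aborts at the first failing check.
set -euo pipefail

echo "== 1. Logstash service =="
systemctl is-active logstash

echo "== 2. Configuration syntax =="
/usr/share/logstash/bin/logstash --config.test_and_exit -f /etc/logstash/conf.d/bugsink.conf

echo "== 3. Grok failures (should be 0 or static) =="
curl -s 'localhost:9600/_node/stats/pipelines' \
  | jq '.pipelines.main.plugins.filters[] | select(.name == "grok") | .failures'

echo "== 4. Pipeline event counts =="
curl -s 'localhost:9600/_node/stats/pipelines' \
  | jq '.pipelines.main.events'
```

---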
## Test Database Function Error

To generate a test error for verification:

```bash
# Connect to production database
sudo -u postgres psql -d flyer-crawler-prod

# Trigger an error (achievement not found)
SELECT award_achievement('00000000-0000-0000-0000-000000000001'::uuid, 'Nonexistent Badge');
\q
```

**Expected flow:**

1. PostgreSQL logs the error to `/var/log/postgresql/postgresql-YYYY-MM-DD.log`
2. Logstash reads and parses the log (within ~30 seconds)
3. Error appears in Bugsink project 1 (production)

**If the error doesn't appear:**

- Check each diagnostic step above
- Review the Logstash logs: `journalctl -u logstash -f`

---

## Related Documentation

- **Setup Guide**: [docs/BARE-METAL-SETUP.md](BARE-METAL-SETUP.md) - PostgreSQL Function Observability section
- **Architecture**: [docs/adr/0050-postgresql-function-observability.md](adr/0050-postgresql-function-observability.md)
- **Configuration Reference**: [CLAUDE.md](../CLAUDE.md) - Logstash Configuration section
- **Bugsink MCP Server**: [CLAUDE.md](../CLAUDE.md) - Sentry/Bugsink MCP Server Setup section