# Logstash Troubleshooting Runbook
This runbook provides step-by-step diagnostics and solutions for common Logstash issues in the PostgreSQL observability pipeline (ADR-050).
## Quick Reference
| Symptom | Most Likely Cause | Quick Check |
|---|---|---|
| No errors in Bugsink | Logstash not running | `systemctl status logstash` |
| Events not processed | Grok pattern mismatch | Check filter failures in stats |
| Wrong Bugsink project | Environment detection failed | Verify `pg_database` field extraction |
| 403 authentication error | Missing/wrong DSN key | Check `X-Sentry-Auth` header |
| 500 error from Bugsink | Invalid event format | Verify `event_id` and required fields |
| varchar(7) constraint | Unresolved `%{sentry_level}` | Add Ruby filter for level validation |
## Diagnostic Steps
### 1. Verify Logstash is Running
```bash
# Check service status
systemctl status logstash

# If stopped, start it
systemctl start logstash

# View recent logs
journalctl -u logstash -n 50 --no-pager
```
Expected output:

- Status: `active (running)`
- No error messages in recent logs
### 2. Check Configuration Syntax
```bash
# Test configuration file
/usr/share/logstash/bin/logstash --config.test_and_exit -f /etc/logstash/conf.d/bugsink.conf
```
Expected output:

```
Configuration OK
```
If there are syntax errors:

- Review the error message for the line number
- Check for missing braces, quotes, or commas
- Verify plugin names are correct (e.g., `json`, `grok`, `uuid`, `http`)
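In a deploy script, the same check can gate a restart, since the command exits non-zero on an invalid config. A minimal sketch:

```bash
# Pre-restart gate: abort if the config fails validation
if ! /usr/share/logstash/bin/logstash --config.test_and_exit -f /etc/logstash/conf.d/bugsink.conf; then
  echo "bugsink.conf failed validation; not restarting Logstash" >&2
  exit 1
fi
systemctl restart logstash
```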
### 3. Verify PostgreSQL Logs Are Being Read
```bash
# Check if log file exists and has content
ls -lh /var/log/postgresql/postgresql-$(date +%Y-%m-%d).log

# Check Logstash can read the file
sudo -u logstash cat /var/log/postgresql/postgresql-$(date +%Y-%m-%d).log | head -10
```
Expected output:

- Log file exists and is not empty
- The logstash user can read the file without permission errors
If permission denied:

```bash
# Check Logstash is in the postgres group
groups logstash
# Should show: logstash : logstash adm postgres

# If not, add it to the group
usermod -a -G postgres logstash
systemctl restart logstash
```
### 4. Check Logstash Pipeline Stats
```bash
# Get pipeline statistics
curl -XGET 'localhost:9600/_node/stats/pipelines?pretty' | jq '.pipelines.main.plugins.filters'
```
Key metrics to check:

- **Grok filter events:**
  - `"events.in"` - total events received
  - `"events.out"` - events successfully parsed
  - `"failures"` - events that failed to parse

  If `failures > 0`, the grok pattern doesn't match the log format. Check the PostgreSQL log format.

- **JSON filter events:**
  - `"events.in"` - events received by the JSON parser
  - `"events.out"` - successfully parsed JSON

  If `events.in = 0`, the regex check `pg_message =~ /^\{/` is not matching. Verify the `fn_log()` output format.

- **UUID filter events:**
  - Should match the number of errors being forwarded
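For ongoing monitoring, the grok failure counter can be polled in a loop. A minimal sketch using `watch`, with the same jq path as the stats call above:

```bash
# Poll the grok failure count every 5 seconds; a climbing number means live parse failures
watch -n 5 "curl -s localhost:9600/_node/stats/pipelines | jq '.pipelines.main.plugins.filters[] | select(.name == \"grok\") | .failures'"
```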
### 5. Test Grok Pattern Manually
```bash
# Get a sample log line
tail -1 /var/log/postgresql/postgresql-$(date +%Y-%m-%d).log

# Example expected format:
# 2026-01-20 10:30:00 +05 [12345] flyer_crawler_prod@flyer-crawler-prod WARNING: {"level":"WARNING","source":"postgresql",...}
```
Pattern breakdown:

```
%{TIMESTAMP_ISO8601:pg_timestamp}     # 2026-01-20 10:30:00
[+-]%{INT:pg_timezone}                # +05
\[%{POSINT:pg_pid}\]                  # [12345]
%{DATA:pg_user}@%{DATA:pg_database}   # flyer_crawler_prod@flyer-crawler-prod
%{WORD:pg_level}:                     # WARNING:
%{GREEDYDATA:pg_message}              # (rest of line)
```
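To exercise the pattern outside the main pipeline, a throwaway Logstash instance can be run against a single pasted line. A sketch, assuming the pattern above and the default install path:

```bash
# Feed one sample line through a one-off pipeline and inspect the parsed fields
echo '2026-01-20 10:30:00 +05 [12345] flyer_crawler_prod@flyer-crawler-prod WARNING: {"level":"WARNING"}' | \
  /usr/share/logstash/bin/logstash -e '
    input { stdin {} }
    filter {
      grok {
        match => { "message" => "%{TIMESTAMP_ISO8601:pg_timestamp} [+-]%{INT:pg_timezone} \[%{POSINT:pg_pid}\] %{DATA:pg_user}@%{DATA:pg_database} %{WORD:pg_level}: %{GREEDYDATA:pg_message}" }
      }
    }
    output { stdout { codec => rubydebug } }'
```

If the output shows a `_grokparsefailure` tag instead of the extracted fields, the pattern and the live log format have diverged.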
If the pattern doesn't match:

- Check the PostgreSQL `log_line_prefix` setting in `/etc/postgresql/14/main/conf.d/observability.conf`
- It should be: `log_line_prefix = '%t [%p] %u@%d '`
- Restart PostgreSQL if changed: `systemctl restart postgresql`
### 6. Verify Environment Detection
```bash
# Check recent PostgreSQL logs for database field
tail -20 /var/log/postgresql/postgresql-$(date +%Y-%m-%d).log | grep -E "flyer-crawler-(prod|test)"
```
Expected:

- Production database: `flyer_crawler_prod@flyer-crawler-prod`
- Test database: `flyer_crawler_test@flyer-crawler-test`
If the database name doesn't match:

- Check the database connection string in the application
- Verify the `DB_DATABASE_PROD` and `DB_DATABASE_TEST` Gitea secrets
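As a quick sanity check, the databases that actually exist on the server can be listed. A sketch, using the database names from this runbook:

```bash
# List databases and confirm the expected names exist
sudo -u postgres psql -lqt | cut -d '|' -f 1 | grep -E 'flyer-crawler-(prod|test)'
```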
### 7. Test Bugsink API Connection
```bash
# Test production endpoint
curl -X POST https://bugsink.projectium.com/api/1/store/ \
  -H "X-Sentry-Auth: Sentry sentry_version=7, sentry_client=test/1.0, sentry_key=911aef02b9a548fa8fabb8a3c81abfe5" \
  -H "Content-Type: application/json" \
  -d '{
    "event_id": "12345678901234567890123456789012",
    "timestamp": "2026-01-20T10:30:00Z",
    "platform": "other",
    "level": "error",
    "logger": "test",
    "message": "Test error from troubleshooting"
  }'
```
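The `event_id` above is a fixed 32-character placeholder; Sentry-compatible APIs expect a 32-character hex UUID with no dashes. For repeated tests, a fresh one can be generated, as in this sketch:

```bash
# Generate a 32-character hex event_id (a UUID with dashes stripped)
EVENT_ID=$(uuidgen | tr -d '-')
echo "$EVENT_ID"
```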
Expected response:

- HTTP 200 OK
- Response body: `{"id": "..."}`
If 403 Forbidden:

- The DSN key in `/etc/logstash/conf.d/bugsink.conf` is wrong
- Get the correct key from the Bugsink UI: Settings → Projects → DSN

If 500 Internal Server Error:

- Missing required fields (`event_id`, `timestamp`, `level`)
- Check the `mapping` section in the Logstash config
### 8. Monitor Logstash Output in Real-Time
```bash
# Watch Logstash processing logs
journalctl -u logstash -f
```
What to look for:

- `response code => 200` - successful forwarding to Bugsink
- `response code => 403` - authentication failure
- `response code => 500` - invalid event format
- Grok parse failures
## Common Issues and Solutions
### Issue 1: Grok Pattern Parse Failures
Symptoms:

- Logstash stats show an increasing `"failures"` count
- No events reaching Bugsink
Diagnosis:

```bash
curl -XGET 'localhost:9600/_node/stats/pipelines?pretty' | jq '.pipelines.main.plugins.filters[] | select(.name == "grok") | .failures'
```
Solution:

- Check that the PostgreSQL log format matches the expected pattern
- Verify `log_line_prefix` in the PostgreSQL config
- Test with a sample log line using the Grok Debugger (Kibana Dev Tools)
### Issue 2: JSON Filter Not Parsing fn_log() Output
Symptoms:

- Grok parses successfully but the JSON filter shows 0 events
- `[fn_log]` fields missing in Logstash output
Diagnosis:

```bash
# Check if the pg_message field contains JSON
tail -20 /var/log/postgresql/postgresql-$(date +%Y-%m-%d).log | grep "WARNING:" | grep "{"
```
Solution:

- Verify the `fn_log()` function exists in the database: `\df fn_log`
- Test the `fn_log()` output format: `SELECT fn_log('WARNING', 'test', 'Test message', '{"key":"value"}'::jsonb);`
- Check that logs show JSON output starting with `{`
### Issue 3: Events Going to Wrong Bugsink Project
Symptoms:
- Production errors appear in test project (or vice versa)
Diagnosis:

```bash
# Check database name detection in recent logs
tail -50 /var/log/postgresql/postgresql-$(date +%Y-%m-%d).log | grep -E "(flyer-crawler-prod|flyer-crawler-test)"
```
Solution:

- Verify the database names in the filter section match the actual database names
- Check the `pg_database` field is correctly extracted by the grok pattern:

  ```
  # Enable debug output in Logstash config temporarily
  stdout { codec => rubydebug { metadata => true } }
  ```

- Verify the environment tagging in the filter:
  - `pg_database == "flyer-crawler-prod"` → adds "production" tag → routes to project 1
  - `pg_database == "flyer-crawler-test"` → adds "test" tag → routes to project 3
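To confirm what the shipped config actually routes, the conditionals and per-project store endpoints can be grepped directly. A sketch, with the config path assumed from this runbook:

```bash
# Show routing conditionals and the per-project store endpoints side by side
grep -n -E 'flyer-crawler-(prod|test)|/api/[0-9]+/store' /etc/logstash/conf.d/bugsink.conf
```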
### Issue 4: 403 Authentication Errors from Bugsink
Symptoms:

- Logstash logs show `response code => 403`
- Events not appearing in Bugsink
Diagnosis:

```bash
# Check Logstash output logs for authentication errors
journalctl -u logstash -n 100 | grep "403"
```
Solution:

1. Verify the DSN key in `/etc/logstash/conf.d/bugsink.conf` matches the Bugsink project
2. Get the correct DSN from the Bugsink UI:
   - Navigate to Settings → Projects → click the project
   - Copy the "DSN" value
   - Extract the key: `http://KEY@host/PROJECT_ID` → use KEY
3. Update the `X-Sentry-Auth` header in the Logstash config:

   ```
   "X-Sentry-Auth" => "Sentry sentry_version=7, sentry_client=logstash/1.0, sentry_key=YOUR_KEY_HERE"
   ```

4. Restart Logstash: `systemctl restart logstash`
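If helpful, the key can be extracted from a copied DSN with a shell one-liner. A sketch; the `DSN` value below is a placeholder:

```bash
# Pull the key out of a DSN of the form scheme://KEY@host/PROJECT_ID
DSN='https://KEY@bugsink.projectium.com/1'
echo "$DSN" | sed -E 's#^https?://([^@]+)@.*$#\1#'
```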
### Issue 5: 500 Errors from Bugsink
Symptoms:

- Logstash logs show `response code => 500`
- Bugsink logs show validation errors
Diagnosis:

```bash
# Check Bugsink logs for details
docker logs bugsink-web 2>&1 | tail -50
```
Common causes:

- Missing `event_id` field
- Invalid timestamp format
- Missing required Sentry fields
Solution:

- Verify the `uuid` filter is generating `event_id`:

  ```
  uuid {
    target => "[@metadata][event_id]"
    overwrite => true
  }
  ```

- Check the `mapping` section includes all required fields:
  - `event_id` (UUID)
  - `timestamp` (ISO 8601)
  - `platform` (string)
  - `level` (error/warning/info)
  - `logger` (string)
  - `message` (string)
### Issue 6: High Memory Usage by Logstash
Symptoms:

- Server running out of memory
- Logstash OOM-killed
Diagnosis:

```bash
# Check Logstash memory usage
ps aux | grep logstash
systemctl status logstash
```
Solution:

- Limit the Logstash heap size in `/etc/logstash/jvm.options`:

  ```
  -Xms1g
  -Xmx1g
  ```

- Restart Logstash: `systemctl restart logstash`
- Monitor with: `top -p $(pgrep -f logstash)`
### Issue 7: Level Field Constraint Violation (varchar(7))
Symptoms:

- Bugsink returns HTTP 500 errors
- PostgreSQL errors: `value too long for type character varying(7)`
- Events fail to insert with the literal `%{sentry_level}` string (16 characters)
Root Cause:

When Logstash cannot determine the log level (no error patterns matched), the `sentry_level` field remains the unresolved 16-character placeholder `%{sentry_level}`. Bugsink's PostgreSQL schema has a `varchar(7)` constraint on the `level` field.

Valid Sentry levels (all <= 7 characters): `fatal`, `error`, `warning`, `info`, `debug`
Diagnosis:

```bash
# Check for HTTP 500 responses in Logstash logs
podman exec flyer-crawler-dev cat /var/log/logstash/logstash.log | grep "500"

# Check Bugsink for constraint violation errors
# Via MCP:
mcp__localerrors__list_issues({ project_id: 1, status: 'unresolved' })
```
Solution:

Add a Ruby filter block in `docker/logstash/bugsink.conf` to validate and normalize the `sentry_level` field before sending events to Bugsink:
```
# Add this AFTER all mutate filters that set sentry_level
# and BEFORE the output section
ruby {
  code => '
    level = event.get("sentry_level")

    # Check if level is invalid (nil, empty, contains placeholder, or too long)
    if level.nil? || level.to_s.empty? || level.to_s.include?("%{") || level.to_s.length > 7
      # Default to "error" for error-tagged events, "info" otherwise
      if event.get("tags")&.include?("error")
        event.set("sentry_level", "error")
      else
        event.set("sentry_level", "info")
      end
    else
      # Normalize to lowercase and validate
      normalized = level.to_s.downcase
      valid_levels = ["fatal", "error", "warning", "info", "debug"]
      unless valid_levels.include?(normalized)
        normalized = "error"
      end
      event.set("sentry_level", normalized)
    end
  '
}
```
Key validations performed:

- Checks for nil or empty values
- Detects unresolved placeholders (`%{...}`)
- Enforces the 7-character maximum length
- Normalizes to lowercase
- Validates against the allowed Sentry levels
- Defaults to "error" for error-tagged events, "info" otherwise
Verification:

```bash
# Restart Logstash
podman exec flyer-crawler-dev systemctl restart logstash

# Generate a test log that triggers the filter
podman exec flyer-crawler-dev pm2 restart flyer-crawler-api-dev

# Check no new HTTP 500 errors
podman exec flyer-crawler-dev cat /var/log/logstash/logstash.log | tail -50 | grep -E "(500|error)"
```
### Issue 8: Log File Rotation Issues
Symptoms:

- Logstash stops processing after the log file rotates
- The sincedb file points to an old inode
Diagnosis:

```bash
# Check sincedb file
cat /var/lib/logstash/sincedb_postgres

# Check current log file inode
ls -li /var/log/postgresql/postgresql-$(date +%Y-%m-%d).log
```
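To compare the two directly: the first column of each sincedb line is the inode Logstash is tracking. A quick sketch:

```bash
# Inode Logstash is tracking (first column of each sincedb line)
awk '{print $1}' /var/lib/logstash/sincedb_postgres

# Inode of today's live log file; if these differ, Logstash may be stuck on the rotated file
stat -c '%i' /var/log/postgresql/postgresql-$(date +%Y-%m-%d).log
```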
Solution:

- Logstash should automatically detect rotation
- If stuck, delete the sincedb file (recent logs will be reprocessed):

  ```bash
  systemctl stop logstash
  rm /var/lib/logstash/sincedb_postgres
  systemctl start logstash
  ```
## Verification Checklist
After making any changes, verify the pipeline is working:
- Logstash is running: `systemctl status logstash`
- Configuration is valid: `/usr/share/logstash/bin/logstash --config.test_and_exit -f /etc/logstash/conf.d/bugsink.conf`
- No grok failures: `curl localhost:9600/_node/stats/pipelines?pretty | jq '.pipelines.main.plugins.filters[] | select(.name == "grok") | .failures'`
- Events being processed: `curl localhost:9600/_node/stats/pipelines?pretty | jq '.pipelines.main.events'`
- Test error appears in Bugsink: trigger a database function error and check the Bugsink UI
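If running these checks gets repetitive, they can be bundled into a small script. A sketch, using the paths and ports from this runbook:

```bash
#!/usr/bin/env bash
# Quick pipeline health check; exits non-zero on the first failure
set -e

systemctl is-active --quiet logstash && echo "logstash: running"

/usr/share/logstash/bin/logstash --config.test_and_exit \
  -f /etc/logstash/conf.d/bugsink.conf >/dev/null && echo "config: valid"

# Sum grok failure counters across filters (0 if none reported)
failures=$(curl -s 'localhost:9600/_node/stats/pipelines' \
  | jq '[.pipelines.main.plugins.filters[] | select(.name == "grok") | .failures] | add // 0')
echo "grok failures: $failures"
```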
### Test Database Function Error
To generate a test error for verification:

```bash
# Connect to production database
sudo -u postgres psql -d flyer-crawler-prod

# Trigger an error (achievement not found)
SELECT award_achievement('00000000-0000-0000-0000-000000000001'::uuid, 'Nonexistent Badge');
\q
```
Expected flow:

1. PostgreSQL logs the error to `/var/log/postgresql/postgresql-YYYY-MM-DD.log`
2. Logstash reads and parses the log (within ~30 seconds)
3. The error appears in Bugsink project 1 (production)
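While the test error propagates, the pipeline can be watched end-to-end. A sketch: `award_achievement` is the function triggered above, and `response code` lines come from the Logstash HTTP output, as described in step 8:

```bash
# Watch Logstash pick up the test error and forward it to Bugsink
journalctl -u logstash -f | grep -E 'award_achievement|response code'
```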
If the error doesn't appear:

- Check each diagnostic step above
- Review the Logstash logs: `journalctl -u logstash -f`
## Related Documentation
- Setup Guide: `docs/BARE-METAL-SETUP.md` - PostgreSQL Function Observability section
- Architecture: `docs/adr/0050-postgresql-function-observability.md`
- Configuration Reference: `CLAUDE.md` - Logstash Configuration section
- Bugsink MCP Server: `CLAUDE.md` - Sentry/Bugsink MCP Server Setup section