# Logstash Troubleshooting Runbook

This runbook provides step-by-step diagnostics and solutions for common Logstash issues in the PostgreSQL observability pipeline (ADR-050).

## Quick Reference

| Symptom | Most Likely Cause | Quick Check |
| ------------------------ | ---------------------------- | ------------------------------------- |
| No errors in Bugsink | Logstash not running | `systemctl status logstash` |
| Events not processed | Grok pattern mismatch | Check filter failures in stats |
| Wrong Bugsink project | Environment detection failed | Verify `pg_database` field extraction |
| 403 authentication error | Missing/wrong DSN key | Check `X-Sentry-Auth` header |
| 500 error from Bugsink | Invalid event format | Verify `event_id` and required fields |
| varchar(7) constraint | Unresolved `%{sentry_level}` | Add Ruby filter for level validation |
---
## Diagnostic Steps

### 1. Verify Logstash is Running

```bash
# Check service status
systemctl status logstash

# If stopped, start it
systemctl start logstash

# View recent logs
journalctl -u logstash -n 50 --no-pager
```

**Expected output:**

- Status: `active (running)`
- No error messages in recent logs
---
### 2. Check Configuration Syntax

```bash
# Test configuration file
/usr/share/logstash/bin/logstash --config.test_and_exit -f /etc/logstash/conf.d/bugsink.conf
```

**Expected output:**

```
Configuration OK
```

**If syntax errors:**

1. Review error message for line number
2. Check for missing braces, quotes, or commas
3. Verify plugin names are correct (e.g., `json`, `grok`, `uuid`, `http`)
---
### 3. Verify PostgreSQL Logs Are Being Read

```bash
# Check if log file exists and has content
ls -lh /var/log/postgresql/postgresql-$(date +%Y-%m-%d).log

# Check Logstash can read the file
sudo -u logstash cat /var/log/postgresql/postgresql-$(date +%Y-%m-%d).log | head -10
```

**Expected output:**

- Log file exists and is not empty
- Logstash user can read the file without permission errors

**If permission denied:**

```bash
# Check Logstash is in postgres group
groups logstash

# Should show: logstash : logstash adm postgres

# If not, add to group
usermod -a -G postgres logstash
systemctl restart logstash
```
---
### 4. Check Logstash Pipeline Stats

```bash
# Get pipeline statistics
curl -XGET 'localhost:9600/_node/stats/pipelines?pretty' | jq '.pipelines.main.plugins.filters'
```

**Key metrics to check:**

1. **Grok filter events:**
   - `"events.in"` - Total events received
   - `"events.out"` - Events successfully parsed
   - `"failures"` - Events that failed to parse

   **If failures > 0:** Grok pattern doesn't match log format. Check PostgreSQL log format.

2. **JSON filter events:**
   - `"events.in"` - Events received by JSON parser
   - `"events.out"` - Successfully parsed JSON

   **If events.in = 0:** Regex check `pg_message =~ /^\{/` is not matching. Verify fn_log() output format.

3. **UUID filter events:**
   - Should match number of errors being forwarded
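As a quick health signal, the ratio of `failures` to `events.in` can be computed from the two numbers the stats API returns. A small sketch with made-up sample values; on a live node, substitute the values reported by the `curl` command above:

```shell
# Hypothetical sample values, as read from the grok filter's stats
events_in=1000
failures=37

# Failure rate as a percentage, one decimal place
rate=$(awk -v total="$events_in" -v f="$failures" 'BEGIN { printf "%.1f", f / total * 100 }')
echo "grok failure rate: ${rate}%"
```

Any rate above 0% means some log lines are not matching the grok pattern and warrants review.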
---
### 5. Test Grok Pattern Manually

```bash
# Get a sample log line
tail -1 /var/log/postgresql/postgresql-$(date +%Y-%m-%d).log

# Example expected format:
# 2026-01-20 10:30:00 +05 [12345] flyer_crawler_prod@flyer-crawler-prod WARNING: {"level":"WARNING","source":"postgresql",...}
```

**Pattern breakdown:**

```
%{TIMESTAMP_ISO8601:pg_timestamp}    # 2026-01-20 10:30:00
[+-]%{INT:pg_timezone}               # +05
\[%{POSINT:pg_pid}\]                 # [12345]
%{DATA:pg_user}@%{DATA:pg_database}  # flyer_crawler_prod@flyer-crawler-prod
%{WORD:pg_level}:                    # WARNING:
%{GREEDYDATA:pg_message}             # (rest of line)
```

**If pattern doesn't match:**

1. Check PostgreSQL `log_line_prefix` setting in `/etc/postgresql/14/main/conf.d/observability.conf`
2. Should be: `log_line_prefix = '%t [%p] %u@%d '`
3. Restart PostgreSQL if changed: `systemctl restart postgresql`
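The grok pattern can be approximated offline as a POSIX extended regex, which lets you sanity-check a captured line with plain `grep` before touching Logstash. The regex below is a hand-written approximation of the pattern breakdown above, not exact grok semantics, and the sample line is illustrative:

```shell
# Approximate ERE equivalent of the grok pattern above
pattern='^[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2} [+-][0-9]+ \[[0-9]+\] [^@ ]+@[^ ]+ [A-Z]+:'

# Sample line in the expected format (illustrative, not from a live server)
line='2026-01-20 10:30:00 +05 [12345] flyer_crawler_prod@flyer-crawler-prod WARNING: {"level":"WARNING"}'

if printf '%s\n' "$line" | grep -Eq "$pattern"; then
  echo "pattern matches"
else
  echo "pattern DOES NOT match"
fi
```

On the server, set `line` to the output of the `tail -1` command above.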
---
### 6. Verify Environment Detection

```bash
# Check recent PostgreSQL logs for database field
tail -20 /var/log/postgresql/postgresql-$(date +%Y-%m-%d).log | grep -E "flyer-crawler-(prod|test)"
```

**Expected:**

- Production database: `flyer_crawler_prod@flyer-crawler-prod`
- Test database: `flyer_crawler_test@flyer-crawler-test`

**If database name doesn't match:**

- Check database connection string in application
- Verify `DB_DATABASE_PROD` and `DB_DATABASE_TEST` Gitea secrets
---
### 7. Test Bugsink API Connection

```bash
# Test production endpoint
curl -X POST https://bugsink.projectium.com/api/1/store/ \
  -H "X-Sentry-Auth: Sentry sentry_version=7, sentry_client=test/1.0, sentry_key=911aef02b9a548fa8fabb8a3c81abfe5" \
  -H "Content-Type: application/json" \
  -d '{
    "event_id": "12345678901234567890123456789012",
    "timestamp": "2026-01-20T10:30:00Z",
    "platform": "other",
    "level": "error",
    "logger": "test",
    "message": "Test error from troubleshooting"
  }'
```

**Expected response:**

- HTTP 200 OK
- Response body: `{"id": "..."}`

**If 403 Forbidden:**

- DSN key is wrong in `/etc/logstash/conf.d/bugsink.conf`
- Get correct key from Bugsink UI: Settings → Projects → DSN

**If 500 Internal Server Error:**

- Missing required fields (`event_id`, `timestamp`, `level`)
- Check `mapping` section in Logstash config
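Reusing the same hard-coded `event_id` can cause Bugsink to deduplicate repeated test events. A sketch for generating a fresh payload each time, using only standard tools (`od`, `date`); the field values mirror the request above:

```shell
# 32-hex-character event_id from /dev/urandom (Sentry's event_id format)
event_id=$(od -An -N16 -tx1 /dev/urandom | tr -d ' \n')

# Current time in ISO 8601 / UTC
timestamp=$(date -u +%Y-%m-%dT%H:%M:%SZ)

# Assemble the event body
printf '{"event_id":"%s","timestamp":"%s","platform":"other","level":"error","logger":"test","message":"Test error from troubleshooting"}\n' \
  "$event_id" "$timestamp"
```

The printed JSON can be piped into the `curl` request above with `-d @-` in place of the inline `-d '{...}'` body.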
---
### 8. Monitor Logstash Output in Real-Time

```bash
# Watch Logstash processing logs
journalctl -u logstash -f
```

**What to look for:**

- `"response code => 200"` - Successful forwarding to Bugsink
- `"response code => 403"` - Authentication failure
- `"response code => 500"` - Invalid event format
- Grok parse failures
---
## Common Issues and Solutions

### Issue 1: Grok Pattern Parse Failures

**Symptoms:**

- Logstash stats show increasing `"failures"` count
- No events reaching Bugsink

**Diagnosis:**

```bash
curl -XGET 'localhost:9600/_node/stats/pipelines?pretty' | jq '.pipelines.main.plugins.filters[] | select(.name == "grok") | .failures'
```

**Solution:**

1. Check PostgreSQL log format matches expected pattern
2. Verify `log_line_prefix` in PostgreSQL config
3. Test with sample log line using Grok Debugger (Kibana Dev Tools)
---
### Issue 2: JSON Filter Not Parsing fn_log() Output

**Symptoms:**

- Grok parses successfully but JSON filter shows 0 events
- `[fn_log]` fields missing in Logstash output

**Diagnosis:**

```bash
# Check if pg_message field contains JSON
tail -20 /var/log/postgresql/postgresql-$(date +%Y-%m-%d).log | grep "WARNING:" | grep "{"
```

**Solution:**

1. Verify `fn_log()` function exists in database:

   ```sql
   \df fn_log
   ```

2. Test `fn_log()` output format:

   ```sql
   SELECT fn_log('WARNING', 'test', 'Test message', '{"key":"value"}'::jsonb);
   ```

3. Check logs show JSON output starting with `{`
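The pipeline's `pg_message =~ /^\{/` conditional can be mimicked in the shell to reason about which messages would enter the JSON filter. A small sketch with illustrative messages:

```shell
# Mirrors the /^\{/ check: only messages starting with "{" are JSON-parsed
classify() {
  case "$1" in
    '{'*) echo "JSON payload: enters the json filter" ;;
    *)    echo "plain text: json filter skipped" ;;
  esac
}

classify '{"level":"WARNING","source":"postgresql"}'   # JSON payload
classify 'connection received: host=127.0.0.1'         # plain text
```

If `fn_log()` output is being classified as plain text, check for a prefix (such as extra context from `log_line_prefix`) landing in front of the `{`.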
---
### Issue 3: Events Going to Wrong Bugsink Project

**Symptoms:**

- Production errors appear in test project (or vice versa)

**Diagnosis:**

```bash
# Check database name detection in recent logs
tail -50 /var/log/postgresql/postgresql-$(date +%Y-%m-%d).log | grep -E "(flyer-crawler-prod|flyer-crawler-test)"
```

**Solution:**

1. Verify database names in filter section match actual database names
2. Check `pg_database` field is correctly extracted by grok pattern:

   ```bash
   # Enable debug output in Logstash config temporarily
   stdout { codec => rubydebug { metadata => true } }
   ```

3. Verify environment tagging in filter:
   - `pg_database == "flyer-crawler-prod"` → adds "production" tag → routes to project 1
   - `pg_database == "flyer-crawler-test"` → adds "test" tag → routes to project 3
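The tagging rules above amount to a lookup from database name to Bugsink project. Sketched as a shell function for reference (project IDs 1 and 3 as stated above):

```shell
# Database name -> environment tag -> Bugsink project
route_for() {
  case "$1" in
    flyer-crawler-prod) echo "production -> project 1" ;;
    flyer-crawler-test) echo "test -> project 3" ;;
    *)                  echo "no match -> event not routed" ;;
  esac
}

route_for flyer-crawler-prod   # production -> project 1
route_for flyer-crawler-test   # test -> project 3
```

The fall-through branch is the failure mode this issue describes: a database name that matches neither condition produces an event with no environment tag.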
---
### Issue 4: 403 Authentication Errors from Bugsink

**Symptoms:**

- Logstash logs show `response code => 403`
- Events not appearing in Bugsink

**Diagnosis:**

```bash
# Check Logstash output logs for authentication errors
journalctl -u logstash -n 100 | grep "403"
```

**Solution:**

1. Verify DSN key in `/etc/logstash/conf.d/bugsink.conf` matches Bugsink project
2. Get correct DSN from Bugsink UI:
   - Navigate to Settings → Projects → Click project
   - Copy "DSN" value
   - Extract key: `http://KEY@host/PROJECT_ID` → use KEY
3. Update `X-Sentry-Auth` header in Logstash config:

   ```conf
   "X-Sentry-Auth" => "Sentry sentry_version=7, sentry_client=logstash/1.0, sentry_key=YOUR_KEY_HERE"
   ```

4. Restart Logstash: `systemctl restart logstash`
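The key can be pulled out of a copied DSN with `sed` instead of by hand. The DSN below is a made-up placeholder, not a real key:

```shell
# DSN format: scheme://KEY@host/PROJECT_ID
dsn='https://abc123def456@bugsink.example.com/1'

# Capture everything between "://" and "@"
key=$(printf '%s\n' "$dsn" | sed -E 's#^[a-z]+://([^@]+)@.*#\1#')
echo "sentry_key=$key"   # sentry_key=abc123def456
```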
---
### Issue 5: 500 Errors from Bugsink

**Symptoms:**

- Logstash logs show `response code => 500`
- Bugsink logs show validation errors

**Diagnosis:**

```bash
# Check Bugsink logs for details
docker logs bugsink-web 2>&1 | tail -50
```

**Common causes:**

1. Missing `event_id` field
2. Invalid timestamp format
3. Missing required Sentry fields

**Solution:**

1. Verify `uuid` filter is generating `event_id`:

   ```conf
   uuid {
     target => "[@metadata][event_id]"
     overwrite => true
   }
   ```

2. Check `mapping` section includes all required fields:
   - `event_id` (UUID)
   - `timestamp` (ISO 8601)
   - `platform` (string)
   - `level` (error/warning/info)
   - `logger` (string)
   - `message` (string)
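A rough completeness check for an outgoing payload: confirm each required field name appears in the JSON before blaming Bugsink. This checks presence only, not value validity; the payload below is illustrative:

```shell
payload='{"event_id":"abc","timestamp":"2026-01-20T10:30:00Z","platform":"other","level":"error","logger":"test","message":"hi"}'

# Collect any required field names absent from the payload
missing=""
for f in event_id timestamp platform level logger message; do
  printf '%s' "$payload" | grep -q "\"$f\"" || missing="$missing $f"
done

if [ -z "$missing" ]; then
  echo "all required fields present"
else
  echo "missing:$missing"
fi
```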
---
### Issue 6: High Memory Usage by Logstash

**Symptoms:**

- Server running out of memory
- Logstash OOM killed

**Diagnosis:**

```bash
# Check Logstash memory usage
ps aux | grep logstash
systemctl status logstash
```

**Solution:**

1. Limit Logstash heap size in `/etc/logstash/jvm.options`:

   ```
   -Xms1g
   -Xmx1g
   ```

2. Restart Logstash: `systemctl restart logstash`
3. Monitor with: `top -p $(pgrep -f logstash)`
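`-Xms` and `-Xmx` should be set to the same value so the heap is allocated once and never resized. A quick consistency check against a jvm.options-style input (shown inline as a sample; on the server, read `/etc/logstash/jvm.options` instead):

```shell
# Sample jvm.options content (illustrative)
opts='-Xms1g
-Xmx1g'

# Extract the values after -Xms and -Xmx
xms=$(printf '%s\n' "$opts" | sed -n 's/^-Xms//p')
xmx=$(printf '%s\n' "$opts" | sed -n 's/^-Xmx//p')

if [ "$xms" = "$xmx" ]; then
  echo "heap settings consistent ($xms)"
else
  echo "mismatch: Xms=$xms Xmx=$xmx"
fi
```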
---
### Issue 7: Level Field Constraint Violation (varchar(7))

**Symptoms:**

- Bugsink returns HTTP 500 errors
- PostgreSQL errors: `value too long for type character varying(7)`
- Events fail to insert with the literal `%{sentry_level}` string (15 characters)

**Root Cause:**

When Logstash cannot determine the log level (no error patterns matched), the `sentry_level` field remains the unresolved placeholder `%{sentry_level}`. Bugsink's PostgreSQL schema has a `varchar(7)` constraint on the level field.

Valid Sentry levels (all <= 7 characters): `fatal`, `error`, `warning`, `info`, `debug`

**Diagnosis:**

```bash
# Check for HTTP 500 responses in Logstash logs
podman exec flyer-crawler-dev cat /var/log/logstash/logstash.log | grep "500"

# Check Bugsink for constraint violation errors
# Via MCP:
mcp__localerrors__list_issues({ project_id: 1, status: 'unresolved' })
```

**Solution:**

Add a Ruby filter block in `docker/logstash/bugsink.conf` to validate and normalize the `sentry_level` field before sending to Bugsink:

```ruby
# Add this AFTER all mutate filters that set sentry_level
# and BEFORE the output section

ruby {
  code => '
    level = event.get("sentry_level")
    # Check if level is invalid (nil, empty, contains placeholder, or too long)
    if level.nil? || level.to_s.empty? || level.to_s.include?("%{") || level.to_s.length > 7
      # Default to "error" for error-tagged events, "info" otherwise
      if event.get("tags")&.include?("error")
        event.set("sentry_level", "error")
      else
        event.set("sentry_level", "info")
      end
    else
      # Normalize to lowercase and validate
      normalized = level.to_s.downcase
      valid_levels = ["fatal", "error", "warning", "info", "debug"]
      unless valid_levels.include?(normalized)
        normalized = "error"
      end
      event.set("sentry_level", normalized)
    end
  '
}
```

**Key validations performed:**

1. Checks for nil or empty values
2. Detects unresolved placeholders (`%{...}`)
3. Enforces 7-character maximum length
4. Normalizes to lowercase
5. Validates against allowed Sentry levels
6. Defaults to "error" for error-tagged events, "info" otherwise

**Verification:**

```bash
# Restart Logstash
podman exec flyer-crawler-dev systemctl restart logstash

# Generate a test log that triggers the filter
podman exec flyer-crawler-dev pm2 restart flyer-crawler-api-dev

# Check no new HTTP 500 errors
podman exec flyer-crawler-dev cat /var/log/logstash/logstash.log | tail -50 | grep -E "(500|error)"
```
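The Ruby filter's decision table can be exercised offline with a small shell stand-in. This is a simplified sketch: unlike the filter, it does not consult event tags, so unresolved placeholders always fall back to `info`:

```shell
normalize_level() {
  # Lowercase the input, then classify it
  level=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$level" in
    fatal|error|warning|info|debug) echo "$level" ;;   # already valid
    ''|*'%{'*)                      echo "info"   ;;   # empty or unresolved placeholder
    *)                              echo "error"  ;;   # unknown level -> safest valid value
  esac
}

normalize_level WARNING             # warning
normalize_level '%{sentry_level}'   # info
normalize_level CRITICAL            # error
```

Every output fits the `varchar(7)` constraint, which is the invariant the filter exists to enforce.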
---
### Issue 8: Log File Rotation Issues

**Symptoms:**

- Logstash stops processing after log file rotates
- Sincedb file pointing to old inode

**Diagnosis:**

```bash
# Check sincedb file
cat /var/lib/logstash/sincedb_postgres

# Check current log file inode
ls -li /var/log/postgresql/postgresql-$(date +%Y-%m-%d).log
```

**Solution:**

1. Logstash should automatically detect rotation
2. If stuck, delete sincedb file (will reprocess recent logs):

   ```bash
   systemctl stop logstash
   rm /var/lib/logstash/sincedb_postgres
   systemctl start logstash
   ```
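The comparison between the sincedb entry and the live file reduces to matching inode numbers. A helper sketch with illustrative inode values; on the server, feed it the first column of the sincedb file (`awk '{print $1; exit}' /var/lib/logstash/sincedb_postgres`) and the current log file's inode (`stat -c %i` on the path above):

```shell
# "fresh" if Logstash's tracked inode matches the current file's inode
sincedb_state() {
  tracked=$1
  current=$2
  if [ -n "$tracked" ] && [ "$tracked" = "$current" ]; then
    echo "fresh"
  else
    echo "stale"
  fi
}

sincedb_state 131072 131072   # fresh  (same file)
sincedb_state 131072 131999   # stale  (rotated; Logstash may be stuck on the old inode)
```

A "stale" result after rotation is the trigger for the sincedb deletion procedure above.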
---
## Verification Checklist

After making any changes, verify the pipeline is working:

- [ ] Logstash is running: `systemctl status logstash`
- [ ] Configuration is valid: `/usr/share/logstash/bin/logstash --config.test_and_exit -f /etc/logstash/conf.d/bugsink.conf`
- [ ] No grok failures: `curl localhost:9600/_node/stats/pipelines?pretty | jq '.pipelines.main.plugins.filters[] | select(.name == "grok") | .failures'`
- [ ] Events being processed: `curl localhost:9600/_node/stats/pipelines?pretty | jq '.pipelines.main.events'`
- [ ] Test error appears in Bugsink: trigger a database function error and check the Bugsink UI
---
## Test Database Function Error

To generate a test error for verification:

```bash
# Connect to production database
sudo -u postgres psql -d flyer-crawler-prod

# Trigger an error (achievement not found)
SELECT award_achievement('00000000-0000-0000-0000-000000000001'::uuid, 'Nonexistent Badge');
\q
```

**Expected flow:**

1. PostgreSQL logs the error to `/var/log/postgresql/postgresql-YYYY-MM-DD.log`
2. Logstash reads and parses the log (within ~30 seconds)
3. Error appears in Bugsink project 1 (production)

**If error doesn't appear:**

- Check each diagnostic step above
- Review Logstash logs: `journalctl -u logstash -f`
---
## Related Documentation

- **Setup Guide**: [docs/BARE-METAL-SETUP.md](BARE-METAL-SETUP.md) - PostgreSQL Function Observability section
- **Architecture**: [docs/adr/0050-postgresql-function-observability.md](adr/0050-postgresql-function-observability.md)
- **Configuration Reference**: [CLAUDE.md](../CLAUDE.md) - Logstash Configuration section
- **Bugsink MCP Server**: [CLAUDE.md](../CLAUDE.md) - Sentry/Bugsink MCP Server Setup section