docs/adr/0058-browser-test-performance-optimization.md

# ADR-0058: Browser Test Performance Optimization

**Status**: Accepted

**Date**: 2026-02-10

**Authors**: Claude Code AI Agent

## Context

### Current State

The stock-alert project has 59 Playwright browser tests across 5 spec files taking approximately 240 seconds (~4 minutes) to execute. Analysis reveals three major performance bottlenecks:

| Metric | Count | Impact |
| ----------------------------------- | ----- | -------------------------------------------- |
| Hardcoded `waitForTimeout()` calls | 66 | ~120s cumulative wait time |
| Redundant login executions per run | 43 | ~2-3s each = 86-129s overhead |
| Visual regression tests blocking CI | 4 | Cannot run in parallel with functional tests |

### Test Distribution

| File | Tests | `waitForTimeout` Calls | `login()` Calls |
| ------------------- | ------ | ---------------------- | ------------------------ |
| `dashboard.spec.js` | 10 | 8 | 10 |
| `alerts.spec.js` | 14 | 25 | 1 (beforeEach) |
| `gaps.spec.js` | 20 | 29 | 1 (beforeEach) |
| `login.spec.js` | 11 | 4 | 0 (tests login itself) |
| `visual.spec.js` | 4 | 0 | 4 (via navigateWithAuth) |
| **Total** | **59** | **66** | **16 call sites** |

### Root Causes

1. **Anti-Pattern: Hardcoded Timeouts**
   - `waitForTimeout(2000)` used to "wait for data to load"
   - Unnecessarily slow on fast systems, flaky on slow systems
   - No correlation to actual page readiness

2. **Anti-Pattern: Per-Test Authentication**
   - Each test navigates to `/login`, enters password, submits
   - Session cookie persists across requests but not across tests
   - `beforeEach` login adds 2-3 seconds per test

3. **Architecture: Mixed Test Types**
   - Visual regression tests require different infrastructure (baseline images)
   - Functional tests and visual tests compete for worker slots
   - Cannot optimize CI parallelization

### Requirements

1. Reduce test suite runtime by 40-55%
2. Improve test determinism (eliminate flakiness)
3. Maintain test coverage and reliability
4. Enable parallel CI execution where possible
5. Document patterns for other projects

## Decision

Implement three optimization phases:

### Phase 1: Event-Based Wait Replacement (Primary Impact: ~50% of time savings)

Replace all 66 `waitForTimeout()` calls with Playwright's event-based waiting APIs.

**Replacement Patterns:**

| Current Pattern | Replacement | Rationale |
| --------------------------------------- | ------------------------------------------------- | ----------------------------- |
| `waitForTimeout(2000)` after navigation | `waitForLoadState('networkidle')` | Waits for network quiescence |
| `waitForTimeout(1000)` after click | `waitForSelector('.result')` | Waits for specific DOM change |
| `waitForTimeout(3000)` for charts | `waitForSelector('canvas', { state: 'visible' })` | Waits for chart render |
| `waitForTimeout(500)` for viewport | `waitForFunction(() => ...)` | Waits for layout reflow |

**Implementation Examples:**

```javascript
// BEFORE: Hardcoded timeout
await page.goto('/alerts');
await page.waitForTimeout(2000);
const rows = await page.locator('tbody tr').count();

// AFTER: Event-based wait
await page.goto('/alerts');
await page.waitForLoadState('networkidle');
await page.waitForSelector('tbody tr', { state: 'attached' });
const rows = await page.locator('tbody tr').count();
```

```javascript
// BEFORE: Hardcoded timeout after action
await page.click('#runCheckBtn');
await page.waitForTimeout(2000);

// AFTER: Wait for response
const [response] = await Promise.all([
  page.waitForResponse((resp) => resp.url().includes('/api/check')),
  page.click('#runCheckBtn'),
]);
```

**Helper Function Addition to `helpers.js`:**

```javascript
/**
 * Waits for page to be fully loaded with data.
 * Replaces hardcoded waitForTimeout calls.
 */
async function waitForPageReady(page, options = {}) {
  const { dataSelector = null, networkIdle = true, minTime = 0 } = options;

  const promises = [];

  if (networkIdle) {
    promises.push(page.waitForLoadState('networkidle'));
  }

  if (dataSelector) {
    promises.push(page.waitForSelector(dataSelector, { state: 'visible' }));
  }

  if (minTime > 0) {
    promises.push(page.waitForTimeout(minTime)); // Escape hatch for animations
  }

  await Promise.all(promises);
}
```

**Estimated Time Savings:** 60-80 seconds (eliminates ~120s of cumulative waits, but event waits have overhead)

### Phase 2: Global Authentication Setup (Primary Impact: ~35% of time savings)

Share authenticated session across all tests using Playwright's global setup feature.

**Architecture:**

```
                     ┌──────────────────┐
                     │ global-setup.js  │
                     │                  │
                     │ 1. Login once    │
                     │ 2. Save storage  │
                     └────────┬─────────┘
                              │
         ┌────────────────────┼────────────────────┐
         │                    │                    │
         ▼                    ▼                    ▼
┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐
│ dashboard.spec  │  │ alerts.spec     │  │ gaps.spec       │
│ (reuses auth)   │  │ (reuses auth)   │  │ (reuses auth)   │
└─────────────────┘  └─────────────────┘  └─────────────────┘
```

**Implementation Files:**

**`tests/browser/global-setup.js`:**

```javascript
const { chromium } = require('@playwright/test');
const path = require('path');

const authFile = path.join(__dirname, '.auth', 'user.json');
// baseURL from playwright.config.js does not apply to this standalone
// browser, so build absolute URLs explicitly.
const baseURL = process.env.PLAYWRIGHT_BASE_URL || 'http://localhost:8980';

module.exports = async function globalSetup() {
  const browser = await chromium.launch();
  const page = await browser.newPage();

  // Only perform login if authentication is enabled
  if (process.env.DASHBOARD_PASSWORD) {
    // Perform login
    await page.goto(`${baseURL}/login`);
    await page.fill('#password', process.env.DASHBOARD_PASSWORD);
    await page.click('button[type="submit"]');
    await page.waitForURL(`${baseURL}/`);

    // Save authentication state
    await page.context().storageState({ path: authFile });
  }

  await browser.close();
};
```
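
For reference, the file written by `storageState` follows Playwright's documented shape: a `cookies` array plus an `origins` array for localStorage. A rough illustration of `user.json` — the field values below are invented, not taken from this project:

```json
{
  "cookies": [
    {
      "name": "session",
      "value": "<opaque-session-token>",
      "domain": "localhost",
      "path": "/",
      "expires": -1,
      "httpOnly": true,
      "secure": false,
      "sameSite": "Lax"
    }
  ],
  "origins": []
}
```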

**`playwright.config.js` Updates:**

```javascript
module.exports = defineConfig({
  // ... existing config ...

  // Global setup runs once before all tests
  globalSetup: require.resolve('./tests/browser/global-setup.js'),

  projects: [
    {
      name: 'chromium',
      use: {
        ...devices['Desktop Chrome'],
        // Reuse authentication state from global setup
        storageState: './tests/browser/.auth/user.json',
      },
    },
  ],
});
```

**Test File Updates:**

```javascript
// BEFORE: Login in beforeEach
test.beforeEach(async ({ page }) => {
  page.consoleErrors = captureConsoleErrors(page);
  if (isAuthEnabled()) {
    await login(page);
  }
});

// AFTER: Remove login (handled by global setup)
test.beforeEach(async ({ page }) => {
  page.consoleErrors = captureConsoleErrors(page);
  // Authentication already applied via storageState
});
```

**Estimated Time Savings:** 80-100 seconds (43 logins x ~2-3s each, minus 3s for global setup)

### Phase 3: Visual Test Separation (Primary Impact: CI parallelization)

Separate visual regression tests into a dedicated project for parallel CI execution.

**Project Configuration:**

```javascript
// playwright.config.js
module.exports = defineConfig({
  projects: [
    // Functional tests - fast, event-based
    {
      name: 'functional',
      testMatch: /^(?!.*visual).*\.spec\.js$/,
      use: {
        ...devices['Desktop Chrome'],
        storageState: './tests/browser/.auth/user.json',
      },
    },
    // Visual tests - separate baseline management
    {
      name: 'visual',
      testMatch: '**/visual.spec.js',
      use: {
        ...devices['Desktop Chrome'],
        storageState: './tests/browser/.auth/user.json',
      },
      // Different snapshot handling
      snapshotPathTemplate: '{testDir}/__screenshots__/{projectName}/{testFilePath}/{arg}{ext}',
    },
  ],
});
```
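
As a sanity check, the `functional` project's `testMatch` regex can be exercised against the five spec filenames (a standalone sketch, not project code):

```javascript
// The negative lookahead excludes any path containing "visual".
const functionalMatch = /^(?!.*visual).*\.spec\.js$/;

const specs = [
  'dashboard.spec.js',
  'alerts.spec.js',
  'gaps.spec.js',
  'login.spec.js',
  'visual.spec.js',
];

const functional = specs.filter((f) => functionalMatch.test(f));
console.log(functional); // visual.spec.js is filtered out
```

If the login tests also need to be excluded from this project (see the `login.spec.js` handling later in this ADR), the lookahead can be widened, e.g. `/^(?!.*(visual|login)).*\.spec\.js$/`.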

**CI Pipeline Updates:**

```yaml
# .gitea/workflows/test.yml
jobs:
  browser-functional:
    runs-on: ubuntu-latest
    steps:
      - run: npx playwright test --project=functional

  browser-visual:
    runs-on: ubuntu-latest
    steps:
      - run: npx playwright test --project=visual
```

**Estimated Time Savings:** 30-45 seconds (parallel execution vs sequential)

## Implementation Schedule

### Critical Path (Estimated 8-12 hours)

```
Phase 1 (Event Waits)       ████████████████ [4-6h]
                                   │
Phase 2 (Global Auth)       ████████ [2-3h]
                                   │
Phase 3 (Visual Separation) ████ [2-3h]
```

### Effort Summary

| Phase | Min Hours | Max Hours | Expected Savings |
| ------------------------ | --------- | --------- | --------------------- |
| 1. Event-Based Waits | 4 | 6 | 60-80s (25-33%) |
| 2. Global Authentication | 2 | 3 | 80-100s (33-42%) |
| 3. Visual Separation | 2 | 3 | 30-45s (CI parallel) |
| **Total** | **8** | **12** | **170-225s (70-94%)** |

### Expected Results

| Metric | Before | After | Improvement |
| ------------------ | ------ | --------- | ------------- |
| Total Runtime | 240s | 110-140s | 42-54% faster |
| Flaky Test Rate | ~5% | <1% | 80% reduction |
| CI Parallelization | None | 2 workers | 2x throughput |
| Login Operations | 43 | 1 | 98% reduction |
| Hardcoded Waits | 66 | <5 | 92% reduction |

## Consequences

### Positive

1. **Performance**: 40-55% reduction in test runtime
2. **Reliability**: Event-based waits eliminate timing flakiness
3. **Scalability**: Global setup pattern scales to N tests with O(1) login cost
4. **CI Efficiency**: Parallel visual tests enable faster feedback loops
5. **Maintainability**: Centralized auth logic reduces code duplication
6. **Transferable Knowledge**: Patterns applicable to any Playwright project

### Negative

1. **Initial Migration Effort**: 8-12 hours of refactoring
2. **Learning Curve**: Team must understand Playwright wait APIs
3. **Global Setup Complexity**: Adds shared state between tests
4. **Harder Debugging**: Shared auth can mask test isolation issues

### Mitigations

| Risk | Mitigation |
| ------------------ | ------------------------------------------------------------- |
| Global setup fails | Add retry logic; fallback to per-test login |
| Event waits flaky | Keep small timeout buffer (100ms) as escape hatch |
| Visual tests drift | Separate baseline management per environment |
| Test isolation | Run `--project=functional` without auth for isolation testing |

### Neutral

- Test count unchanged (59 tests)
- Coverage unchanged
- Visual baselines unchanged (path changes only)

## Alternatives Considered

### Alternative 1: Reduce Test Count

**Rejected:** Sacrifices coverage for speed. Tests exist for a reason.

### Alternative 2: Increase Worker Parallelism

**Rejected:** Server cannot handle >2 concurrent sessions reliably; creates resource contention.

### Alternative 3: Use `page.waitForTimeout()` with Shorter Durations

**Rejected:** Addresses the symptom, not the root cause. Still creates timing-dependent tests.

### Alternative 4: Cookie Injection Instead of Login

**Rejected:** Requires reverse-engineering the session format; brittle if auth changes.

### Alternative 5: HTTP API Authentication (No Browser)

**Rejected:** Loses browser session behavior validation; the login flow itself still needs to be exercised in a real browser.

## Implementation Details

### Wait Replacement Mapping

| File | Current Timeouts | Replacement Strategy |
| ------------------- | ---------------------- | ---------------------------------------------------------------------- |
| `dashboard.spec.js` | 1000ms, 2000ms, 3000ms | `waitForSelector` for charts, `waitForLoadState` for navigation |
| `alerts.spec.js` | 500ms, 1000ms, 2000ms | `waitForResponse` for API calls, `waitForSelector` for table rows |
| `gaps.spec.js` | 500ms, 1000ms, 2000ms | `waitForResponse` for `/api/gaps`, `waitForSelector` for summary cards |
| `login.spec.js` | 500ms, 2000ms | `waitForURL` for redirects, `waitForSelector` for error messages |

### Common Wait Patterns for This Codebase

| Scenario | Recommended Pattern | Example |
| --------------------- | ------------------------------------------------- | ------------------------- |
| After page navigation | `waitForLoadState('networkidle')` | Loading dashboard data |
| After button click | `waitForResponse()` + `waitForSelector()` | Run Check button |
| After filter change | `waitForResponse(/api\/.*/)` | Status filter dropdown |
| For chart rendering | `waitForSelector('canvas', { state: 'visible' })` | Chart cards |
| For modal appearance | `waitForSelector('.modal', { state: 'visible' })` | Confirmation dialogs |
| For layout change | `waitForFunction()` | Responsive viewport tests |

### Auth Storage Structure

```
tests/browser/
├── .auth/
│   └── user.json         # Generated by global-setup, gitignored
├── global-setup.js       # Creates user.json
├── dashboard.spec.js     # Uses storageState
├── alerts.spec.js
├── gaps.spec.js
├── login.spec.js         # Tests login itself, may need special handling
└── visual.spec.js
```

**`.gitignore` Addition:**

```
tests/browser/.auth/
```

### Login.spec.js Special Handling

`login.spec.js` tests the login flow itself and must NOT use the shared auth state:

```javascript
// playwright.config.js
projects: [
  {
    name: 'functional',
    testMatch: /^(?!.*login).*\.spec\.js$/,
    use: { storageState: './tests/browser/.auth/user.json' },
  },
  {
    name: 'login',
    testMatch: '**/login.spec.js',
    use: { storageState: undefined }, // No auth - tests login flow
  },
];
```

## Testing the Optimization

### Baseline Measurement

```bash
# Before optimization: establish baseline
time npm run test:browser 2>&1 | tee baseline-timing.log
grep -E "passed|failed|skipped" baseline-timing.log
```

### Incremental Verification

```bash
# After Phase 1: verify wait replacement
npm run test:browser -- --reporter=list 2>&1 | grep -E "passed|failed|slow"

# After Phase 2: verify global auth
npm run test:browser -- --trace on
# Check trace for login occurrences (should be 1)

# After Phase 3: verify parallel execution
npm run test:browser -- --project=functional &
npm run test:browser -- --project=visual &
wait
```

### Success Criteria

| Metric | Target | Measurement |
| ---------------------- | ------ | -------------------------------------- |
| Total runtime | <150s | `time npm run test:browser` |
| Login count | 1 | Grep traces for `/login` navigation |
| Flaky rate | <2% | 50 consecutive CI runs |
| `waitForTimeout` count | <5 | `grep -r waitForTimeout tests/browser` |

## Lessons Learned / Patterns for Other Projects

### Pattern 1: Always Prefer Event-Based Waits

```javascript
// Bad
await page.click('#submit');
await page.waitForTimeout(2000);
expect(await page.title()).toBe('Success');

// Good
await Promise.all([page.waitForNavigation(), page.click('#submit')]);
expect(await page.title()).toBe('Success');
```

### Pattern 2: Global Setup for Authentication

Playwright's `storageState` feature should be the default for any authenticated app:

1. Create `global-setup.js` that performs login once
2. Save cookies/storage to JSON file
3. Configure `storageState` in `playwright.config.js`
4. Tests start authenticated with zero overhead

### Pattern 3: Separate Test Types by Execution Characteristics

| Test Type | Characteristics | Strategy |
| ---------- | ------------------------ | --------------------------------- |
| Functional | Fast, deterministic | Run first, gate deployment |
| Visual | Slow, baseline-dependent | Run in parallel, separate project |
| E2E | Cross-service, slow | Run nightly, separate workflow |

### Pattern 4: Measure Before and After

Always establish baseline metrics before optimization:

```bash
# Essential metrics to capture
time npm run test:browser     # Total runtime
grep -c waitForTimeout *.js   # Hardcoded wait count
grep -c 'await login' *.js    # Login call count
```

## Related ADRs

- [ADR-0031](0031-quality-gates-eslint-playwright.md): Quality Gates - ESLint, Pre-commit Hooks, and Playwright Browser Testing
- [ADR-0035](0035-browser-test-selector-fixes.md): Browser Test Selector Fixes
- [ADR-0008](0008-testing-strategy.md): Testing Strategy

## References

- Playwright Best Practices: https://playwright.dev/docs/best-practices
- Playwright Authentication: https://playwright.dev/docs/auth
- Playwright Wait Strategies: https://playwright.dev/docs/actionability
- Test Files: `tests/browser/*.spec.js`
- Helper Module: `tests/browser/helpers.js`
- Configuration: `playwright.config.js`
docs/adr/ADR-027-application-wide-structured-logging.md

# ADR-027: Standardized Application-Wide Structured Logging

**Date**: 2026-02-10

**Status**: Accepted

**Source**: Imported from flyer-crawler project (ADR-004)

**Related**: [ADR-017](ADR-017-structured-logging-with-pino.md), [ADR-028](ADR-028-client-side-structured-logging.md), [ADR-029](ADR-029-error-tracking-with-bugsink.md)

## Context

While ADR-017 established Pino as our logging framework, this ADR extends that foundation with application-wide standards for request tracing, context propagation, and structured log formats.

The implementation of logging can vary significantly across modules. The error handler middleware may produce high-quality, structured JSON logs for errors, while logging within route handlers and service layers becomes ad-hoc, using plain strings or inconsistent object structures.

This inconsistency leads to several problems:

- **Difficult Debugging**: It is hard to trace a single user request through the system or correlate events related to a specific operation
- **Ineffective Log Analysis**: Inconsistent log formats make it difficult to query, filter, and build dashboards in log management systems (such as Datadog, Splunk, or the ELK stack)
- **Security Risks**: There is no enforced standard for redacting sensitive information (such as passwords or tokens) in logs outside of the error handler, increasing the risk of accidental data exposure
- **Missing Context**: Logs often lack crucial context, such as a unique request ID, the authenticated user's ID, or the source IP address, making them less useful for diagnosing issues

## Decision

We will adopt a standardized, application-wide structured logging policy. All log entries MUST be in JSON format and adhere to a consistent schema.

### 1. Request-Scoped Logger with Context

We will create a middleware that runs at the beginning of the request lifecycle. This middleware will:

- Generate a unique `request_id` for each incoming request
- Create a request-scoped logger instance (a "child logger") that automatically includes the `request_id`, `user_id` (if authenticated), and `ip_address` in every log message it generates
- Attach this child logger to the `req` object (e.g., `req.log`)

### 2. Mandatory Use of Request-Scoped Logger

All route handlers and any service functions called by them **MUST** use the request-scoped logger (`req.log`) instead of the global logger instance. This ensures all logs for a given request are automatically correlated.
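
The effect of a child logger can be illustrated with a minimal sketch (not Pino's implementation, just the context-merging behavior this decision relies on):

```javascript
// A toy logger whose child() bakes extra context into every entry it emits.
function createLogger(bindings = {}) {
  return {
    child(extra) {
      return createLogger({ ...bindings, ...extra });
    },
    info(data, message) {
      // Real pino serializes to JSON; here we just return the merged entry.
      return { level: 'info', ...bindings, ...data, msg: message };
    },
  };
}

const reqLog = createLogger().child({ request_id: 'abc123', ip_address: '127.0.0.1' });
const entry = reqLog.info({ statusCode: 200 }, 'Request completed');
console.log(entry.request_id); // abc123
```

Every call on `reqLog` carries `request_id` and `ip_address` automatically, which is exactly why handlers must use `req.log` rather than the global logger.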

### 3. Standardized Log Schema

All log messages should follow a base schema. The logger configuration will be updated to enforce this.

**Base Fields**: `level`, `timestamp`, `message`, `request_id`, `user_id`, `ip_address`

**Error Fields**: When logging an error, the log entry MUST include an `error` object with `name`, `message`, and `stack`.

### 4. Standardized Logging Practices

| Level | HTTP Status | Scenario |
| ----- | ----------- | -------------------------------------------- |
| DEBUG | Any | Request incoming, internal state, development info |
| INFO | 2xx | Successful requests, business events |
| WARN | 4xx | Client errors, validation failures, not found |
| ERROR | 5xx | Server errors, unhandled exceptions |
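
The level policy in the table can be captured in a tiny helper (a sketch; `levelForStatus` is a hypothetical name, not part of the codebase):

```javascript
// Pick the log level mandated by the table above from an HTTP status code.
function levelForStatus(statusCode) {
  if (statusCode >= 500) return 'error';
  if (statusCode >= 400) return 'warn';
  return 'info';
}

console.log(levelForStatus(404)); // warn
```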

## Implementation Details

### Logger Configuration

Located in `src/services/logger.server.ts`:

```typescript
import pino from 'pino';

const isProduction = process.env.NODE_ENV === 'production';
const isTest = process.env.NODE_ENV === 'test';

export const logger = pino({
  level: isProduction ? 'info' : 'debug',
  transport:
    isProduction || isTest
      ? undefined
      : {
          target: 'pino-pretty',
          options: {
            colorize: true,
            translateTime: 'SYS:standard',
            ignore: 'pid,hostname',
          },
        },
  redact: {
    paths: [
      'req.headers.authorization',
      'req.headers.cookie',
      '*.body.password',
      '*.body.newPassword',
      '*.body.currentPassword',
      '*.body.confirmPassword',
      '*.body.refreshToken',
      '*.body.token',
    ],
    censor: '[REDACTED]',
  },
});
```

### Request Logger Middleware

Located in `server.ts`:

```typescript
import { randomUUID } from 'crypto';
import type { Request, Response, NextFunction } from 'express';
import { logger } from './services/logger.server';

// Converts a process.hrtime() start tuple into elapsed milliseconds
const getDurationInMilliseconds = (start: [number, number]): number => {
  const [seconds, nanoseconds] = process.hrtime(start);
  return seconds * 1000 + nanoseconds / 1e6;
};

const requestLogger = (req: Request, res: Response, next: NextFunction) => {
  const requestId = randomUUID();
  const user = req.user as UserProfile | undefined; // UserProfile: app session type
  const start = process.hrtime();

  // Create request-scoped logger
  req.log = logger.child({
    request_id: requestId,
    user_id: user?.user.user_id,
    ip_address: req.ip,
  });

  req.log.debug({ method: req.method, originalUrl: req.originalUrl }, 'INCOMING');

  res.on('finish', () => {
    const duration = getDurationInMilliseconds(start);
    const { statusCode, statusMessage } = res;
    const logDetails: Record<string, unknown> = {
      user_id: (req.user as UserProfile | undefined)?.user.user_id,
      method: req.method,
      originalUrl: req.originalUrl,
      statusCode,
      statusMessage,
      duration: duration.toFixed(2),
    };

    // Include request details for failed requests (for debugging)
    if (statusCode >= 400) {
      logDetails.req = { headers: req.headers, body: req.body };
    }

    if (statusCode >= 500) req.log.error(logDetails, 'Request completed with server error');
    else if (statusCode >= 400) req.log.warn(logDetails, 'Request completed with client error');
    else req.log.info(logDetails, 'Request completed successfully');
  });

  next();
};

app.use(requestLogger);
```

### TypeScript Support

The `req.log` property is typed via declaration merging in `src/types/express.d.ts`:

```typescript
import { Logger } from 'pino';

declare global {
  namespace Express {
    export interface Request {
      log: Logger;
    }
  }
}
```

### Automatic Sensitive Data Redaction

The Pino logger automatically redacts sensitive fields:

```json
// Before redaction
{
  "body": {
    "email": "user@example.com",
    "password": "secret123",
    "newPassword": "newsecret456"
  }
}

// After redaction (in logs)
{
  "body": {
    "email": "user@example.com",
    "password": "[REDACTED]",
    "newPassword": "[REDACTED]"
  }
}
```
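
The idea behind the redaction config can be sketched in a few lines (Pino's real implementation compiles the wildcard paths into fast accessors; this illustration only handles a flat list of key names):

```javascript
// Recursively replace any value whose key is in `keys` with a censor string.
function redact(value, keys, censor = '[REDACTED]') {
  if (value === null || typeof value !== 'object') return value;
  const out = Array.isArray(value) ? [] : {};
  for (const [k, v] of Object.entries(value)) {
    out[k] = keys.includes(k) ? censor : redact(v, keys, censor);
  }
  return out;
}

const safe = redact(
  { body: { email: 'user@example.com', password: 'secret123' } },
  ['password', 'token']
);
console.log(safe.body.password); // [REDACTED]
```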

### Service Layer Logging

Services accept the request-scoped logger as an optional parameter:

```typescript
export async function registerUser(email: string, password: string, reqLog?: Logger) {
  const log = reqLog || logger; // Fall back to global logger

  log.info({ email }, 'Registering new user');
  // ... implementation

  log.debug({ userId: user.user_id }, 'User created successfully');
  return user;
}

// In route handler
router.post('/register', async (req, res, next) => {
  await authService.registerUser(req.body.email, req.body.password, req.log);
});
```

### Log Output Format

**Development** (pino-pretty):

```text
[2026-01-09 12:34:56.789] INFO (request_id=abc123): Request completed successfully
    method: "GET"
    originalUrl: "/api/users"
    statusCode: 200
    duration: "45.23"
```

**Production** (JSON):

```json
{
  "level": 30,
  "time": 1704812096789,
  "request_id": "abc123",
  "user_id": "user_456",
  "ip_address": "192.168.1.1",
  "method": "GET",
  "originalUrl": "/api/users",
  "statusCode": 200,
  "duration": "45.23",
  "msg": "Request completed successfully"
}
```

## Consequences

### Positive

- **Enhanced Observability**: Every log line from a single request can be instantly grouped and analyzed, dramatically speeding up debugging
- **Improved Security**: Centralizing the addition of context (like `user_id`) reduces the chance of developers manually logging sensitive data
- **Scalable Log Management**: Consistent JSON logs are easily ingested and indexed by any modern log aggregation tool
- **Clearer Code**: Removes the need to manually pass contextual information (like user ID) down to service functions just for logging purposes

### Negative

- **Refactoring Effort**: Requires adding the `requestLogger` middleware and refactoring all routes and services to use `req.log` instead of the global `logger`
- **Slight Performance Overhead**: Creating a child logger for every request adds a minor performance cost, though this is negligible for most modern logging libraries

## Key Files

- `src/services/logger.server.ts` - Pino logger configuration
- `src/services/logger.client.ts` - Client-side logger (for frontend)
- `src/types/express.d.ts` - TypeScript declaration for `req.log`
- `server.ts` - Request logger middleware

## References

- [ADR-017: Structured Logging with Pino](ADR-017-structured-logging-with-pino.md)
- [ADR-001: Standardized Error Handling](ADR-001-standardized-error-handling.md) - Error handler uses `req.log` for error logging
- [ADR-028: Client-Side Structured Logging](ADR-028-client-side-structured-logging.md) - Client-side logging strategy
- [Pino Documentation](https://getpino.io/#/)
docs/adr/ADR-028-client-side-structured-logging.md

# ADR-028: Standardized Client-Side Structured Logging

**Date**: 2026-02-10

**Status**: Accepted

**Source**: Imported from flyer-crawler project (ADR-026)

**Related**: [ADR-027](ADR-027-application-wide-structured-logging.md), [ADR-029](ADR-029-error-tracking-with-bugsink.md)

## Context

Following the standardization of backend logging in ADR-027, it is clear that our frontend components also require a consistent logging strategy. Currently, components either use `console.log` directly or a simple wrapper, but without a formal standard, this can lead to inconsistent log formats and difficulty in debugging user-facing issues.

While the frontend does not have the concept of a "request-scoped" logger, the principles of structured, context-rich logging are equally important for:

1. **Effective Debugging**: Understanding the state of a component or the sequence of user interactions that led to an error
2. **Integration with Monitoring Tools**: Sending structured logs to services like Sentry/Bugsink or LogRocket allows for powerful analysis and error tracking in production
3. **Clean Test Outputs**: Uncontrolled logging can pollute test runner output, making it difficult to spot actual test failures

An existing client-side logger at `src/services/logger.client.ts` provides a simple, structured logging interface. This ADR formalizes its use as the application standard.

## Decision

We will adopt a standardized, application-wide structured logging policy for all client-side (React) code.

### 1. Mandatory Use of the Global Client Logger

All frontend components, hooks, and services **MUST** use the global logger singleton exported from `src/services/logger.client.ts`. Direct use of `console.log`, `console.error`, etc., is discouraged.

### 2. Pino-like API for Structured Logging

The client logger mimics the `pino` API, which is the standard on the backend. It supports two primary call signatures:

- `logger.info('A simple message');`
- `logger.info({ key: 'value' }, 'A message with a structured data payload');`

The second signature, which includes a data object as the first argument, is **strongly preferred**, especially for logging errors or complex state.
### 3. Mocking in Tests
|
||||
|
||||
All Jest/Vitest tests for components or hooks that use the logger **MUST** mock the `src/services/logger.client.ts` module. This prevents logs from appearing in test output and allows for assertions that the logger was called correctly.
|
||||
|
||||
## Implementation
|
||||
|
||||
### Client Logger Service

Located in `src/services/logger.client.ts`:

```typescript
type LogLevel = 'debug' | 'info' | 'warn' | 'error';

interface LoggerOptions {
  level?: LogLevel;
  enabled?: boolean;
}

const LOG_LEVELS: Record<LogLevel, number> = {
  debug: 0,
  info: 1,
  warn: 2,
  error: 3,
};

class ClientLogger {
  private level: LogLevel;
  private enabled: boolean;

  constructor(options: LoggerOptions = {}) {
    this.level = options.level ?? 'info';
    this.enabled = options.enabled ?? import.meta.env.DEV;
  }

  private shouldLog(level: LogLevel): boolean {
    return this.enabled && LOG_LEVELS[level] >= LOG_LEVELS[this.level];
  }

  private formatMessage(data: object | string, message?: string): string {
    if (typeof data === 'string') {
      return data;
    }
    const payload = JSON.stringify(data, null, 2);
    return message ? `${message}\n${payload}` : payload;
  }

  debug(data: object | string, message?: string): void {
    if (this.shouldLog('debug')) {
      console.debug(`[DEBUG] ${this.formatMessage(data, message)}`);
    }
  }

  info(data: object | string, message?: string): void {
    if (this.shouldLog('info')) {
      console.info(`[INFO] ${this.formatMessage(data, message)}`);
    }
  }

  warn(data: object | string, message?: string): void {
    if (this.shouldLog('warn')) {
      console.warn(`[WARN] ${this.formatMessage(data, message)}`);
    }
  }

  error(data: object | string, message?: string): void {
    if (this.shouldLog('error')) {
      console.error(`[ERROR] ${this.formatMessage(data, message)}`);
    }
  }
}

export const logger = new ClientLogger({
  level: import.meta.env.DEV ? 'debug' : 'warn',
  enabled: true,
});
```
### Example Usage

**Logging an Error in a Component:**

```typescript
// In a React component or hook
import { logger } from '../services/logger.client';
import { notifyError } from '../services/notificationService';

const fetchData = async () => {
  try {
    const data = await apiClient.getData();
    return data;
  } catch (err) {
    // Log the full error object for context, along with a descriptive message.
    logger.error({ err }, 'Failed to fetch component data');
    notifyError('Something went wrong. Please try again.');
  }
};
```

**Logging State Changes:**

```typescript
// In a Zustand store or state hook
import { logger } from '../services/logger.client';

const useAuthStore = create((set) => ({
  login: async (credentials) => {
    logger.info({ email: credentials.email }, 'User login attempt');
    try {
      const user = await authService.login(credentials);
      logger.info({ userId: user.id }, 'User logged in successfully');
      set({ user, isAuthenticated: true });
    } catch (error) {
      logger.error({ error }, 'Login failed');
      throw error;
    }
  },
}));
```
### Mocking the Logger in Tests

```typescript
// In a *.test.tsx file
import { vi } from 'vitest';
// Because of the vi.mock below, this import resolves to the mocked singleton.
import { logger } from '../services/logger.client';

// Mock the logger at the top of the test file
vi.mock('../services/logger.client', () => ({
  logger: {
    info: vi.fn(),
    warn: vi.fn(),
    error: vi.fn(),
    debug: vi.fn(),
  },
}));

describe('MyComponent', () => {
  beforeEach(() => {
    vi.clearAllMocks(); // Clear mocks between tests
  });

  it('should log an error when fetching fails', async () => {
    // ... test setup to make fetch fail ...

    // Assert that the logger was called with the expected structure
    expect(logger.error).toHaveBeenCalledWith(
      expect.objectContaining({ err: expect.any(Error) }),
      'Failed to fetch component data',
    );
  });
});
```
## Integration with Error Tracking

When using Sentry/Bugsink for error tracking (see ADR-029), the client logger can be extended to send logs as breadcrumbs:

```typescript
import * as Sentry from '@sentry/react';

class ClientLogger {
  // ... existing implementation

  error(data: object | string, message?: string): void {
    if (this.shouldLog('error')) {
      console.error(`[ERROR] ${this.formatMessage(data, message)}`);
    }

    // Add to Sentry breadcrumbs for error context
    Sentry.addBreadcrumb({
      category: 'log',
      level: 'error',
      message: typeof data === 'string' ? data : message,
      data: typeof data === 'object' ? data : undefined,
    });
  }
}
```

## Consequences

### Positive

- **Consistency**: All client-side logs will have a predictable structure, making them easier to read and parse
- **Debuggability**: Errors logged with a full object (`{ err }`) capture the stack trace and other properties, which is invaluable for debugging
- **Testability**: Components that log are easier to test without polluting CI/CD output. We can also assert that logging occurs when expected
- **Future-Proof**: If we later decide to send client-side logs to a remote service, we only need to modify the central `logger.client.ts` file instead of every component
- **Error Tracking Integration**: Logs can be used as breadcrumbs in Sentry/Bugsink for better error context

### Negative

- **Minor Boilerplate**: Requires importing the logger in every file that needs it and mocking it in every corresponding test file. However, this is a small and consistent effort
- **Production Noise**: Care must be taken to configure appropriate log levels in production to avoid performance impact

## Key Files

- `src/services/logger.client.ts` - Client-side logger implementation
- `src/services/logger.server.ts` - Backend logger (for reference)

## References

- [ADR-027: Application-Wide Structured Logging](ADR-027-application-wide-structured-logging.md)
- [ADR-029: Error Tracking with Bugsink](ADR-029-error-tracking-with-bugsink.md)
- [Pino Documentation](https://getpino.io/#/)
389
docs/adr/ADR-029-error-tracking-with-bugsink.md
Normal file
@@ -0,0 +1,389 @@
# ADR-029: Error Tracking and Observability with Bugsink

**Date**: 2026-02-10

**Status**: Accepted

**Source**: Imported from flyer-crawler project (ADR-015)

**Related**: [ADR-027](ADR-027-application-wide-structured-logging.md), [ADR-028](ADR-028-client-side-structured-logging.md), [ADR-030](ADR-030-postgresql-function-observability.md), [ADR-032](ADR-032-application-performance-monitoring.md)

## Context

While ADR-027 established structured logging with Pino, the application lacks a high-level, aggregated view of its health and errors. It is difficult to spot trends, identify recurring issues, or be proactively notified of new types of errors.

Key requirements:

1. **Self-hosted**: No external SaaS dependencies for error tracking
2. **Sentry SDK compatible**: Leverage mature, well-documented SDKs
3. **Lightweight**: Minimal resource overhead in the dev container
4. **Production-ready**: Same architecture works on bare-metal production servers
5. **AI-accessible**: MCP server integration for Claude Code and other AI tools

**Note**: Application Performance Monitoring (APM) and distributed tracing are covered separately in [ADR-032](ADR-032-application-performance-monitoring.md).

## Decision

We implement a self-hosted error tracking stack using **Bugsink** as the Sentry-compatible backend, with the following components:

### 1. Error Tracking Backend: Bugsink

**Bugsink** is a lightweight, self-hosted Sentry alternative that:

- Runs as a single process (no Kafka, Redis, or ClickHouse required)
- Is fully compatible with Sentry SDKs
- Supports ARM64 and AMD64 architectures
- Can use SQLite (dev) or PostgreSQL (production)

**Deployment**:

- **Dev container**: Installed as a systemd service inside the container
- **Production**: Runs as a systemd service on bare metal, listening on localhost only
- **Database**: Uses PostgreSQL with a dedicated `bugsink` user and `bugsink` database (same PostgreSQL instance as the main application)

### 2. Backend Integration: @sentry/node

The Express backend integrates the `@sentry/node` SDK to:

- Capture unhandled exceptions before PM2/process manager restarts
- Report errors with full stack traces and context
- Integrate with the Pino logger for breadcrumbs
- Filter errors by severity (only 5xx errors sent by default)

### 3. Frontend Integration: @sentry/react

The React frontend integrates the `@sentry/react` SDK to:

- Wrap the app in an Error Boundary for graceful error handling
- Capture unhandled JavaScript errors
- Report errors with component stack traces
- Filter out browser extension errors
- **Frontend Error Correlation**: The global API client intercepts 4xx/5xx responses and can attach the `x-request-id` header to Sentry scope for correlation with backend logs
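The request-id correlation can be sketched as a thin wrapper around `fetch`. This is a hedged sketch, not the project's actual API client: `reportTag` stands in for whatever error-tracking call attaches a tag (for example, a Sentry scope tag).

```typescript
// Hedged sketch: correlate failed API calls with backend logs via x-request-id.
// `reportTag` is a stand-in for e.g. Sentry scope tagging; the project's real
// API client is not shown here.
type FetchLike = (url: string) => Promise<Response>;

export async function apiFetch(
  url: string,
  fetchImpl: FetchLike,
  reportTag: (key: string, value: string) => void,
): Promise<Response> {
  const res = await fetchImpl(url);
  if (!res.ok) {
    // The backend echoes its request id in this header; attaching it lets a
    // frontend error be matched to the corresponding backend log line.
    const requestId = res.headers.get('x-request-id');
    if (requestId) {
      reportTag('request_id', requestId);
    }
  }
  return res;
}
```

On success the wrapper adds no overhead; only error responses trigger the tagging callback.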
### 4. Log Aggregation: Logstash

**Logstash** parses application and infrastructure logs, forwarding error patterns to Bugsink:

- **Installation**: Installed inside the dev container (and on bare-metal prod servers)
- **Inputs**:
  - Pino JSON logs from the Node.js application (PM2 managed)
  - Redis logs (connection errors, memory warnings, slow commands)
  - PostgreSQL function logs (via `fn_log()` - see ADR-030)
  - NGINX access/error logs
- **Filter**: Identifies error-level logs (5xx responses, unhandled exceptions, Redis errors)
- **Output**: Sends to Bugsink via the Sentry-compatible HTTP API

This provides a secondary error capture path for:

- Errors that occur before Sentry SDK initialization
- Log-based errors that do not throw exceptions
- Redis connection/performance issues
- Database function errors and slow queries
- Historical error analysis from log files

### 5. MCP Server Integration: bugsink-mcp

For AI tool integration (Claude Code, Cursor, etc.), we use the open-source [bugsink-mcp](https://github.com/j-shelfwood/bugsink-mcp) server:

- **No code changes required**: Configurable via environment variables
- **Capabilities**: List projects, get issues, view events, get stacktraces, manage releases
- **Configuration**:
  - `BUGSINK_URL`: Points to the Bugsink instance (`http://localhost:8000` for dev, `https://bugsink.example.com` for prod)
  - `BUGSINK_API_TOKEN`: API token from Bugsink (created via a Django management command)
  - `BUGSINK_ORG_SLUG`: Organization identifier (usually "sentry")
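An MCP client configuration along these lines can wire the three variables together (a hedged sketch: the file name, `npx` invocation, and exact schema vary by tool and are assumptions, not taken from the bugsink-mcp docs):

```json
{
  "mcpServers": {
    "bugsink": {
      "command": "npx",
      "args": ["-y", "bugsink-mcp"],
      "env": {
        "BUGSINK_URL": "http://localhost:8000",
        "BUGSINK_API_TOKEN": "<token>",
        "BUGSINK_ORG_SLUG": "sentry"
      }
    }
  }
}
```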
## Architecture

```text
+---------------------------------------------------------------------------+
|                    Dev Container / Production Server                      |
+---------------------------------------------------------------------------+
|                                                                           |
|   +------------------+         +------------------+                       |
|   |     Frontend     |         |     Backend      |                       |
|   |     (React)      |         |    (Express)     |                       |
|   |  @sentry/react   |         |  @sentry/node    |                       |
|   +--------+---------+         +--------+---------+                       |
|            |                            |                                 |
|            |    Sentry SDK Protocol     |                                 |
|            +------------+---------------+                                 |
|                         |                                                 |
|                         v                                                 |
|             +----------------------+                                      |
|             |       Bugsink        |                                      |
|             |   (localhost:8000)   |<------------------+                  |
|             |                      |                   |                  |
|             |  PostgreSQL backend  |                   |                  |
|             +----------------------+                   |                  |
|                                                        |                  |
|             +----------------------+                   |                  |
|             |       Logstash       |-------------------+                  |
|             |   (Log Aggregator)   |   Sentry Output                      |
|             |                      |                                      |
|             |  Inputs:             |                                      |
|             |   - PM2/Pino logs    |                                      |
|             |   - Redis logs       |                                      |
|             |   - PostgreSQL logs  |                                      |
|             |   - NGINX logs       |                                      |
|             +----------------------+                                      |
|               ^       ^       ^       ^                                   |
|               |       |       |       |                                   |
|     +---------+       |       |       +----------+                        |
|     |                 |       |                  |                        |
| +---+------+   +------+---+  ++----------+  +---+------+                  |
| |   PM2    |   |  Redis   |  | PostgreSQL|  |  NGINX   |                  |
| |   Logs   |   |  Logs    |  |   Logs    |  |  Logs    |                  |
| +----------+   +----------+  +-----------+  +----------+                  |
|                                                                           |
| +----------------------+                                                  |
| |      PostgreSQL      |                                                  |
| |  +----------------+  |                                                  |
| |  |  app_database  |  |  (main app database)                             |
| |  +----------------+  |                                                  |
| |  |    bugsink     |  |  (error tracking database)                       |
| |  +----------------+  |                                                  |
| +----------------------+                                                  |
|                                                                           |
+---------------------------------------------------------------------------+

External (Developer Machine):
+------------------------------------------+
|      Claude Code / Cursor / VS Code      |
|  +------------------------------------+  |
|  |            bugsink-mcp             |  |
|  |            (MCP Server)            |  |
|  |                                    |  |
|  |  BUGSINK_URL=http://localhost:8000 |  |
|  |  BUGSINK_API_TOKEN=...             |  |
|  |  BUGSINK_ORG_SLUG=...              |  |
|  +------------------------------------+  |
+------------------------------------------+
```
## Implementation Details

### Environment Variables

| Variable             | Description                      | Default (Dev)              |
| -------------------- | -------------------------------- | -------------------------- |
| `SENTRY_DSN`         | Sentry-compatible DSN (backend)  | Set after project creation |
| `VITE_SENTRY_DSN`    | Sentry-compatible DSN (frontend) | Set after project creation |
| `SENTRY_ENVIRONMENT` | Environment name                 | `development`              |
| `SENTRY_DEBUG`       | Enable debug logging             | `false`                    |
| `SENTRY_ENABLED`     | Enable/disable error reporting   | `true`                     |

### PostgreSQL Setup

```sql
-- Create dedicated Bugsink database and user
CREATE USER bugsink WITH PASSWORD 'bugsink_dev_password';
CREATE DATABASE bugsink OWNER bugsink;
GRANT ALL PRIVILEGES ON DATABASE bugsink TO bugsink;
```

### Bugsink Configuration

```bash
# Environment variables for Bugsink service
SECRET_KEY=<random-50-char-string>
DATABASE_URL=postgresql://bugsink:bugsink_dev_password@localhost:5432/bugsink
BASE_URL=http://localhost:8000
PORT=8000
```
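One way to generate the random 50-character `SECRET_KEY` (any cryptographically random source works; `openssl` here is one possible choice, not a Bugsink requirement):

```shell
# Emit 50 characters of base64-encoded random data for SECRET_KEY
openssl rand -base64 48 | tr -d '\n' | cut -c1-50
```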
### Backend Sentry Integration

Located in `src/services/sentry.server.ts`:

```typescript
import * as Sentry from '@sentry/node';
import { config } from '../config/env';

export function initSentry() {
  if (!config.sentry.enabled || !config.sentry.dsn) {
    return;
  }

  Sentry.init({
    dsn: config.sentry.dsn,
    environment: config.sentry.environment || config.server.nodeEnv,
    debug: config.sentry.debug,

    // Performance monitoring - disabled by default (see ADR-032)
    tracesSampleRate: 0,

    // Filter out 4xx errors - only report server errors
    beforeSend(event) {
      const statusCode = event.contexts?.response?.status_code;
      if (statusCode && statusCode >= 400 && statusCode < 500) {
        return null;
      }
      return event;
    },
  });
}

// Set user context after authentication
export function setUserContext(user: { id: string; email: string; name?: string }) {
  Sentry.setUser({
    id: user.id,
    email: user.email,
    username: user.name,
  });
}

// Clear user context on logout
export function clearUserContext() {
  Sentry.setUser(null);
}
```
### Frontend Sentry Integration

Located in `src/services/sentry.client.ts`:

```typescript
import * as Sentry from '@sentry/react';
import { config } from '../config';

export function initSentry() {
  if (!config.sentry.enabled || !config.sentry.dsn) {
    return;
  }

  Sentry.init({
    dsn: config.sentry.dsn,
    environment: config.sentry.environment,

    // Performance monitoring - disabled by default (see ADR-032)
    tracesSampleRate: 0,

    // Filter out browser extension errors
    beforeSend(event) {
      // Ignore errors from browser extensions
      if (
        event.exception?.values?.[0]?.stacktrace?.frames?.some((frame) =>
          frame.filename?.includes('extension://'),
        )
      ) {
        return null;
      }
      return event;
    },
  });
}

// Set user context after login
export function setUserContext(user: { id: string; email: string; name?: string }) {
  Sentry.setUser({
    id: user.id,
    email: user.email,
    username: user.name,
  });
}

// Clear user context on logout
export function clearUserContext() {
  Sentry.setUser(null);
}
```
### Error Boundary Component

Located in `src/components/ErrorBoundary.tsx`:

```typescript
import * as Sentry from '@sentry/react';
import { Component, ErrorInfo, ReactNode } from 'react';

interface Props {
  children: ReactNode;
  fallback?: ReactNode;
}

interface State {
  hasError: boolean;
}

export class ErrorBoundary extends Component<Props, State> {
  constructor(props: Props) {
    super(props);
    this.state = { hasError: false };
  }

  static getDerivedStateFromError(): State {
    return { hasError: true };
  }

  componentDidCatch(error: Error, errorInfo: ErrorInfo) {
    Sentry.withScope((scope) => {
      scope.setExtras({ componentStack: errorInfo.componentStack });
      Sentry.captureException(error);
    });
  }

  render() {
    if (this.state.hasError) {
      return (
        this.props.fallback || (
          <div className="error-boundary">
            <h1>Something went wrong</h1>
            <p>Please refresh the page or contact support.</p>
          </div>
        )
      );
    }

    return this.props.children;
  }
}
```
### Logstash Pipeline Configuration

Key routing for log sources:

| Source          | Bugsink Project |
| --------------- | --------------- |
| Backend (Pino)  | Backend API     |
| Worker (Pino)   | Backend API     |
| PostgreSQL logs | Backend API     |
| Vite logs       | Infrastructure  |
| Redis logs      | Infrastructure  |
| NGINX logs      | Infrastructure  |
| Frontend errors | Frontend        |

## Consequences

### Positive

- **Full observability**: Aggregated view of errors and trends
- **Self-hosted**: No external SaaS dependencies or subscription costs
- **SDK compatibility**: Leverages mature Sentry SDKs with excellent documentation
- **AI integration**: MCP server enables Claude Code to query and analyze errors
- **Unified architecture**: Same setup works in the dev container and production
- **Lightweight**: Bugsink runs in a single process, unlike full Sentry (16GB+ RAM)
- **Error correlation**: Request IDs allow correlation between frontend errors and backend logs

### Negative

- **Additional services**: Bugsink and Logstash add complexity to the container
- **PostgreSQL overhead**: Additional database for error tracking
- **Initial setup**: Requires configuration of multiple components
- **Logstash learning curve**: Pipeline configuration requires Logstash knowledge

## Alternatives Considered

1. **Full Sentry self-hosted**: Rejected due to complexity (Kafka, Redis, ClickHouse, 16GB+ RAM minimum)
2. **GlitchTip**: Considered, but Bugsink is lighter weight and easier to deploy
3. **Sentry SaaS**: Rejected due to the self-hosted requirement
4. **Custom error aggregation**: Rejected in favor of the proven Sentry SDK ecosystem

## References

- [Bugsink Documentation](https://www.bugsink.com/docs/)
- [Bugsink Docker Install](https://www.bugsink.com/docs/docker-install/)
- [@sentry/node Documentation](https://docs.sentry.io/platforms/javascript/guides/node/)
- [@sentry/react Documentation](https://docs.sentry.io/platforms/javascript/guides/react/)
- [bugsink-mcp](https://github.com/j-shelfwood/bugsink-mcp)
- [Logstash Reference](https://www.elastic.co/guide/en/logstash/current/index.html)
- [ADR-030: PostgreSQL Function Observability](ADR-030-postgresql-function-observability.md)
- [ADR-032: Application Performance Monitoring](ADR-032-application-performance-monitoring.md)
336
docs/adr/ADR-030-postgresql-function-observability.md
Normal file
@@ -0,0 +1,336 @@
# ADR-030: PostgreSQL Function Observability

**Date**: 2026-02-10

**Status**: Accepted

**Source**: Imported from flyer-crawler project (ADR-050)

**Related**: [ADR-029](ADR-029-error-tracking-with-bugsink.md), [ADR-027](ADR-027-application-wide-structured-logging.md)

## Context

Applications often use PostgreSQL functions and triggers for business logic, including:

- Data transformations and validations
- Complex query encapsulation
- Trigger-based side effects
- Audit logging

**Current Problem**: These database functions can fail silently in several ways:

1. **`ON CONFLICT DO NOTHING`** - Swallows constraint violations without notification
2. **`IF NOT FOUND THEN RETURN;`** - Silently exits when data is missing
3. **Trigger functions returning `NULL`** - No indication of partial failures
4. **No logging inside functions** - No visibility into function execution

When these silent failures occur:

- The application layer receives no error (the function "succeeds" but does nothing)
- No logs are generated for debugging
- Issues are only discovered when users report missing data
- Root cause analysis is extremely difficult
**Example of Silent Failure**:

```sql
-- This function silently does nothing if the record doesn't exist
CREATE OR REPLACE FUNCTION public.process_item(p_user_id UUID, p_item_name TEXT)
RETURNS void
LANGUAGE plpgsql
AS $$
DECLARE
  v_item_id BIGINT;
BEGIN
  SELECT item_id INTO v_item_id FROM items WHERE name = p_item_name;
  IF v_item_id IS NULL THEN
    RETURN; -- Silent failure - no log, no error
  END IF;
  -- ...
END;
$$;
```

ADR-029 established Logstash + Bugsink for error tracking, with PostgreSQL log integration. This ADR defines the implementation.
## Decision

We will implement a standardized PostgreSQL function observability strategy with three tiers of logging severity.

### 1. Function Logging Helper

Create a reusable logging function that outputs structured JSON to the PostgreSQL logs:

```sql
-- Function to emit structured log messages from PL/pgSQL
CREATE OR REPLACE FUNCTION public.fn_log(
  p_level TEXT,                 -- 'DEBUG', 'INFO', 'NOTICE', 'WARNING', 'ERROR'
  p_function_name TEXT,         -- The calling function name
  p_message TEXT,               -- Human-readable message
  p_context JSONB DEFAULT NULL  -- Additional context (user_id, params, etc.)
)
RETURNS void
LANGUAGE plpgsql
AS $$
DECLARE
  log_line TEXT;
BEGIN
  -- Build structured JSON log line
  log_line := jsonb_build_object(
    'timestamp', now(),
    'level', p_level,
    'source', 'postgresql',
    'function', p_function_name,
    'message', p_message,
    'context', COALESCE(p_context, '{}'::jsonb)
  )::text;

  -- Use appropriate RAISE level
  CASE p_level
    WHEN 'DEBUG' THEN RAISE DEBUG '%', log_line;
    WHEN 'INFO' THEN RAISE INFO '%', log_line;
    WHEN 'NOTICE' THEN RAISE NOTICE '%', log_line;
    WHEN 'WARNING' THEN RAISE WARNING '%', log_line;
    WHEN 'ERROR' THEN RAISE LOG '%', log_line; -- Use LOG for errors to ensure capture
    ELSE RAISE NOTICE '%', log_line;
  END CASE;
END;
$$;
```
### 2. Logging Tiers

#### Tier 1: Critical Functions (Always Log)

Functions where silent failure causes data corruption or user-facing issues:

| Function Type                | Log Events                              |
| ---------------------------- | --------------------------------------- |
| User creation/management     | User creation, profile creation, errors |
| Permission/role changes      | Role not found, permission denied       |
| Financial transactions       | Transaction not found, balance issues   |
| Data approval workflows      | Record not found, permission denied     |
| Critical business operations | Items added, operations completed       |

**Pattern**:
```sql
CREATE OR REPLACE FUNCTION public.process_critical_operation(p_user_id UUID, p_operation_name TEXT)
RETURNS void
LANGUAGE plpgsql
AS $$
DECLARE
  v_operation_id BIGINT;
  v_context JSONB;
BEGIN
  v_context := jsonb_build_object('user_id', p_user_id, 'operation_name', p_operation_name);

  SELECT operation_id INTO v_operation_id
  FROM public.operations WHERE name = p_operation_name;

  IF v_operation_id IS NULL THEN
    -- Log the issue instead of returning silently
    PERFORM fn_log('WARNING', 'process_critical_operation',
      'Operation not found: ' || p_operation_name, v_context);
    RETURN;
  END IF;

  -- Perform operation
  INSERT INTO public.user_operations (user_id, operation_id)
  VALUES (p_user_id, v_operation_id)
  ON CONFLICT (user_id, operation_id) DO NOTHING;

  -- FOUND is true only if a row was actually inserted (not skipped by ON CONFLICT)
  IF FOUND THEN
    PERFORM fn_log('INFO', 'process_critical_operation',
      'Operation completed: ' || p_operation_name, v_context);
  END IF;
END;
$$;
```
#### Tier 2: Business Logic Functions (Log on Anomalies)

Functions where unexpected conditions should be logged but are not critical:

| Function Type               | Log Events                       |
| --------------------------- | -------------------------------- |
| Search/suggestion functions | No match found (below threshold) |
| Recommendation engines      | No recommendations generated     |
| Data lookup functions       | Empty results, no matches found  |
| Price/analytics queries     | No data available, stale data    |

**Pattern**: Log when results are unexpectedly empty or inputs are invalid.
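A Tier 2 anomaly log might look like this (a sketch with hypothetical table and function names; in PL/pgSQL, `RETURN QUERY` sets `FOUND`, which lets the function log an empty result without changing its behavior):

```sql
CREATE OR REPLACE FUNCTION public.suggest_items(p_query TEXT)
RETURNS SETOF public.items
LANGUAGE plpgsql
AS $$
BEGIN
  RETURN QUERY
    SELECT * FROM public.items WHERE name ILIKE '%' || p_query || '%';

  -- Anomaly: a user query produced no suggestions at all
  IF NOT FOUND THEN
    PERFORM fn_log('NOTICE', 'suggest_items',
      'No suggestions found', jsonb_build_object('query', p_query));
  END IF;
END;
$$;
```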
#### Tier 3: Triggers (Log Errors Only)

Triggers should be fast, so only log when something goes wrong:

| Trigger Type          | Log Events                   |
| --------------------- | ---------------------------- |
| Audit triggers        | Failed to update audit trail |
| Aggregation triggers  | Calculation failed           |
| Cascade triggers      | Related record lookup failed |
| Notification triggers | External service call failed |
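A Tier 3 trigger can wrap its body in an exception handler so a failure is logged without aborting the triggering statement (a sketch with hypothetical names; swallowing the error is only appropriate for non-critical side effects like statistics):

```sql
CREATE OR REPLACE FUNCTION public.trg_update_item_stats()
RETURNS trigger
LANGUAGE plpgsql
AS $$
BEGIN
  UPDATE public.item_stats
  SET purchase_count = purchase_count + 1
  WHERE item_id = NEW.item_id;
  RETURN NEW;
EXCEPTION WHEN OTHERS THEN
  -- Log the failure but let the original INSERT/UPDATE proceed
  PERFORM fn_log('ERROR', 'trg_update_item_stats',
    SQLERRM, jsonb_build_object('item_id', NEW.item_id));
  RETURN NEW;
END;
$$;
```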
### 3. PostgreSQL Configuration

Enable logging in `postgresql.conf`:

```ini
# Log all function notices and above
log_min_messages = notice

# Include timestamp, PID, user, and database in the log prefix
log_line_prefix = '%t [%p] %u@%d '

# Log to file for Logstash pickup
logging_collector = on
log_directory = '/var/log/postgresql'
log_filename = 'postgresql-%Y-%m-%d.log'
log_rotation_age = 1d
log_rotation_size = 100MB

# Capture slow queries from functions
log_min_duration_statement = 1000  # Log queries over 1 second
```
### 4. Logstash Integration

Update the Logstash pipeline (extends the ADR-029 configuration):

```conf
# PostgreSQL function log input
input {
  file {
    path => "/var/log/postgresql/*.log"
    type => "postgres"
    tags => ["postgres"]
    start_position => "beginning"
    sincedb_path => "/var/lib/logstash/sincedb_postgres"
  }
}

filter {
  if [type] == "postgres" {
    # Extract timestamp and process ID from the PostgreSQL log prefix
    grok {
      match => { "message" => "%{TIMESTAMP_ISO8601:pg_timestamp} \[%{POSINT:pg_pid}\] %{USER:pg_user}@%{WORD:pg_database} %{GREEDYDATA:pg_message}" }
    }

    # Check if this is a structured JSON log from fn_log()
    if [pg_message] =~ /^\{.*"source":"postgresql".*\}$/ {
      json {
        source => "pg_message"
        target => "fn_log"
      }

      # Mark as error if level is WARNING or ERROR
      if [fn_log][level] in ["WARNING", "ERROR"] {
        mutate { add_tag => ["error", "db_function"] }
      }
    }

    # Also catch native PostgreSQL errors
    if [pg_message] =~ /^ERROR:/ or [pg_message] =~ /^FATAL:/ {
      mutate { add_tag => ["error", "postgres_native"] }
    }
  }
}

output {
  if "error" in [tags] and "postgres" in [tags] {
    http {
      url => "http://localhost:8000/api/store/"
      http_method => "post"
      format => "json"
    }
  }
}
```
### 5. Dual-File Update Requirement
|
||||
|
||||
**IMPORTANT**: All SQL function changes must be applied to BOTH files:
|
||||
|
||||
1. `sql/Initial_triggers_and_functions.sql` - Used for incremental updates
|
||||
2. `sql/master_schema_rollup.sql` - Used for fresh database setup
|
||||
|
||||
Both files must remain in sync for triggers and functions.
|
||||
|
||||
## Implementation Steps

1. **Create `fn_log()` helper function**:
   - Add to both SQL files
   - Test with `SELECT fn_log('INFO', 'test', 'Test message', '{"key": "value"}'::jsonb);`

2. **Update Tier 1 critical functions** (highest priority):
   - Identify functions with silent failures
   - Add appropriate logging calls
   - Test error paths

3. **Update Tier 2 business logic functions**:
   - Add anomaly logging to suggestion/recommendation functions
   - Log empty result sets with context

4. **Update Tier 3 trigger functions**:
   - Add error-only logging to critical triggers
   - Wrap complex trigger logic in exception handlers

5. **Configure PostgreSQL logging**:
   - Update `postgresql.conf` in dev container
   - Update production PostgreSQL configuration
   - Verify logs appear in expected location

6. **Update Logstash pipeline**:
   - Add PostgreSQL input to Logstash config
   - Add filter rules for structured JSON extraction
   - Test end-to-end: function log -> Logstash -> Bugsink

7. **Verify in Bugsink**:
   - Confirm database function errors appear as issues
   - Verify context (user_id, function name, params) is captured
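
The `fn_log()` helper itself is not shown in this section; a minimal sketch, assuming the JSON log shape from the Examples section and the signature used in the step 1 test call:

```sql
-- Hedged sketch of fn_log(); the signature matches the test call in step 1,
-- and the JSON shape matches the Examples section. Details such as mapping
-- p_level onto RAISE severities are assumptions, not the final implementation.
CREATE OR REPLACE FUNCTION fn_log(
    p_level TEXT,
    p_function TEXT,
    p_message TEXT,
    p_context JSONB DEFAULT '{}'::jsonb
) RETURNS VOID AS $$
BEGIN
    -- Emit one structured JSON line; Logstash detects it via the
    -- "source":"postgresql" marker (see the filter in section 4).
    RAISE LOG '%', jsonb_build_object(
        'timestamp', now(),
        'level', p_level,
        'source', 'postgresql',
        'function', p_function,
        'message', p_message,
        'context', p_context
    )::text;
END;
$$ LANGUAGE plpgsql;
```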

## Consequences

### Positive

- **Visibility**: Silent failures become visible in error tracking
- **Debugging**: Function execution context captured for root cause analysis
- **Proactive detection**: Anomalies logged before users report issues
- **Unified monitoring**: Database errors appear alongside application errors in Bugsink
- **Structured logs**: JSON format enables filtering and aggregation

### Negative

- **Performance overhead**: Logging adds latency to function execution
- **Log volume**: Tier 1/2 functions may generate significant log volume
- **Maintenance**: Two SQL files must be kept in sync
- **PostgreSQL configuration**: Requires access to `postgresql.conf`

### Mitigations

- **Performance**: Only log meaningful events, not every function call
- **Log volume**: Use appropriate log levels; Logstash filters reduce noise
- **Sync**: Add CI check to verify SQL files match for function definitions
- **Configuration**: Document PostgreSQL settings in deployment runbook
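
The sync mitigation above can be sketched as a small CI shell check. Comparing function *names* only (not bodies) is a simplifying assumption; the file paths are the ones named in this ADR:

```shell
# Sketch of a CI guard for the dual-file requirement: compare the set of
# function names declared in the two SQL files. Name-only comparison is a
# simplification; a stricter check would diff the full definitions.
extract_function_names() {
  grep -iEo 'create (or replace )?function [a-zA-Z0-9_]+' "$1" |
    awk '{print tolower($NF)}' | sort -u
}

check_sql_sync() {
  a=$(extract_function_names "$1")
  b=$(extract_function_names "$2")
  [ "$a" = "$b" ]
}

# In CI:
# check_sql_sync sql/Initial_triggers_and_functions.sql sql/master_schema_rollup.sql \
#   || { echo "SQL files out of sync" >&2; exit 1; }
```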

## Examples

### Before (Silent Failure)

```sql
-- User thinks operation completed, but it silently failed
SELECT process_item('user-uuid', 'Nonexistent Item');
-- Returns: void (no error, no log)
-- Result: User never gets expected result, nobody knows why
```

### After (Observable Failure)

```sql
SELECT process_item('user-uuid', 'Nonexistent Item');
-- Returns: void
-- PostgreSQL log: {"timestamp":"2026-01-11T10:30:00Z","level":"WARNING","source":"postgresql","function":"process_item","message":"Item not found: Nonexistent Item","context":{"user_id":"user-uuid","item_name":"Nonexistent Item"}}
-- Bugsink: New issue created with full context
```

## References

- [ADR-029: Error Tracking with Bugsink](ADR-029-error-tracking-with-bugsink.md)
- [ADR-027: Application-Wide Structured Logging](ADR-027-application-wide-structured-logging.md)
- [PostgreSQL RAISE Documentation](https://www.postgresql.org/docs/current/plpgsql-errors-and-messages.html)
- [PostgreSQL Logging Configuration](https://www.postgresql.org/docs/current/runtime-config-logging.html)
262
docs/adr/ADR-031-granular-debug-logging-strategy.md
Normal file
@@ -0,0 +1,262 @@

# ADR-031: Granular Debug Logging Strategy

**Date**: 2026-02-10

**Status**: Accepted

**Source**: Imported from flyer-crawler project (ADR-052)

**Related**: [ADR-027](ADR-027-application-wide-structured-logging.md), [ADR-017](ADR-017-structured-logging-with-pino.md)

## Context

Global log levels (INFO vs DEBUG) are too coarse. Developers need to inspect detailed debug information for specific subsystems (e.g., `ai-service`, `db-pool`, `auth-service`) without being flooded by logs from the entire application.

When debugging a specific feature:

- Setting `LOG_LEVEL=debug` globally produces too much noise
- Manually adding/removing debug statements is error-prone
- No standard way to enable targeted debugging in production

## Decision

We will adopt a namespace-based debug filter pattern, similar to the `debug` npm package, but integrated into our Pino logger.

1. **Logger Namespaces**: Every service/module logger must be initialized with a `module` property (e.g., `logger.child({ module: 'ai-service' })`).
2. **Environment Filter**: We will support a `DEBUG_MODULES` environment variable that overrides the log level for matching modules.

## Implementation

### Core Implementation

Implemented in `src/services/logger.server.ts`:

```typescript
import pino from 'pino';

// Parse DEBUG_MODULES from environment
const debugModules = (process.env.DEBUG_MODULES || '').split(',').map((s) => s.trim());

// Base logger configuration
export const logger = pino({
  level: process.env.LOG_LEVEL || (process.env.NODE_ENV === 'production' ? 'info' : 'debug'),
  // ... other configuration
});

/**
 * Creates a scoped logger for a specific module.
 * If DEBUG_MODULES includes this module or '*', debug level is enabled.
 */
export const createScopedLogger = (moduleName: string) => {
  // If DEBUG_MODULES contains the module name or "*", force level to 'debug'
  const isDebugEnabled = debugModules.includes('*') || debugModules.includes(moduleName);

  // Note: in pino v7+ the child level is an option, not a binding
  return logger.child(
    { module: moduleName },
    { level: isDebugEnabled ? 'debug' : logger.level },
  );
};
```
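
The filter rule is easiest to unit-test when pulled out as a pure function. This is an illustrative sketch, not part of the real `logger.server.ts`:

```typescript
// Illustrative pure-function version of the DEBUG_MODULES matching rule.
// Not part of the real logger.server.ts; extracted here so it can be tested.
export function isModuleDebugEnabled(debugModulesEnv: string, moduleName: string): boolean {
  const modules = debugModulesEnv
    .split(',')
    .map((s) => s.trim())
    .filter(Boolean); // drop the empty entry produced by an unset variable
  return modules.includes('*') || modules.includes(moduleName);
}
```

With `DEBUG_MODULES=ai-service,auth-service`, only those two modules log at debug level; with an empty value, nothing is force-enabled.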

### Service Usage Examples

```typescript
// src/services/aiService.server.ts
import { createScopedLogger } from './logger.server';

const logger = createScopedLogger('ai-service');

export async function processWithAI(data: unknown) {
  logger.debug({ data }, 'Starting AI processing');
  // ... implementation
  logger.info({ result }, 'AI processing completed');
}
```

```typescript
// src/services/authService.server.ts
import { createScopedLogger } from './logger.server';

const logger = createScopedLogger('auth-service');

export async function validateToken(token: string) {
  logger.debug({ tokenLength: token.length }, 'Validating token');
  // ... implementation
}
```

### Module Naming Convention

Use kebab-case, typically suffixed with `-service` or `-worker`:

| Module Name     | Purpose                          | File                                  |
| --------------- | -------------------------------- | ------------------------------------- |
| `ai-service`    | AI/external API interactions     | `src/services/aiService.server.ts`    |
| `auth-service`  | Authentication and authorization | `src/services/authService.server.ts`  |
| `db-pool`       | Database connection pooling      | `src/services/database.server.ts`     |
| `cache-service` | Redis/caching operations         | `src/services/cacheService.server.ts` |
| `queue-worker`  | Background job processing        | `src/workers/queueWorker.ts`          |
| `email-service` | Email sending                    | `src/services/emailService.server.ts` |

## Usage

### Enable Debug Logging for Specific Modules

To debug only AI and authentication:

```bash
DEBUG_MODULES=ai-service,auth-service npm run dev
```

### Enable All Debug Logging

Use the wildcard (quoted, so the shell does not glob-expand it) to enable debug logging for all modules:

```bash
DEBUG_MODULES='*' npm run dev
```

### Development Environment

In `.env.development`:

```bash
# Enable debug logging for specific modules during development
DEBUG_MODULES=ai-service
```

### Production Troubleshooting

Temporarily enable debug logging for a specific subsystem:

```bash
# SSH into production server
ssh root@example.com

# Set the variable and restart (--update-env makes PM2 pick up the change)
DEBUG_MODULES=ai-service pm2 restart app-api --update-env

# View logs
pm2 logs app-api --lines 100

# Disable debug logging (restart with the variable cleared)
DEBUG_MODULES= pm2 restart app-api --update-env
```

### With PM2 Configuration

In `ecosystem.config.js`:

```javascript
module.exports = {
  apps: [
    {
      name: 'app-api',
      script: 'dist/server.js',
      env: {
        NODE_ENV: 'production',
        // DEBUG_MODULES is unset by default
      },
      env_debug: {
        NODE_ENV: 'production',
        DEBUG_MODULES: 'ai-service,auth-service',
      },
    },
  ],
};
```

Start with debug logging:

```bash
pm2 start ecosystem.config.js --env debug
```

## Best Practices

### 1. Use Scoped Loggers for Long-Running Services

Services with complex workflows or external API calls should use `createScopedLogger` to allow targeted debugging:

```typescript
const logger = createScopedLogger('payment-service');

export async function processPayment(payment: Payment) {
  logger.debug({ paymentId: payment.id }, 'Starting payment processing');

  try {
    const result = await externalPaymentAPI.process(payment);
    logger.debug({ result }, 'External API response');
    return result;
  } catch (error) {
    logger.error({ error, paymentId: payment.id }, 'Payment processing failed');
    throw error;
  }
}
```

### 2. Use Child Loggers for Contextual Data

Even within scoped loggers, create child loggers with job/request-specific context:

```typescript
const logger = createScopedLogger('queue-worker');

async function processJob(job: Job) {
  const jobLogger = logger.child({ jobId: job.id, jobName: job.name });

  jobLogger.debug('Starting job processing');
  // ... processing
  jobLogger.info('Job completed successfully');
}
```

### 3. Consistent Debug Message Patterns

Use consistent patterns for debug messages:

```typescript
// Function entry
logger.debug({ params: sanitizedParams }, 'Function entry: processOrder');

// External API calls
logger.debug({ url, method }, 'External API request');
logger.debug({ statusCode, duration }, 'External API response');

// State changes
logger.debug({ before, after }, 'State transition');

// Decision points
logger.debug({ condition, result }, 'Branch decision');
```

### 4. Production Usage Guidelines

- `DEBUG_MODULES` can be set in production for temporary debugging
- It should not be left enabled continuously due to increased log volume
- Always unset it after troubleshooting is complete
- Monitor log storage while debug logging is enabled

## Consequences

### Positive

- Developers can inspect detailed logs for specific subsystems without log flooding
- Production debugging becomes more targeted and efficient
- No performance impact when debug logging is disabled
- Compatible with existing Pino logging infrastructure
- Follows the familiar pattern of the `debug` npm package

### Negative

- Requires developers to know module names (mitigated by documentation)
- Not all services have adopted scoped loggers yet (gradual migration)
- Additional configuration complexity

## References

- [ADR-027: Application-Wide Structured Logging](ADR-027-application-wide-structured-logging.md)
- [ADR-017: Structured Logging with Pino](ADR-017-structured-logging-with-pino.md)
- [debug npm package](https://www.npmjs.com/package/debug) - Inspiration for the namespace pattern
- [Pino Child Loggers](https://getpino.io/#/docs/child-loggers)

263
docs/adr/ADR-032-application-performance-monitoring.md
Normal file
@@ -0,0 +1,263 @@

# ADR-032: Application Performance Monitoring (APM)

**Date**: 2026-02-10

**Status**: Proposed

**Source**: Imported from flyer-crawler project (ADR-056)

**Related**: [ADR-029](ADR-029-error-tracking-with-bugsink.md) (Error Tracking with Bugsink)

## Context

Application Performance Monitoring (APM) provides visibility into application behavior through:

- **Distributed Tracing**: Track requests across services, queues, and database calls
- **Performance Metrics**: Response times, throughput, error rates
- **Resource Monitoring**: Memory usage, CPU, database connections
- **Transaction Analysis**: Identify slow endpoints and bottlenecks

While ADR-029 covers error tracking and observability, APM is a distinct concern focused on performance rather than errors. The Sentry SDK supports APM through its tracing features, but this capability is currently **intentionally disabled** in our application.

### Current State

The Sentry SDK is installed and configured for error tracking (see ADR-029), but APM features are disabled:

```typescript
// src/services/sentry.client.ts
Sentry.init({
  dsn: config.sentry.dsn,
  environment: config.sentry.environment,
  // Performance monitoring - disabled for now to keep it simple
  tracesSampleRate: 0,
  // ...
});
```

```typescript
// src/services/sentry.server.ts
Sentry.init({
  dsn: config.sentry.dsn,
  environment: config.sentry.environment || config.server.nodeEnv,
  // Performance monitoring - disabled for now to keep it simple
  tracesSampleRate: 0,
  // ...
});
```

### Why APM is Currently Disabled

1. **Complexity**: APM adds overhead and complexity to debugging
2. **Bugsink limitations**: Bugsink's APM support is less mature than its error tracking
3. **Resource overhead**: Tracing adds memory and CPU overhead
4. **Focus**: Error tracking provides more immediate value at our current scale
5. **Cost**: High sample rates can significantly increase storage requirements

## Decision

We propose a **staged approach** to APM implementation:

### Phase 1: Selective Backend Tracing (Low Priority)

Enable tracing for specific high-value operations:

```typescript
// Enable tracing for specific transactions only
Sentry.init({
  dsn: config.sentry.dsn,
  tracesSampleRate: 0, // Keep default at 0

  // Trace only specific high-value transactions
  tracesSampler: (samplingContext) => {
    const transactionName = samplingContext.transactionContext?.name;

    // Sample long-running jobs
    if (transactionName?.includes('job-processing')) {
      return 0.1; // 10% sample rate
    }

    // Sample AI/external API calls more heavily
    if (transactionName?.includes('external-api')) {
      return 0.5; // 50% sample rate
    }

    // Trace slow endpoints (determined by custom logic)
    if (samplingContext.parentSampled) {
      return 0.1; // 10% for child transactions
    }

    return 0; // Don't trace other transactions
  },
});
```

### Phase 2: Custom Performance Metrics

Add custom metrics without full tracing overhead:

```typescript
// Custom metric for slow database queries
import { metrics } from '@sentry/node';

// In repository methods
const startTime = performance.now();
const result = await pool.query(sql, params);
const duration = performance.now() - startTime;

metrics.distribution('db.query.duration', duration, {
  tags: { query_type: 'select', table: 'users' },
});

if (duration > 1000) {
  logger.warn({ duration, sql }, 'Slow query detected');
}
```

### Phase 3: Full APM Integration (Future)

When/if full APM is needed:

```typescript
Sentry.init({
  dsn: config.sentry.dsn,
  tracesSampleRate: 0.1, // 10% of transactions
  profilesSampleRate: 0.1, // 10% of traced transactions get profiled

  integrations: [
    // Database tracing
    Sentry.postgresIntegration(),
    // Redis tracing
    Sentry.redisIntegration(),
    // BullMQ job tracing (custom integration)
  ],
});
```

## Implementation Steps

### To Enable Basic APM

1. **Update Sentry Configuration**:
   - Set `tracesSampleRate` > 0 in `src/services/sentry.server.ts`
   - Set `tracesSampleRate` > 0 in `src/services/sentry.client.ts`
   - Add environment variable `SENTRY_TRACES_SAMPLE_RATE` (default: 0)

2. **Add Instrumentation**:
   - Enable automatic Express instrumentation
   - Add manual spans for BullMQ job processing
   - Add database query instrumentation

3. **Frontend Tracing**:
   - Add the Browser Tracing integration
   - Configure page load and navigation tracing

4. **Environment Variables**:

   ```bash
   SENTRY_TRACES_SAMPLE_RATE=0.1 # 10% sampling
   SENTRY_PROFILES_SAMPLE_RATE=0 # Profiling disabled
   ```

5. **Bugsink Configuration**:
   - Verify Bugsink supports performance data ingestion
   - Configure retention policies for performance data

### Configuration Changes Required

```typescript
// src/config/env.ts - Add new config
sentry: {
  dsn: env.SENTRY_DSN,
  environment: env.SENTRY_ENVIRONMENT,
  debug: env.SENTRY_DEBUG === 'true',
  tracesSampleRate: parseFloat(env.SENTRY_TRACES_SAMPLE_RATE || '0'),
  profilesSampleRate: parseFloat(env.SENTRY_PROFILES_SAMPLE_RATE || '0'),
},
```

```typescript
// src/services/sentry.server.ts - Updated init
Sentry.init({
  dsn: config.sentry.dsn,
  environment: config.sentry.environment,
  tracesSampleRate: config.sentry.tracesSampleRate,
  profilesSampleRate: config.sentry.profilesSampleRate,
  // ... rest of config
});
```
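
`parseFloat` accepts garbage and out-of-range values. A slightly more defensive parser (a sketch, not existing code in `env.ts`) would fall back on bad input and clamp to the [0, 1] range sample rates use:

```typescript
// Hedged sketch: parse a sample-rate environment variable, falling back
// to a default when unset or unparseable, and clamping to [0, 1].
export function parseSampleRate(raw: string | undefined, fallback = 0): number {
  const value = Number.parseFloat(raw ?? '');
  if (Number.isNaN(value)) return fallback;
  return Math.min(1, Math.max(0, value));
}
```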

## Trade-offs

### Enabling APM

**Benefits**:

- Identify performance bottlenecks
- Track distributed transactions across services
- Profile slow endpoints
- Monitor resource utilization trends

**Costs**:

- Increased memory usage (~5-15% overhead)
- Additional CPU for trace processing
- Increased storage in Bugsink/Sentry
- More complex debugging (noise in traces)
- Potential latency from tracing overhead

### Keeping APM Disabled

**Benefits**:

- Simpler operation and debugging
- Lower resource overhead
- Focused on error tracking (higher priority)
- No additional storage costs

**Costs**:

- No automated performance insights
- Manual profiling required for bottleneck detection
- Limited visibility into slow transactions

## Alternatives Considered

1. **OpenTelemetry**: More vendor-neutral, but adds another dependency and complexity
2. **Prometheus + Grafana**: Good for metrics, but doesn't provide distributed tracing
3. **Jaeger/Zipkin**: Purpose-built for tracing, but requires additional infrastructure
4. **New Relic/Datadog SaaS**: Full-featured but conflicts with the self-hosted requirement

## Current Recommendation

**Keep APM disabled** (`tracesSampleRate: 0`) until:

1. Specific performance issues are identified that require tracing
2. Bugsink's APM support is verified and tested
3. Infrastructure can support the additional overhead
4. There is a clear business need for performance visibility

When enabling APM becomes necessary, start with Phase 1 (selective tracing) to minimize overhead while gaining targeted insights.

## Consequences

### Positive (When Implemented)

- Automated identification of slow endpoints
- Distributed trace visualization across async operations
- Correlation between errors and performance issues
- Proactive alerting on performance degradation

### Negative

- Additional infrastructure complexity
- Storage overhead for trace data
- Potential performance impact from tracing itself
- Learning curve for trace analysis

## References

- [Sentry Performance Monitoring](https://docs.sentry.io/product/performance/)
- [@sentry/node Performance](https://docs.sentry.io/platforms/javascript/guides/node/performance/)
- [@sentry/react Performance](https://docs.sentry.io/platforms/javascript/guides/react/performance/)
- [OpenTelemetry](https://opentelemetry.io/) (alternative approach)
- [ADR-029: Error Tracking with Bugsink](ADR-029-error-tracking-with-bugsink.md)

340
docs/adr/ADR-033-bugsink-gitea-issue-sync.md
Normal file
@@ -0,0 +1,340 @@

# ADR-033: Bugsink to Gitea Issue Synchronization

**Date**: 2026-02-10

**Status**: Proposed

**Source**: Imported from flyer-crawler project (ADR-054)

**Related**: [ADR-029](ADR-029-error-tracking-with-bugsink.md), [ADR-012](ADR-012-bullmq-background-job-processing.md)

## Context

The application uses Bugsink (Sentry-compatible self-hosted error tracking) to capture runtime errors across multiple projects:

| Project Type   | Environment  | Description                              |
| -------------- | ------------ | ---------------------------------------- |
| Backend        | Production   | Main API server errors                   |
| Backend        | Test/Staging | Pre-production API errors                |
| Frontend       | Production   | Client-side JavaScript errors            |
| Frontend       | Test/Staging | Pre-production frontend errors           |
| Infrastructure | Production   | Infrastructure-level errors (Redis, PM2) |
| Infrastructure | Test/Staging | Pre-production infrastructure errors     |

Currently, errors remain in Bugsink until manually reviewed. There is no automated workflow to:

1. Create trackable tickets for errors
2. Assign errors to developers
3. Track resolution progress
4. Prevent errors from being forgotten

## Decision

Implement an automated background worker that synchronizes unresolved Bugsink issues to Gitea as trackable tickets. The sync worker will:

1. **Run only on the test/staging server** (not production, not the dev container)
2. **Poll all Bugsink projects** for unresolved issues
3. **Create Gitea issues** with full error context
4. **Mark synced issues as resolved** in Bugsink (to prevent re-polling)
5. **Track sync state in Redis** to ensure idempotency

### Why Test/Staging Only?

- The sync worker is a background service that needs API tokens for both Bugsink and Gitea
- Running on test/staging provides a single sync point without duplicating infrastructure
- All Bugsink projects (including production) are synced from this one worker
- The production server stays focused on serving users, not running sync jobs

## Architecture

### Component Overview

```
+-----------------------------------------------------------------------+
|                          TEST/STAGING SERVER                          |
|                                                                       |
|  +------------------+   +------------------+   +-------------------+  |
|  | BullMQ Queue     |-->| Sync Worker      |-->| Redis DB 15       |  |
|  | bugsink-sync     |   | (15min repeat)   |   | Sync State        |  |
|  +------------------+   +--------+---------+   +-------------------+  |
|                                  |                                    |
+----------------------------------+------------------------------------+
                                   |
                   +---------------+---------------+
                   v                               v
          +------------------+            +------------------+
          | Bugsink          |            | Gitea            |
          | (all projects)   |            | (1 repo)         |
          +------------------+            +------------------+
```

### Queue Configuration

| Setting         | Value                  | Rationale                                    |
| --------------- | ---------------------- | -------------------------------------------- |
| Queue Name      | `bugsink-sync`         | Follows existing naming pattern              |
| Repeat Interval | 15 minutes             | Balances responsiveness with API rate limits |
| Retry Attempts  | 3                      | Standard retry policy                        |
| Backoff         | Exponential (30s base) | Handles temporary API failures               |
| Concurrency     | 1                      | Serial processing prevents race conditions   |

### Redis Database Allocation

| Database | Usage               | Owner           |
| -------- | ------------------- | --------------- |
| 0        | BullMQ (Production) | Existing queues |
| 1        | BullMQ (Test)       | Existing queues |
| 2-14     | Reserved            | Future use      |
| 15       | Bugsink Sync State  | This feature    |

### Redis Key Schema

```
bugsink:synced:{bugsink_issue_id}
  +-- Value: JSON {
        gitea_issue_number: number,
        synced_at: ISO timestamp,
        project: string,
        title: string
      }
```
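
The key schema above can be captured in a couple of helpers. The field names follow the ADR; the helper names themselves are illustrative:

```typescript
// Helpers matching the Redis key schema above. Field names follow the ADR;
// the helper and type names are illustrative, not existing code.
interface SyncRecord {
  gitea_issue_number: number;
  synced_at: string; // ISO timestamp
  project: string;
  title: string;
}

export const syncKey = (bugsinkIssueId: string): string => `bugsink:synced:${bugsinkIssueId}`;

export const encodeSyncRecord = (record: SyncRecord): string => JSON.stringify(record);

// A missing key (Redis GET returning null) decodes to null: "never synced".
export const decodeSyncRecord = (raw: string | null): SyncRecord | null =>
  raw === null ? null : (JSON.parse(raw) as SyncRecord);
```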

### Gitea Labels

The following labels should be created in the repository:

| Label                | Color              | Purpose                            |
| -------------------- | ------------------ | ---------------------------------- |
| `bug:frontend`       | #e11d48 (Red)      | Frontend JavaScript/React errors   |
| `bug:backend`        | #ea580c (Orange)   | Backend Node.js/API errors         |
| `bug:infrastructure` | #7c3aed (Purple)   | Infrastructure errors (Redis, PM2) |
| `env:production`     | #dc2626 (Dark Red) | Production environment             |
| `env:test`           | #2563eb (Blue)     | Test/staging environment           |
| `env:development`    | #6b7280 (Gray)     | Development environment            |
| `source:bugsink`     | #10b981 (Green)    | Auto-synced from Bugsink           |

### Label Mapping

| Bugsink Project Type  | Bug Label          | Env Label      |
| --------------------- | ------------------ | -------------- |
| backend (prod)        | bug:backend        | env:production |
| backend (test)        | bug:backend        | env:test       |
| frontend (prod)       | bug:frontend       | env:production |
| frontend (test)       | bug:frontend       | env:test       |
| infrastructure (prod) | bug:infrastructure | env:production |
| infrastructure (test) | bug:infrastructure | env:test       |

All synced issues also receive the `source:bugsink` label.
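
Because the label names follow a fixed pattern, the mapping tables above reduce to a pure function. Type and function names here are illustrative:

```typescript
// Pure function expressing the two label tables above; every synced issue
// also receives source:bugsink. Type and function names are illustrative.
type ProjectType = 'backend' | 'frontend' | 'infrastructure';
type IssueEnv = 'production' | 'test' | 'development';

export function labelsFor(projectType: ProjectType, env: IssueEnv): string[] {
  return [`bug:${projectType}`, `env:${env}`, 'source:bugsink'];
}
```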

## Implementation Details

### New Files

| File                                   | Purpose                                     |
| -------------------------------------- | ------------------------------------------- |
| `src/services/bugsinkSync.server.ts`   | Core synchronization logic                  |
| `src/services/bugsinkClient.server.ts` | HTTP client for the Bugsink API             |
| `src/services/giteaClient.server.ts`   | HTTP client for the Gitea API               |
| `src/types/bugsink.ts`                 | TypeScript interfaces for Bugsink responses |
| `src/routes/admin/bugsink-sync.ts`     | Admin endpoints for manual trigger          |

### Modified Files

| File                             | Changes                               |
| -------------------------------- | ------------------------------------- |
| `src/services/queues.server.ts`  | Add `bugsinkSyncQueue` definition     |
| `src/services/workers.server.ts` | Add sync worker implementation        |
| `src/config/env.ts`              | Add bugsink sync configuration schema |
| `.env.example`                   | Document new environment variables    |

### Environment Variables

```bash
# Bugsink Configuration
BUGSINK_URL=https://bugsink.example.com
BUGSINK_API_TOKEN=... # Created via Django management command

# Gitea Configuration
GITEA_URL=https://gitea.example.com
GITEA_API_TOKEN=... # Personal access token with repo scope
GITEA_OWNER=org-name
GITEA_REPO=project-repo

# Sync Control
BUGSINK_SYNC_ENABLED=false # Set true only in the test environment
BUGSINK_SYNC_INTERVAL=15 # Minutes between sync runs
```

### Gitea Issue Template

```markdown
## Error Details

| Field        | Value           |
| ------------ | --------------- |
| **Type**     | {error_type}    |
| **Message**  | {error_message} |
| **Platform** | {platform}      |
| **Level**    | {level}         |

## Occurrence Statistics

- **First Seen**: {first_seen}
- **Last Seen**: {last_seen}
- **Total Occurrences**: {count}

## Request Context

- **URL**: {request_url}
- **Additional Context**: {context}

## Stacktrace

<details>
<summary>Click to expand</summary>

{stacktrace}

</details>

---

**Bugsink Issue**: {bugsink_url}
**Project**: {project_slug}
**Trace ID**: {trace_id}
```

### Sync Workflow

```
1. Worker triggered (every 15 min or manually)
2. For each Bugsink project:
   a. List issues with status='unresolved'
   b. For each issue:
      i.   Check Redis for an existing sync record
      ii.  If already synced -> skip
      iii. Fetch issue details + stacktrace
      iv.  Create the Gitea issue with labels
      v.   Store the sync record in Redis
      vi.  Mark the issue as 'resolved' in Bugsink
3. Log summary (synced: N, skipped: N, failed: N)
```
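
The loop above can be sketched with injected clients so the idempotency check is visible and testable in isolation. The `SyncClients` interface is an assumption for illustration, not the real Bugsink/Gitea API surface:

```typescript
// Sketch of the sync loop with injected clients. The SyncClients interface
// is an assumption for illustration, not the real Bugsink/Gitea APIs.
interface SyncIssue {
  id: string;
  title: string;
}

interface SyncClients {
  listUnresolved(project: string): Promise<SyncIssue[]>;
  createGiteaIssue(issue: SyncIssue): Promise<number>; // returns Gitea issue number
  markResolved(issueId: string): Promise<void>;
  getSyncRecord(issueId: string): Promise<string | null>;
  putSyncRecord(issueId: string, giteaNumber: number): Promise<void>;
}

export async function syncProject(project: string, clients: SyncClients) {
  const summary = { synced: 0, skipped: 0, failed: 0 };

  for (const issue of await clients.listUnresolved(project)) {
    try {
      // Steps b.i/b.ii: idempotency check before creating anything
      if (await clients.getSyncRecord(issue.id)) {
        summary.skipped += 1;
        continue;
      }
      // Steps b.iv-b.vi: create, record sync state, then stop re-polling
      const giteaNumber = await clients.createGiteaIssue(issue);
      await clients.putSyncRecord(issue.id, giteaNumber);
      await clients.markResolved(issue.id);
      summary.synced += 1;
    } catch {
      summary.failed += 1; // issue stays unresolved; the next run retries it
    }
  }

  return summary;
}
```

Running the worker twice over the same data should sync once and then do nothing, which is exactly what the Redis-before-Gitea ordering guarantees.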
|
||||
|
||||
### Idempotency Guarantees
|
||||
|
||||
1. **Redis check before creation**: Prevents duplicate Gitea issues
|
||||
2. **Atomic Redis write after Gitea create**: Ensures state consistency
|
||||
3. **Query only unresolved issues**: Resolved issues won't appear in polls
|
||||
4. **No TTL on Redis keys**: Permanent sync history
|
||||
|
||||
## Admin Interface

### Manual Sync Endpoint

```
POST /api/admin/bugsink/sync
Authorization: Bearer {admin_jwt}

Response:
{
  "success": true,
  "data": {
    "synced": 3,
    "skipped": 12,
    "failed": 0,
    "duration_ms": 2340
  }
}
```
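
For callers of this endpoint, the payload can be narrowed with a small type guard. The types below simply mirror the sample response above; they are an illustration, not a published client API:

```typescript
// Shape of the manual-sync response shown above, plus a guard that
// narrows unknown JSON (e.g. from fetch().json()) to that shape.
interface SyncResult {
  synced: number;
  skipped: number;
  failed: number;
  duration_ms: number;
}

interface SyncResponse {
  success: boolean;
  data: SyncResult;
}

function isSyncResponse(value: unknown): value is SyncResponse {
  if (typeof value !== 'object' || value === null) return false;
  const v = value as Record<string, unknown>;
  const data = v.data;
  if (typeof v.success !== 'boolean') return false;
  if (typeof data !== 'object' || data === null) return false;
  const d = data as Record<string, unknown>;
  return ['synced', 'skipped', 'failed', 'duration_ms'].every(
    (k) => typeof d[k] === 'number',
  );
}
```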

### Sync Status Endpoint

```
GET /api/admin/bugsink/sync/status
Authorization: Bearer {admin_jwt}

Response:
{
  "success": true,
  "data": {
    "enabled": true,
    "last_run": "2026-01-17T10:30:00Z",
    "next_run": "2026-01-17T10:45:00Z",
    "total_synced": 47,
    "projects": [
      { "slug": "backend-prod", "synced_count": 12 },
      ...
    ]
  }
}
```

## Implementation Phases

### Phase 1: Core Infrastructure

- Add environment variables to `env.ts` schema
- Create `BugsinkClient` service (HTTP client)
- Create `GiteaClient` service (HTTP client)
- Add Redis db 15 connection for sync tracking
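
The first Phase 1 item could look roughly like this. Apart from `BUGSINK_SYNC_INTERVAL` (documented earlier in this ADR), the variable name `BUGSINK_SYNC_ENABLED` and the dependency-free validation are assumptions, since the real `env.ts` schema is not shown here:

```typescript
// Hypothetical sketch of the new sync-related env variables.
interface BugsinkSyncEnv {
  BUGSINK_SYNC_ENABLED: boolean;  // assumed name; gates the worker
  BUGSINK_SYNC_INTERVAL: number;  // minutes between sync runs
}

function parseBugsinkSyncEnv(env: Record<string, string | undefined>): BugsinkSyncEnv {
  const interval = Number(env.BUGSINK_SYNC_INTERVAL ?? '15'); // default from this ADR
  if (!Number.isInteger(interval) || interval < 1) {
    throw new Error('BUGSINK_SYNC_INTERVAL must be a positive integer (minutes)');
  }
  return {
    BUGSINK_SYNC_ENABLED: env.BUGSINK_SYNC_ENABLED === 'true',
    BUGSINK_SYNC_INTERVAL: interval,
  };
}
```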

### Phase 2: Sync Logic

- Create `BugsinkSyncService` with sync logic
- Add `bugsink-sync` queue to `queues.server.ts`
- Add sync worker to `workers.server.ts`
- Create TypeScript types for API responses

### Phase 3: Integration

- Add admin endpoints for manual sync trigger
- Update CI/CD with new secrets
- Add secrets to repository settings
- Test end-to-end in staging environment

### Phase 4: Documentation

- Update CLAUDE.md with sync information
- Create operational runbook for sync issues

## Consequences

### Positive

1. **Visibility**: All application errors become trackable tickets
2. **Accountability**: Errors can be assigned to developers
3. **History**: Complete audit trail of when errors were discovered and resolved
4. **Integration**: Errors appear alongside feature work in Gitea
5. **Automation**: No manual error triage required

### Negative

1. **API Dependencies**: Requires both Bugsink and Gitea APIs to be available
2. **Token Management**: Additional secrets to manage in CI/CD
3. **Potential Noise**: High-frequency errors could create many tickets (mitigated by Bugsink's issue grouping)
4. **Single Point**: Sync only runs on test server (if test server is down, no sync occurs)

### Risks and Mitigations

| Risk                    | Mitigation                                        |
| ----------------------- | ------------------------------------------------- |
| Bugsink API rate limits | 15-minute polling interval                        |
| Gitea API rate limits   | Sequential processing with delays                 |
| Redis connection issues | Reuse existing connection patterns                |
| Duplicate issues        | Redis tracking + idempotent checks                |
| Missing stacktrace      | Graceful degradation (create issue without trace) |
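
The "sequential processing with delays" mitigation can be sketched as a one-at-a-time loop with a pause between requests; the helper names and delay value are illustrative:

```typescript
// Rate-limit-friendly processing: one Gitea API call in flight at a
// time, with a fixed pause between calls instead of concurrent requests.
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function processSequentially<T>(
  items: T[],
  handler: (item: T) => Promise<void>,
  delayMs: number,
): Promise<void> {
  for (const item of items) {
    await handler(item);  // single request in flight
    await sleep(delayMs); // breathing room for the remote API
  }
}
```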

## Future Enhancements

1. **Bi-directional sync**: Update Bugsink when Gitea issue is closed
2. **Smart deduplication**: Detect similar errors across projects
3. **Priority mapping**: High occurrence count -> high priority label
4. **Slack/Discord notifications**: Alert on new critical errors
5. **Metrics dashboard**: Track error trends over time

## References

- [ADR-012: BullMQ Background Job Processing](ADR-012-bullmq-background-job-processing.md)
- [ADR-029: Error Tracking with Bugsink](ADR-029-error-tracking-with-bugsink.md)
- [Bugsink API Documentation](https://bugsink.com/docs/api/)
- [Gitea API Documentation](https://docs.gitea.io/en-us/api-usage/)

---

`package-lock.json` (generated, 4 changes):

```diff
@@ -1,12 +1,12 @@
 {
   "name": "flyer-crawler",
-  "version": "0.14.0",
+  "version": "0.14.1",
   "lockfileVersion": 3,
   "requires": true,
   "packages": {
     "": {
       "name": "flyer-crawler",
-      "version": "0.14.0",
+      "version": "0.14.1",
       "dependencies": {
         "@bull-board/api": "^6.14.2",
         "@bull-board/express": "^6.14.2",
```

`package.json`:

```diff
@@ -1,7 +1,7 @@
 {
   "name": "flyer-crawler",
   "private": true,
-  "version": "0.14.0",
+  "version": "0.14.1",
   "type": "module",
   "scripts": {
     "dev": "concurrently \"npm:start:dev\" \"vite\"",
```

`useAppInitialization` hook:

```diff
@@ -44,7 +44,10 @@ export const useAppInitialization = () => {
   if (appVersion) {
     logger.info(`Application version: ${appVersion}`);
     const lastSeenVersion = localStorage.getItem('lastSeenVersion');
-    if (appVersion !== lastSeenVersion) {
+    const onboardingCompleted = localStorage.getItem('flyer_crawler_onboarding_completed');
+
+    // Only show "What's New" if onboarding tour has been completed
+    if (appVersion !== lastSeenVersion && onboardingCompleted === 'true') {
       openModal('whatsNew');
       localStorage.setItem('lastSeenVersion', appVersion);
     }
```