flyer-crawler.projectium.com/docs/adr/ADR-032-application-performance-monitoring.md
Torben Sorensen 4d323a51ca
2026-02-12 04:29:43 -08:00

ADR-032: Application Performance Monitoring (APM)

Date: 2026-02-10

Status: Proposed

Source: Imported from flyer-crawler project (ADR-056)

Related: ADR-029 (Error Tracking with Bugsink)

Context

Application Performance Monitoring (APM) provides visibility into application behavior through:

  • Distributed Tracing: Track requests across services, queues, and database calls
  • Performance Metrics: Response times, throughput, error rates
  • Resource Monitoring: Memory usage, CPU, database connections
  • Transaction Analysis: Identify slow endpoints and bottlenecks

ADR-029 covers error tracking and observability. APM is a distinct concern: it focuses on performance rather than errors. The Sentry SDK supports APM through its tracing features, but we have intentionally left this capability disabled in our application for now.

Current State

The Sentry SDK is installed and configured for error tracking (see ADR-029), but APM features are disabled:

// src/services/sentry.client.ts
Sentry.init({
  dsn: config.sentry.dsn,
  environment: config.sentry.environment,
  // Performance monitoring - disabled for now to keep it simple
  tracesSampleRate: 0,
  // ...
});

// src/services/sentry.server.ts
Sentry.init({
  dsn: config.sentry.dsn,
  environment: config.sentry.environment || config.server.nodeEnv,
  // Performance monitoring - disabled for now to keep it simple
  tracesSampleRate: 0,
  // ...
});

Why APM is Currently Disabled

  1. Complexity: Trace data adds noise and makes debugging more complex
  2. Bugsink Limitations: Bugsink's APM support is less mature than its error tracking
  3. Resource Overhead: Tracing adds memory and CPU overhead
  4. Focus: Error tracking provides more immediate value for our current scale
  5. Cost: High sample rates can significantly increase storage requirements

Decision

We propose a staged approach to APM implementation:

Phase 1: Selective Backend Tracing (Low Priority)

Enable tracing for specific high-value operations:

// Enable tracing for specific transactions only
Sentry.init({
  dsn: config.sentry.dsn,
  // Ignored while tracesSampler is defined (the sampler takes precedence);
  // kept at 0 so nothing is traced if the sampler is ever removed.
  tracesSampleRate: 0,

  // Trace only specific high-value transactions
  tracesSampler: (samplingContext) => {
    // SDK v8+ exposes the transaction name directly; on older SDKs use
    // samplingContext.transactionContext?.name instead.
    const transactionName = samplingContext.name;

    // Sample long-running jobs
    if (transactionName?.includes('job-processing')) {
      return 0.1; // 10% sample rate
    }

    // Sample AI/external API calls more heavily
    if (transactionName?.includes('external-api')) {
      return 0.5; // 50% sample rate
    }

    // Continue a trace whose parent transaction was already sampled
    if (samplingContext.parentSampled) {
      return 0.1; // 10% for child transactions
    }

    return 0; // Don't trace other transactions
  },
});
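
The sampling rules above are easier to review and unit-test when they live in a pure function that does not depend on the Sentry SDK. The following is a minimal sketch of that idea; `sampleRateFor` is a hypothetical helper name (not part of the codebase or of Sentry), and the name fragments and rates simply mirror the Phase 1 example.

```typescript
// Hypothetical helper: maps a transaction name (and whether the parent
// transaction was sampled) to a sample rate between 0 and 1.
// The name fragments and rates mirror the Phase 1 sketch; they are
// illustrative, not measured values.
function sampleRateFor(
  transactionName: string | undefined,
  parentSampled: boolean | undefined,
): number {
  if (transactionName?.includes('job-processing')) {
    return 0.1; // long-running jobs: 10%
  }
  if (transactionName?.includes('external-api')) {
    return 0.5; // AI/external API calls: 50%
  }
  if (parentSampled) {
    return 0.1; // continue traces whose parent was sampled: 10%
  }
  return 0; // everything else stays untraced
}
```

Assuming an SDK version that exposes the transaction name as `name` on the sampling context, the sampler could then delegate to it, for example `tracesSampler: (ctx) => sampleRateFor(ctx.name, ctx.parentSampled)`.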

Phase 2: Custom Performance Metrics

Add custom metrics without full tracing overhead:

// Custom metric for database query duration (also logs slow queries)
import { metrics } from '@sentry/node';

// In repository methods
const startTime = performance.now();
const result = await pool.query(sql, params);
const duration = performance.now() - startTime;

metrics.distribution('db.query.duration', duration, {
  tags: { query_type: 'select', table: 'users' },
});

if (duration > 1000) {
  logger.warn({ duration, sql }, 'Slow query detected');
}
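
Repeating the timing code in every repository method is easy to get wrong, so one option is a small wrapper that times any async operation and hands the duration to a recorder. This is only a sketch: `timed` and `DurationRecorder` are hypothetical names, the recorder is injected so the helper does not depend on any particular metrics API, and the default clock is `Date.now` (millisecond resolution) rather than `performance.now()`.

```typescript
// Hypothetical recorder signature; in the application it could forward to
// whatever metrics call is in use (for example the distribution call above).
type DurationRecorder = (
  name: string,
  durationMs: number,
  tags: Record<string, string>,
) => void;

// Runs `fn`, measures how long it took, and reports the duration even when
// `fn` rejects. The clock is injectable so the helper can be tested.
async function timed<T>(
  name: string,
  tags: Record<string, string>,
  record: DurationRecorder,
  fn: () => Promise<T>,
  now: () => number = () => Date.now(),
): Promise<T> {
  const start = now();
  try {
    return await fn();
  } finally {
    record(name, now() - start, tags);
  }
}
```

A repository method could then wrap its query as `timed('db.query.duration', { query_type: 'select', table: 'users' }, record, () => pool.query(sql, params))`, where `record` is the application's own adapter to its metrics backend.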

Phase 3: Full APM Integration (Future)

When/if full APM is needed:

Sentry.init({
  dsn: config.sentry.dsn,
  tracesSampleRate: 0.1, // 10% of transactions
  // 10% of traced transactions get profiled; profiling also needs
  // nodeProfilingIntegration() from @sentry/profiling-node
  profilesSampleRate: 0.1,

  integrations: [
    // Database tracing
    Sentry.postgresIntegration(),
    // Redis tracing
    Sentry.redisIntegration(),
    // BullMQ job tracing (custom integration)
  ],
});
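
The BullMQ job tracing mentioned above has no built-in integration in this configuration, so it would need manual spans. The sketch below shows one possible shape: a wrapper that runs a job processor inside a span. `withJobSpan`, `StartSpan`, `SpanContext`, and `JobLike` are hypothetical names. The span function is injected, on the assumption that it is backed by `Sentry.startSpan` (SDK v8+), and the `job-processing:` name prefix is chosen only so that it matches the Phase 1 sampler.

```typescript
// Subset of the span options this sketch relies on (assumption: compatible
// with Sentry.startSpan in SDK v8+).
type SpanContext = {
  name: string;
  op: string;
  attributes?: Record<string, string | number>;
};
type StartSpan = <T>(context: SpanContext, callback: () => Promise<T>) => Promise<T>;

// Minimal shape of a BullMQ job as used here.
interface JobLike {
  id?: string;
  name: string;
  queueName: string;
}

// Wraps a job processor so that each job runs inside its own span.
function withJobSpan<J extends JobLike, R>(
  startSpan: StartSpan,
  processor: (job: J) => Promise<R>,
): (job: J) => Promise<R> {
  return (job) =>
    startSpan(
      {
        name: `job-processing:${job.queueName}:${job.name}`,
        op: 'queue.process',
        attributes: { 'job.id': job.id ?? 'unknown', 'job.queue': job.queueName },
      },
      () => processor(job),
    );
}
```

In a worker this might be used as `new Worker(queueName, withJobSpan((ctx, cb) => Sentry.startSpan(ctx, cb), processJob))`, where `processJob` stands in for the existing processor function.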

Implementation Steps

To Enable Basic APM

  1. Update Sentry Configuration:

    • Set tracesSampleRate > 0 in src/services/sentry.server.ts
    • Set tracesSampleRate > 0 in src/services/sentry.client.ts
    • Add environment variable SENTRY_TRACES_SAMPLE_RATE (default: 0)

  2. Add Instrumentation:

    • Enable automatic Express instrumentation
    • Add manual spans for BullMQ job processing
    • Add database query instrumentation

  3. Frontend Tracing:

    • Add Browser Tracing integration
    • Configure page load and navigation tracing

  4. Environment Variables:

    SENTRY_TRACES_SAMPLE_RATE=0.1  # 10% sampling
    SENTRY_PROFILES_SAMPLE_RATE=0  # Profiling disabled

  5. Bugsink Configuration:

    • Verify Bugsink supports performance data ingestion
    • Configure retention policies for performance data

Configuration Changes Required

// src/config/env.ts - Add new config
sentry: {
  dsn: env.SENTRY_DSN,
  environment: env.SENTRY_ENVIRONMENT,
  debug: env.SENTRY_DEBUG === 'true',
  tracesSampleRate: parseFloat(env.SENTRY_TRACES_SAMPLE_RATE || '0'),
  profilesSampleRate: parseFloat(env.SENTRY_PROFILES_SAMPLE_RATE || '0'),
},

// src/services/sentry.server.ts - Updated init
Sentry.init({
  dsn: config.sentry.dsn,
  environment: config.sentry.environment,
  tracesSampleRate: config.sentry.tracesSampleRate,
  profilesSampleRate: config.sentry.profilesSampleRate,
  // ... rest of config
});
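
Note that `parseFloat` returns `NaN` for a malformed value and accepts values outside the 0 to 1 range, either of which would reach `Sentry.init` unchanged. A small validation helper could guard against that. This is a sketch only: `parseSampleRate` is a hypothetical name, and silently falling back and clamping (rather than failing at startup) is an assumption about the desired behavior.

```typescript
// Hypothetical helper: parses a sample rate from an environment variable.
// Missing or non-numeric input yields `fallback`; numeric input is clamped
// to the valid range [0, 1].
function parseSampleRate(raw: string | undefined, fallback = 0): number {
  const value = Number.parseFloat(raw ?? '');
  if (!Number.isFinite(value)) {
    return fallback;
  }
  return Math.min(1, Math.max(0, value));
}
```

The config entries would then read `tracesSampleRate: parseSampleRate(env.SENTRY_TRACES_SAMPLE_RATE)` and likewise for the profiling rate.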

Trade-offs

Enabling APM

Benefits:

  • Identify performance bottlenecks
  • Track distributed transactions across services
  • Profile slow endpoints
  • Monitor resource utilization trends

Costs:

  • Increased memory usage (~5-15% overhead)
  • Additional CPU for trace processing
  • Increased storage in Bugsink/Sentry
  • More complex debugging (noise in traces)
  • Potential latency from tracing overhead

Keeping APM Disabled

Benefits:

  • Simpler operation and debugging
  • Lower resource overhead
  • Focused on error tracking (higher priority)
  • No additional storage costs

Costs:

  • No automated performance insights
  • Manual profiling required for bottleneck detection
  • Limited visibility into slow transactions

Alternatives Considered

  1. OpenTelemetry: More vendor-neutral, but adds another dependency and complexity
  2. Prometheus + Grafana: Good for metrics, but doesn't provide distributed tracing
  3. Jaeger/Zipkin: Purpose-built for tracing, but requires additional infrastructure
  4. New Relic/Datadog SaaS: Full-featured but conflicts with self-hosted requirement

Current Recommendation

Keep APM disabled (tracesSampleRate: 0) until:

  1. Specific performance issues are identified that require tracing
  2. Bugsink's APM support is verified and tested
  3. Infrastructure can support the additional overhead
  4. There is a clear business need for performance visibility

When enabling APM becomes necessary, start with Phase 1 (selective tracing) to minimize overhead while gaining targeted insights.

Consequences

Positive (When Implemented)

  • Automated identification of slow endpoints
  • Distributed trace visualization across async operations
  • Correlation between errors and performance issues
  • Proactive alerting on performance degradation

Negative

  • Additional infrastructure complexity
  • Storage overhead for trace data
  • Potential performance impact from tracing itself
  • Learning curve for trace analysis

References