
ADR-030: Graceful Degradation and Circuit Breaker Pattern

Date: 2026-01-09

Status: Proposed

Context

The application depends on several external services:

  1. AI Services (Google Gemini) - For flyer item extraction
  2. Redis - For caching, rate limiting, and job queues
  3. PostgreSQL - Primary data store
  4. Geocoding APIs - For location services

Currently, when these services fail:

  • AI failures may cause the entire upload to fail
  • Redis unavailability could crash the application or bypass rate limiting
  • No circuit breakers prevent repeated calls to failing services
  • No fallback behaviors are defined

This creates fragility where a single service outage can cascade into application-wide failures.

Decision

We will implement a graceful degradation strategy with circuit breakers for external service dependencies.

1. Circuit Breaker Pattern

Implement circuit breakers for external service calls using a library like opossum:

import CircuitBreaker from 'opossum';

const aiCircuitBreaker = new CircuitBreaker(callAiService, {
  timeout: 30000, // 30 second timeout
  errorThresholdPercentage: 50, // Open circuit at 50% failures
  resetTimeout: 30000, // Try again after 30 seconds
  volumeThreshold: 5, // Minimum calls before calculating error %
});

aiCircuitBreaker.on('open', () => {
  logger.warn('AI service circuit breaker opened');
});

aiCircuitBreaker.on('halfOpen', () => {
  logger.info('AI service circuit breaker half-open, testing...');
});
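
A sketch of how the wrapped call could be consumed, assuming the callAiService function above and a hypothetical queueForRetry helper (not defined in this ADR): when the circuit is open or the call fails, opossum invokes the registered fallback, which returns a partial result and queues the flyer for later processing.

// Hypothetical usage of the breaker above; queueForRetry is a placeholder.
aiCircuitBreaker.fallback(async (flyerId: string) => {
  // Invoked when the circuit is open or callAiService rejects.
  await queueForRetry(flyerId);
  return { items: [], degraded: true }; // partial result for the caller
});

async function extractFlyerItems(flyerId: string) {
  // fire() forwards its arguments to callAiService and applies the
  // timeout and error-threshold rules configured above.
  return aiCircuitBreaker.fire(flyerId);
}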

2. Fallback Behaviors by Service

Service            | Fallback Behavior
-------------------|-------------------------------------------
Redis (Cache)      | Skip cache, query database directly
Redis (Rate Limit) | Log warning, allow request (fail-open)
Redis (Queues)     | Queue to memory, process synchronously
AI Service         | Return partial results, queue for retry
Geocoding          | Return null location, allow manual entry
PostgreSQL         | No fallback - critical dependency
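
As an illustration of the cache row, a minimal read-through wrapper with fail-open semantics, assuming an ioredis-style client and an application logger (both placeholders for whatever cacheService.server.ts actually uses):

// Sketch only: `redis` is assumed to be an ioredis-style client.
async function getCached<T>(key: string, loadFromDb: () => Promise<T>): Promise<T> {
  try {
    const hit = await redis.get(key);
    if (hit !== null) return JSON.parse(hit) as T;
  } catch (err) {
    // Redis unavailable: log the degraded state and fall through to the database.
    logger.warn({ err, key }, 'Cache unavailable, querying database directly');
  }
  const value = await loadFromDb();
  try {
    await redis.set(key, JSON.stringify(value), 'EX', 300); // best-effort write-back
  } catch {
    // Ignore write failures; the caller already has the value.
  }
  return value;
}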

3. Health Status Aggregation

Extend health checks (ADR-020) to report service-level health:

// GET /api/health/ready response
{
  "status": "degraded",  // healthy | degraded | unhealthy
  "services": {
    "database": { "status": "healthy", "latency": 5 },
    "redis": { "status": "healthy", "latency": 2 },
    "ai": { "status": "degraded", "circuitState": "half-open" },
    "geocoding": { "status": "healthy", "latency": 150 }
  }
}
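
One possible shape for the roll-up logic behind this response (the function and type names are illustrative, not existing code): a PostgreSQL failure makes the whole application unhealthy, while any other impaired service only degrades it.

type ServiceStatus = 'healthy' | 'degraded' | 'unhealthy';

interface ServiceHealth {
  status: ServiceStatus;
  latency?: number;      // milliseconds
  circuitState?: string; // e.g. 'open' | 'half-open' | 'closed'
}

// Roll individual checks up into the top-level status for /api/health/ready.
function aggregateHealth(services: Record<string, ServiceHealth>): ServiceStatus {
  // PostgreSQL has no fallback (see table above), so its failure is fatal.
  if (services.database?.status === 'unhealthy') return 'unhealthy';
  const impaired = Object.values(services).some((s) => s.status !== 'healthy');
  return impaired ? 'degraded' : 'healthy';
}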

4. Retry Strategies

Define retry policies for transient failures:

const retryConfig = {
  ai: { maxRetries: 3, backoff: 'exponential', initialDelay: 1000 },
  geocoding: { maxRetries: 2, backoff: 'linear', initialDelay: 500 },
  database: { maxRetries: 3, backoff: 'exponential', initialDelay: 100 },
};
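
A sketch of a helper that could apply these policies (the name withRetry and the policy shape are assumptions, not existing code):

interface RetryPolicy {
  maxRetries: number;
  backoff: 'linear' | 'exponential';
  initialDelay: number; // milliseconds
}

const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

// Retry a transient operation according to one of the policies above.
async function withRetry<T>(operation: () => Promise<T>, policy: RetryPolicy): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await operation();
    } catch (err) {
      if (attempt >= policy.maxRetries) throw err;
      const delay =
        policy.backoff === 'exponential'
          ? policy.initialDelay * 2 ** attempt
          : policy.initialDelay * (attempt + 1);
      await sleep(delay);
    }
  }
}

A call site would then look like withRetry(() => geocodeAddress(address), retryConfig.geocoding), with geocodeAddress standing in for the real geocoding call.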

Implementation Approach

Phase 1: Redis Fallbacks

  • Wrap cache operations with try-catch (already partially done in cacheService)
  • Add fail-open for rate limiting when Redis is down (sketched after this list)
  • Log degraded state
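
A minimal sketch of the fail-open behavior, assuming an Express-style middleware and a Redis-backed checkRateLimit helper (both placeholders):

import type { NextFunction, Request, Response } from 'express';

// Sketch only: checkRateLimit stands in for the Redis-backed limiter.
async function rateLimit(req: Request, res: Response, next: NextFunction) {
  try {
    const allowed = await checkRateLimit(req.ip ?? 'unknown');
    if (!allowed) {
      res.status(429).json({ error: 'Too many requests' });
      return;
    }
  } catch (err) {
    // Redis is down: log the degraded state and let the request through.
    logger.warn({ err }, 'Rate limiter unavailable, failing open');
  }
  next();
}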

Phase 2: AI Circuit Breaker

  • Wrap AI service calls with circuit breaker
  • Implement queue-for-retry on circuit open
  • Add manual fallback UI for failed extractions

Phase 3: Health Aggregation

  • Update health endpoints with service status
  • Add Prometheus-compatible metrics (sketched after this list)
  • Create dashboard for service health
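
For the Prometheus item, one way the metrics could look, assuming the prom-client library (the metric name and helper are illustrative):

import client from 'prom-client';

// Gauge per external service: 1 = healthy, 0.5 = degraded, 0 = unhealthy.
const serviceHealthGauge = new client.Gauge({
  name: 'external_service_health',
  help: 'Health of external service dependencies',
  labelNames: ['service'],
});

// Call this whenever a health check or circuit breaker event changes service state.
function recordServiceHealth(service: string, status: 'healthy' | 'degraded' | 'unhealthy') {
  serviceHealthGauge.set({ service }, status === 'healthy' ? 1 : status === 'degraded' ? 0.5 : 0);
}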

Consequences

Positive

  • Resilience: Application continues functioning during partial outages
  • User Experience: Degraded but functional is better than complete failure
  • Observability: Clear visibility into service health
  • Protection: Circuit breakers prevent cascading failures

Negative

  • Complexity: Additional code for fallback logic
  • Testing: Requires testing failure scenarios
  • Consistency: Some operations may have different results during degradation

Implementation Status

What's Implemented

  • Cache operations fail gracefully (cacheService.server.ts)

What Needs To Be Done

  1. Install and configure the opossum circuit breaker library
  2. Wrap AI service calls with a circuit breaker
  3. Add fail-open to rate limiting
  4. Extend health endpoints with service status
  5. Document degraded mode behaviors

Key Files

  • src/utils/circuitBreaker.ts - Circuit breaker configurations (to create)
  • src/services/cacheService.server.ts - Already has graceful fallbacks
  • src/routes/health.routes.ts - Health check endpoints (to extend)
  • src/services/aiService.server.ts - AI service wrapper (to wrap)