# ADR-030: Graceful Degradation and Circuit Breaker Pattern

**Date**: 2026-01-09
**Status**: Proposed

## Context

The application depends on several external services:

1. **AI Services** (Google Gemini) - For flyer item extraction
2. **Redis** - For caching, rate limiting, and job queues
3. **PostgreSQL** - Primary data store
4. **Geocoding APIs** - For location services

Currently, when these services fail:

- AI failures may cause the entire upload to fail
- Redis unavailability could crash the application or bypass rate limiting
- No circuit breakers prevent repeated calls to failing services
- No fallback behaviors are defined

This creates fragility where a single service outage can cascade into application-wide failures.

## Decision

We will implement a graceful degradation strategy with circuit breakers for external service dependencies.

### 1. Circuit Breaker Pattern

Implement circuit breakers for external service calls using a library like `opossum`:

```typescript
import CircuitBreaker from 'opossum';

const aiCircuitBreaker = new CircuitBreaker(callAiService, {
  timeout: 30000,               // 30 second timeout
  errorThresholdPercentage: 50, // Open circuit at 50% failures
  resetTimeout: 30000,          // Try again after 30 seconds
  volumeThreshold: 5,           // Minimum calls before calculating error %
});

aiCircuitBreaker.on('open', () => {
  logger.warn('AI service circuit breaker opened');
});

aiCircuitBreaker.on('halfOpen', () => {
  logger.info('AI service circuit breaker half-open, testing...');
});
```

### 2. Fallback Behaviors by Service

| Service                | Fallback Behavior                         |
| ---------------------- | ----------------------------------------- |
| **Redis (Cache)**      | Skip cache, query database directly       |
| **Redis (Rate Limit)** | Log warning, allow request (fail-open)    |
| **Redis (Queues)**     | Queue to memory, process synchronously    |
| **AI Service**         | Return partial results, queue for retry   |
| **Geocoding**          | Return null location, allow manual entry  |
| **PostgreSQL**         | No fallback - critical dependency         |

### 3. Health Status Aggregation

Extend health checks (ADR-020) to report service-level health:

```typescript
// GET /api/health/ready response
{
  "status": "degraded", // healthy | degraded | unhealthy
  "services": {
    "database": { "status": "healthy", "latency": 5 },
    "redis": { "status": "healthy", "latency": 2 },
    "ai": { "status": "degraded", "circuitState": "half-open" },
    "geocoding": { "status": "healthy", "latency": 150 }
  }
}
```
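As a minimal sketch (not existing code), the aggregated response above could be assembled from per-service probes plus the circuit breaker state from section 1. The probe functions (`checkDatabase`, `checkRedis`, `checkGeocoding`) and the `ServiceHealth` shape are assumptions for illustration; only the `opened`/`halfOpen` getters come from `opossum`.

```typescript
import type CircuitBreaker from 'opossum';

// Shapes and probe functions below are illustrative assumptions, not existing code.
type ServiceStatus = 'healthy' | 'degraded' | 'unhealthy';
interface ServiceHealth { status: ServiceStatus; latency?: number; circuitState?: string }

declare const aiCircuitBreaker: CircuitBreaker;            // from section 1
declare function checkDatabase(): Promise<ServiceHealth>;  // hypothetical probe (e.g. SELECT 1 with timing)
declare function checkRedis(): Promise<ServiceHealth>;     // hypothetical probe (e.g. PING with timing)
declare function checkGeocoding(): Promise<ServiceHealth>; // hypothetical probe of the geocoding API

// Map the opossum circuit state onto a health status.
function circuitHealth(breaker: CircuitBreaker): ServiceHealth {
  if (breaker.opened) return { status: 'unhealthy', circuitState: 'open' };
  if (breaker.halfOpen) return { status: 'degraded', circuitState: 'half-open' };
  return { status: 'healthy', circuitState: 'closed' };
}

// PostgreSQL has no fallback (see section 2): if it is down, the app is not ready.
// Any other non-healthy service only downgrades the overall status to "degraded".
function overallStatus(services: Record<string, ServiceHealth>): ServiceStatus {
  if (services.database.status === 'unhealthy') return 'unhealthy';
  const allHealthy = Object.values(services).every((s) => s.status === 'healthy');
  return allHealthy ? 'healthy' : 'degraded';
}

export async function readiness() {
  const services = {
    database: await checkDatabase(),
    redis: await checkRedis(),
    ai: circuitHealth(aiCircuitBreaker),
    geocoding: await checkGeocoding(),
  };
  return { status: overallStatus(services), services };
}
```

Treating PostgreSQL as the only hard dependency mirrors the fallback table above: any other unhealthy service downgrades readiness to `degraded` rather than failing it outright.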
### 4. Retry Strategies

Define retry policies for transient failures:

```typescript
const retryConfig = {
  ai: { maxRetries: 3, backoff: 'exponential', initialDelay: 1000 },
  geocoding: { maxRetries: 2, backoff: 'linear', initialDelay: 500 },
  database: { maxRetries: 3, backoff: 'exponential', initialDelay: 100 },
};
```

## Implementation Approach

### Phase 1: Redis Fallbacks

- Wrap cache operations with try-catch (already partially done in cacheService)
- Add fail-open for rate limiting when Redis is down
- Log degraded state

### Phase 2: AI Circuit Breaker

- Wrap AI service calls with circuit breaker
- Implement queue-for-retry on circuit open
- Add manual fallback UI for failed extractions

### Phase 3: Health Aggregation

- Update health endpoints with service status
- Add Prometheus-compatible metrics
- Create dashboard for service health

## Consequences

### Positive

- **Resilience**: Application continues functioning during partial outages
- **User Experience**: Degraded but functional is better than complete failure
- **Observability**: Clear visibility into service health
- **Protection**: Circuit breakers prevent cascading failures

### Negative

- **Complexity**: Additional code for fallback logic
- **Testing**: Requires testing failure scenarios
- **Consistency**: Some operations may have different results during degradation

## Implementation Status

### What's Implemented

- ✅ Cache operations fail gracefully (cacheService.server.ts)
- ❌ Circuit breakers for AI services
- ❌ Rate limit fail-open behavior
- ❌ Health aggregation endpoint
- ❌ Retry strategies with backoff

### What Needs To Be Done

1. Install and configure the `opossum` circuit breaker library
2. Wrap AI service calls with circuit breaker
3. Add fail-open to rate limiting
4. Extend health endpoints with service status
5. Document degraded-mode behaviors

## Key Files

- `src/utils/circuitBreaker.ts` - Circuit breaker configurations (to create)
- `src/services/cacheService.server.ts` - Already has graceful fallbacks
- `src/routes/health.routes.ts` - Health check endpoints (to extend)
- `src/services/aiService.server.ts` - AI service wrapper (to wrap)
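As a closing illustration, a minimal sketch of a backoff helper that could consume the `retryConfig` from section 4. The `withRetry` name, the `RetryPolicy` shape, and the `geocodeAddress` call in the usage comment are assumptions, not existing code.

```typescript
// Sketch only: names and shapes are assumptions, not existing code.
interface RetryPolicy {
  maxRetries: number;
  backoff: 'exponential' | 'linear';
  initialDelay: number; // milliseconds
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Retries a transient-failure-prone operation according to the given policy.
export async function withRetry<T>(operation: () => Promise<T>, policy: RetryPolicy): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= policy.maxRetries; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      if (attempt === policy.maxRetries) break;
      // Linear backoff grows by a fixed step; exponential doubles each attempt.
      const delay =
        policy.backoff === 'exponential'
          ? policy.initialDelay * 2 ** attempt
          : policy.initialDelay * (attempt + 1);
      await sleep(delay);
    }
  }
  throw lastError;
}

// Example usage (hypothetical geocoding call):
// const location = await withRetry(() => geocodeAddress(address), retryConfig.geocoding);
```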