# ADR-030: Graceful Degradation and Circuit Breaker Pattern
**Date:** 2026-01-09

**Status:** Proposed
## Context
The application depends on several external services:
- AI Services (Google Gemini) - For flyer item extraction
- Redis - For caching, rate limiting, and job queues
- PostgreSQL - Primary data store
- Geocoding APIs - For location services
Currently, when these services fail:
- AI failures may cause the entire upload to fail
- Redis unavailability could crash the application or bypass rate limiting
- No circuit breakers prevent repeated calls to failing services
- No fallback behaviors are defined
This creates fragility where a single service outage can cascade into application-wide failures.
## Decision
We will implement a graceful degradation strategy with circuit breakers for external service dependencies.
### 1. Circuit Breaker Pattern
Implement circuit breakers for external service calls using a library like `opossum`:

```typescript
import CircuitBreaker from 'opossum';

const aiCircuitBreaker = new CircuitBreaker(callAiService, {
  timeout: 30000,               // 30 second timeout
  errorThresholdPercentage: 50, // Open circuit at 50% failures
  resetTimeout: 30000,          // Try again after 30 seconds
  volumeThreshold: 5,           // Minimum calls before calculating error %
});

aiCircuitBreaker.on('open', () => {
  logger.warn('AI service circuit breaker opened');
});

aiCircuitBreaker.on('halfOpen', () => {
  logger.info('AI service circuit breaker half-open, testing...');
});
```
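Calls are then routed through the breaker instead of invoking the service directly. A minimal usage sketch, assuming the `callAiService` wrapper above and a hypothetical `queueFlyerForRetry` helper:

```typescript
// Fallback runs whenever the circuit is open or the underlying call fails/times out.
aiCircuitBreaker.fallback(async (flyerId: string) => {
  await queueFlyerForRetry(flyerId); // hypothetical helper: defer extraction for later
  return { items: [], status: 'queued-for-retry' };
});

// fire() forwards its arguments to callAiService and resolves with either the
// service result or, when the breaker intervenes, the fallback's result.
const result = await aiCircuitBreaker.fire(flyerId);
```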
### 2. Fallback Behaviors by Service
| Service | Fallback Behavior |
|---|---|
| Redis (Cache) | Skip cache, query database directly |
| Redis (Rate Limit) | Log warning, allow request (fail-open) |
| Redis (Queues) | Queue to memory, process synchronously |
| AI Service | Return partial results, queue for retry |
| Geocoding | Return null location, allow manual entry |
| PostgreSQL | No fallback - critical dependency |
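For the rate-limit row, fail-open means a Redis error is logged and the request admitted rather than rejected. A minimal sketch of a fixed-window limiter with that behavior, assuming an `ioredis` client (key names and limits are illustrative):

```typescript
import Redis from 'ioredis';

const redis = new Redis();
const WINDOW_SECONDS = 60;
const MAX_REQUESTS = 100;

async function isAllowed(clientId: string): Promise<boolean> {
  try {
    const key = `ratelimit:${clientId}`;
    const count = await redis.incr(key);
    if (count === 1) {
      await redis.expire(key, WINDOW_SECONDS); // start the window on first request
    }
    return count <= MAX_REQUESTS;
  } catch (err) {
    // Fail-open: Redis is down, so log the degraded state and allow the request.
    logger.warn('Rate limiter unavailable, failing open', { err });
    return true;
  }
}
```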
### 3. Health Status Aggregation
Extend health checks (ADR-020) to report service-level health:
```jsonc
// GET /api/health/ready response
{
  "status": "degraded", // healthy | degraded | unhealthy
  "services": {
    "database": { "status": "healthy", "latency": 5 },
    "redis": { "status": "healthy", "latency": 2 },
    "ai": { "status": "degraded", "circuitState": "half-open" },
    "geocoding": { "status": "healthy", "latency": 150 }
  }
}
```
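The top-level status can be derived from the per-service checks: PostgreSQL is the only critical dependency (per the fallback table), so only it makes the application unhealthy, while any other non-healthy service marks it degraded. A sketch of that aggregation rule (type names are illustrative):

```typescript
type ServiceStatus = 'healthy' | 'degraded' | 'unhealthy';

function aggregateStatus(services: Record<string, ServiceStatus>): ServiceStatus {
  // The database has no fallback (see the table above), so it alone is fatal.
  if (services.database === 'unhealthy') return 'unhealthy';
  // Any other non-healthy service only degrades the application.
  if (Object.values(services).some((s) => s !== 'healthy')) return 'degraded';
  return 'healthy';
}
```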
### 4. Retry Strategies
Define retry policies for transient failures:
```typescript
const retryConfig = {
  ai: { maxRetries: 3, backoff: 'exponential', initialDelay: 1000 },
  geocoding: { maxRetries: 2, backoff: 'linear', initialDelay: 500 },
  database: { maxRetries: 3, backoff: 'exponential', initialDelay: 100 },
};
```
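A single helper can interpret these policies for all three services. A minimal sketch, assuming the `retryConfig` shape above (the helper itself is illustrative):

```typescript
interface RetryPolicy {
  maxRetries: number;
  backoff: 'exponential' | 'linear';
  initialDelay: number; // milliseconds
}

async function withRetry<T>(fn: () => Promise<T>, policy: RetryPolicy): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= policy.maxRetries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt === policy.maxRetries) break;
      // Exponential: delay doubles per attempt; linear: delay grows by a fixed step.
      const delay = policy.backoff === 'exponential'
        ? policy.initialDelay * 2 ** attempt
        : policy.initialDelay * (attempt + 1);
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw lastError;
}

// Usage: await withRetry(() => geocode(address), retryConfig.geocoding);
```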
## Implementation Approach
### Phase 1: Redis Fallbacks

- Wrap cache operations with try-catch (already partially done in `cacheService`; see the sketch after this list)
- Add fail-open for rate limiting when Redis is down
- Log degraded state
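The first item boils down to a cache-aside read that treats Redis errors as cache misses. A minimal sketch (the `Cache` interface is illustrative; `cacheService.server.ts` already partially implements this):

```typescript
interface Cache<T> {
  get(key: string): Promise<T | null>;
  set(key: string, value: T): Promise<void>;
}

// Cache-aside read with a skip-cache fallback: any Redis error falls through
// to the database, which remains the source of truth.
async function cachedQuery<T>(
  key: string,
  cache: Cache<T>,
  queryDb: () => Promise<T>,
): Promise<T> {
  try {
    const hit = await cache.get(key);
    if (hit !== null) return hit;
  } catch (err) {
    logger.warn('Cache unavailable, querying database directly', { err });
  }
  const value = await queryDb();
  cache.set(key, value).catch(() => {
    // Cache write failures are non-fatal; the caller already has the value.
  });
  return value;
}
```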
### Phase 2: AI Circuit Breaker
- Wrap AI service calls with circuit breaker
- Implement queue-for-retry on circuit open
- Add manual fallback UI for failed extractions
### Phase 3: Health Aggregation
- Update health endpoints with service status
- Add Prometheus-compatible metrics (see the sketch after this list)
- Create dashboard for service health
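For the metrics item, circuit state can be exported as a labeled gauge via `prom-client`; a sketch (the metric name and numeric state encoding are assumptions):

```typescript
import { Gauge } from 'prom-client';

// 0 = closed, 0.5 = half-open, 1 = open (encoding is illustrative)
const circuitState = new Gauge({
  name: 'external_service_circuit_state',
  help: 'Circuit breaker state per external service',
  labelNames: ['service'],
});

aiCircuitBreaker.on('close', () => circuitState.set({ service: 'ai' }, 0));
aiCircuitBreaker.on('halfOpen', () => circuitState.set({ service: 'ai' }, 0.5));
aiCircuitBreaker.on('open', () => circuitState.set({ service: 'ai' }, 1));
```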
## Consequences

### Positive
- **Resilience:** Application continues functioning during partial outages
- **User Experience:** Degraded but functional is better than complete failure
- **Observability:** Clear visibility into service health
- **Protection:** Circuit breakers prevent cascading failures
### Negative
- **Complexity:** Additional code for fallback logic
- **Testing:** Failure scenarios require explicit tests
- **Consistency:** Some operations may return different results during degradation
## Implementation Status

### What's Implemented
- ✅ Cache operations fail gracefully (cacheService.server.ts)
- ❌ Circuit breakers for AI services
- ❌ Rate limit fail-open behavior
- ❌ Health aggregation endpoint
- ❌ Retry strategies with backoff
### What Needs To Be Done
- Install and configure the `opossum` circuit breaker library
- Wrap AI service calls with circuit breaker
- Add fail-open to rate limiting
- Extend health endpoints with service status
- Document degraded mode behaviors
## Key Files

- `src/utils/circuitBreaker.ts` - Circuit breaker configurations (to create)
- `src/services/cacheService.server.ts` - Already has graceful fallbacks
- `src/routes/health.routes.ts` - Health check endpoints (to extend)
- `src/services/aiService.server.ts` - AI service wrapper (to wrap)