# ADR-030: Graceful Degradation and Circuit Breaker Pattern

**Date**: 2026-01-09

**Status**: Proposed

## Context

The application depends on several external services:

1. **AI Services** (Google Gemini) - For flyer item extraction
2. **Redis** - For caching, rate limiting, and job queues
3. **PostgreSQL** - Primary data store
4. **Geocoding APIs** - For location services

Currently, when these services fail:

- AI failures may cause the entire upload to fail
- Redis unavailability could crash the application or bypass rate limiting
- No circuit breakers prevent repeated calls to failing services
- No fallback behaviors are defined

This creates fragility where a single service outage can cascade into application-wide failures.

## Decision

We will implement a graceful degradation strategy with circuit breakers for external service dependencies.

### 1. Circuit Breaker Pattern

Implement circuit breakers for external service calls using a library like `opossum`:

```typescript
import CircuitBreaker from 'opossum';

const aiCircuitBreaker = new CircuitBreaker(callAiService, {
  timeout: 30000, // 30 second timeout
  errorThresholdPercentage: 50, // Open circuit at 50% failures
  resetTimeout: 30000, // Try again after 30 seconds
  volumeThreshold: 5, // Minimum calls before calculating error %
});

aiCircuitBreaker.on('open', () => {
  logger.warn('AI service circuit breaker opened');
});

aiCircuitBreaker.on('halfOpen', () => {
  logger.info('AI service circuit breaker half-open, testing...');
});
```
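
Opossum also lets us register a fallback that runs when the circuit is open or a call fails, which is how the "degrade instead of throw" behaviour would be wired up. A minimal sketch; `queueForRetry`, the `flyerId` parameter, and the `{ items, degraded }` result shape are illustrative, not existing code:

```typescript
// Sketch: degrade instead of throwing when the breaker rejects a call.
aiCircuitBreaker.fallback((flyerId: string) => {
  queueForRetry(flyerId); // hypothetical helper: re-enqueue extraction for later
  return { items: [], degraded: true }; // partial result handed back to the caller
});

// fire() resolves with callAiService's result, or with the fallback value
// when the circuit is open or the underlying call fails.
const extraction = await aiCircuitBreaker.fire(flyerId);
```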
### 2. Fallback Behaviors by Service

| Service                | Fallback Behavior                         |
| ---------------------- | ----------------------------------------- |
| **Redis (Cache)**      | Skip cache, query database directly       |
| **Redis (Rate Limit)** | Log warning, allow request (fail-open)    |
| **Redis (Queues)**     | Queue to memory, process synchronously    |
| **AI Service**         | Return partial results, queue for retry   |
| **Geocoding**          | Return null location, allow manual entry  |
| **PostgreSQL**         | No fallback - critical dependency         |
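
The fail-open rate-limit row is the behaviour most likely to surprise operators, so it is worth spelling out. A minimal sketch, assuming a hypothetical Redis-backed `consumeToken` helper alongside the existing `logger`:

```typescript
// Sketch: fail-open rate limiting. consumeToken() is a hypothetical
// Redis-backed check; if Redis is unreachable we log and allow the request
// rather than rejecting all traffic.
async function checkRateLimit(key: string): Promise<boolean> {
  try {
    return await consumeToken(key); // true if the caller is within its limit
  } catch (err) {
    logger.warn('Rate limiter unavailable, failing open', { key, err });
    return true; // fail-open: allow the request while Redis is down
  }
}
```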
### 3. Health Status Aggregation

Extend health checks (ADR-020) to report service-level health:

```typescript
// GET /api/health/ready response
{
  "status": "degraded", // healthy | degraded | unhealthy
  "services": {
    "database": { "status": "healthy", "latency": 5 },
    "redis": { "status": "healthy", "latency": 2 },
    "ai": { "status": "degraded", "circuitState": "half-open" },
    "geocoding": { "status": "healthy", "latency": 150 }
  }
}
```
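
One way to derive the top-level `status` field from the per-service checks, treating PostgreSQL as the only critical dependency; the `ServiceStatus` type and `aggregateStatus` helper are illustrative, not existing code:

```typescript
// Sketch: aggregate per-service checks into the overall readiness status.
// PostgreSQL down means unhealthy; any other non-healthy service only degrades.
type ServiceStatus = 'healthy' | 'degraded' | 'unhealthy';

function aggregateStatus(
  services: Record<string, { status: ServiceStatus }>,
): ServiceStatus {
  if (services.database?.status === 'unhealthy') return 'unhealthy';
  const anyDegraded = Object.values(services).some((s) => s.status !== 'healthy');
  return anyDegraded ? 'degraded' : 'healthy';
}
```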
### 4. Retry Strategies

Define retry policies for transient failures:

```typescript
const retryConfig = {
  ai: { maxRetries: 3, backoff: 'exponential', initialDelay: 1000 },
  geocoding: { maxRetries: 2, backoff: 'linear', initialDelay: 500 },
  database: { maxRetries: 3, backoff: 'exponential', initialDelay: 100 },
};
```
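
A sketch of a generic helper that applies these policies; `withRetry` and its signature are illustrative rather than an existing utility:

```typescript
// Sketch: apply a retry policy from retryConfig. The delay grows linearly or
// exponentially from initialDelay; the last error is rethrown when retries
// are exhausted.
async function withRetry<T>(
  fn: () => Promise<T>,
  policy: { maxRetries: number; backoff: string; initialDelay: number },
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= policy.maxRetries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt === policy.maxRetries) break;
      const delay =
        policy.backoff === 'exponential'
          ? policy.initialDelay * 2 ** attempt
          : policy.initialDelay * (attempt + 1);
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw lastError;
}

// Usage (geocodeAddress is a placeholder for the real geocoding call):
// const location = await withRetry(() => geocodeAddress(address), retryConfig.geocoding);
```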
## Implementation Approach

### Phase 1: Redis Fallbacks

- Wrap cache operations with try-catch (already partially done in cacheService); a sketch follows this list
- Add fail-open for rate limiting when Redis is down
- Log degraded state
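
A minimal sketch of the cache wrapping described above, assuming hypothetical `getCached` and `logger` helpers standing in for the existing cacheService and logging code:

```typescript
// Sketch: cache read that degrades to a direct database query when Redis is
// unavailable. getCached and logger stand in for the existing helpers.
async function getWithCacheFallback<T>(
  key: string,
  loadFromDb: () => Promise<T>,
): Promise<T> {
  try {
    const cached = await getCached<T>(key);
    if (cached !== null) return cached;
  } catch (err) {
    logger.warn('Cache unavailable, querying database directly', { key, err });
  }
  return loadFromDb(); // correct but slower while Redis is down
}
```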
### Phase 2: AI Circuit Breaker

- Wrap AI service calls with circuit breaker
- Implement queue-for-retry on circuit open
- Add manual fallback UI for failed extractions

### Phase 3: Health Aggregation

- Update health endpoints with service status
- Add Prometheus-compatible metrics
- Create dashboard for service health

## Consequences

### Positive

- **Resilience**: Application continues functioning during partial outages
- **User Experience**: Degraded but functional is better than complete failure
- **Observability**: Clear visibility into service health
- **Protection**: Circuit breakers prevent cascading failures

### Negative

- **Complexity**: Additional code for fallback logic
- **Testing**: Requires testing failure scenarios
- **Consistency**: Some operations may have different results during degradation

## Implementation Status

### What's Implemented

- ✅ Cache operations fail gracefully (cacheService.server.ts)
- ❌ Circuit breakers for AI services
- ❌ Rate limit fail-open behavior
- ❌ Health aggregation endpoint
- ❌ Retry strategies with backoff

### What Needs To Be Done

1. Install and configure the `opossum` circuit breaker library
2. Wrap AI service calls with circuit breaker
3. Add fail-open to rate limiting
4. Extend health endpoints with service status
5. Document degraded mode behaviors

## Key Files

- `src/utils/circuitBreaker.ts` - Circuit breaker configurations (to create)
- `src/services/cacheService.server.ts` - Already has graceful fallbacks
- `src/routes/health.routes.ts` - Health check endpoints (to extend)
- `src/services/aiService.server.ts` - AI service wrapper (to wrap)