# ADR-030: Graceful Degradation and Circuit Breaker Pattern

**Date**: 2026-01-09

**Status**: Proposed

## Context

The application depends on several external services:
1. **AI Services** (Google Gemini) - For flyer item extraction
2. **Redis** - For caching, rate limiting, and job queues
3. **PostgreSQL** - Primary data store
4. **Geocoding APIs** - For location services

Currently, when these services fail:

- AI failures may cause the entire upload to fail
- Redis unavailability could crash the application or bypass rate limiting
- No circuit breakers prevent repeated calls to failing services
- No fallback behaviors are defined

This is fragile: a single service outage can cascade into application-wide failure.
## Decision

We will implement a graceful degradation strategy with circuit breakers for external service dependencies.

### 1. Circuit Breaker Pattern

Implement circuit breakers for external service calls using a library like `opossum`:
```typescript
import CircuitBreaker from 'opossum';

const aiCircuitBreaker = new CircuitBreaker(callAiService, {
  timeout: 30000, // 30 second timeout
  errorThresholdPercentage: 50, // Open circuit at 50% failures
  resetTimeout: 30000, // Try again after 30 seconds
  volumeThreshold: 5, // Minimum calls before calculating error %
});

aiCircuitBreaker.on('open', () => {
  logger.warn('AI service circuit breaker opened');
});

aiCircuitBreaker.on('halfOpen', () => {
  logger.info('AI service circuit breaker half-open, testing...');
});
```
### 2. Fallback Behaviors by Service

| Service                | Fallback Behavior                        |
| ---------------------- | ---------------------------------------- |
| **Redis (Cache)**      | Skip cache, query database directly      |
| **Redis (Rate Limit)** | Log warning, allow request (fail-open)   |
| **Redis (Queues)**     | Queue to memory, process synchronously   |
| **AI Service**         | Return partial results, queue for retry  |
| **Geocoding**          | Return null location, allow manual entry |
| **PostgreSQL**         | No fallback - critical dependency        |
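The **Redis (Cache)** row can be sketched as a wrapper that treats any cache error as a miss. The `CacheClient` interface and the `getWithCacheFallback` name here are illustrative, not the actual cacheService API:

```typescript
// Hypothetical minimal cache interface; the real cacheService may differ.
interface CacheClient {
  get(key: string): Promise<string | null>;
  set(key: string, value: string): Promise<void>;
}

// Try the cache first; on any cache error, degrade to querying the
// database directly so a Redis outage never fails the request.
async function getWithCacheFallback<T>(
  cache: CacheClient,
  key: string,
  loadFromDb: () => Promise<T>,
): Promise<T> {
  try {
    const hit = await cache.get(key);
    if (hit !== null) return JSON.parse(hit) as T;
  } catch (err) {
    // Redis is down or misbehaving: log and fall through to the database.
    console.warn(`cache read failed for ${key}, querying database`, err);
  }
  const value = await loadFromDb();
  try {
    await cache.set(key, JSON.stringify(value));
  } catch {
    // A failed write-back must not fail the request either.
  }
  return value;
}
```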
### 3. Health Status Aggregation

Extend health checks (ADR-020) to report service-level health:
```typescript
// GET /api/health/ready response
{
  "status": "degraded", // healthy | degraded | unhealthy
  "services": {
    "database": { "status": "healthy", "latency": 5 },
    "redis": { "status": "healthy", "latency": 2 },
    "ai": { "status": "degraded", "circuitState": "half-open" },
    "geocoding": { "status": "healthy", "latency": 150 }
  }
}
```
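The top-level `status` field could be derived from the per-service results along these lines; the `aggregateStatus` helper and the choice of which services count as critical are assumptions for illustration:

```typescript
type ServiceStatus = 'healthy' | 'degraded' | 'unhealthy';

// Roll per-service checks up into one top-level status. Assumption: only
// an unhealthy *critical* service (PostgreSQL, per the fallback table)
// makes the whole app unhealthy; anything else lands on "degraded".
function aggregateStatus(
  services: Record<string, { status: ServiceStatus }>,
  critical: string[] = ['database'],
): ServiceStatus {
  const entries = Object.entries(services);
  if (entries.some(([name, s]) => s.status === 'unhealthy' && critical.includes(name))) {
    return 'unhealthy';
  }
  // Any other problem means degraded but still serving traffic.
  if (entries.some(([, s]) => s.status !== 'healthy')) return 'degraded';
  return 'healthy';
}
```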
### 4. Retry Strategies

Define retry policies for transient failures:
```typescript
const retryConfig = {
  ai: { maxRetries: 3, backoff: 'exponential', initialDelay: 1000 },
  geocoding: { maxRetries: 2, backoff: 'linear', initialDelay: 500 },
  database: { maxRetries: 3, backoff: 'exponential', initialDelay: 100 },
};
```
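A helper consuming these policies might look like the following sketch; `withRetry` and the `RetryPolicy` shape are assumptions, and the project may adopt a retry library instead:

```typescript
interface RetryPolicy {
  maxRetries: number;
  backoff: 'exponential' | 'linear';
  initialDelay: number; // milliseconds
}

const delay = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

// Run fn, retrying transient failures per the policy: up to maxRetries
// retries, waiting d, 2d, 4d... (exponential) or d, 2d, 3d... (linear).
async function withRetry<T>(fn: () => Promise<T>, policy: RetryPolicy): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= policy.maxRetries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt === policy.maxRetries) break; // out of retries
      const wait =
        policy.backoff === 'exponential'
          ? policy.initialDelay * 2 ** attempt
          : policy.initialDelay * (attempt + 1);
      await delay(wait);
    }
  }
  throw lastError;
}
```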
## Implementation Approach

### Phase 1: Redis Fallbacks

- Wrap cache operations with try-catch (already partially done in cacheService)
- Add fail-open for rate limiting when Redis is down
- Log the degraded state
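The fail-open behavior from the list above can be sketched as follows; the `RateLimiter` interface and `isAllowed` are hypothetical, not the project's actual rate-limiting middleware:

```typescript
// Hypothetical counter-style limiter backed by Redis; the real one may
// use a sliding window or token bucket instead.
interface RateLimiter {
  // Returns how many requests this key has made in the current window.
  increment(key: string): Promise<number>;
}

async function isAllowed(
  limiter: RateLimiter,
  key: string,
  limit: number,
): Promise<boolean> {
  try {
    const count = await limiter.increment(key);
    return count <= limit;
  } catch (err) {
    // Redis is unreachable: log loudly, but let the request through
    // (fail-open) rather than taking the whole API down with it.
    console.warn(`rate limiter unavailable, allowing ${key}`, err);
    return true;
  }
}
```

Fail-open trades abuse protection for availability during an outage, which matches the fallback table's choice for Redis rate limiting; a security-sensitive endpoint might prefer fail-closed.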
### Phase 2: AI Circuit Breaker

- Wrap AI service calls with circuit breaker
- Implement queue-for-retry on circuit open
- Add manual fallback UI for failed extractions
### Phase 3: Health Aggregation

- Update health endpoints with service status
- Add Prometheus-compatible metrics
- Create dashboard for service health
## Consequences

### Positive

- **Resilience**: Application continues functioning during partial outages
- **User Experience**: Degraded but functional is better than complete failure
- **Observability**: Clear visibility into service health
- **Protection**: Circuit breakers prevent cascading failures

### Negative

- **Complexity**: Additional code for fallback logic
- **Testing**: Requires testing failure scenarios
- **Consistency**: Some operations may have different results during degradation
## Implementation Status

### What's Implemented

- ✅ Cache operations fail gracefully (cacheService.server.ts)
- ❌ Circuit breakers for AI services
- ❌ Rate limit fail-open behavior
- ❌ Health aggregation endpoint
- ❌ Retry strategies with backoff
### What Needs To Be Done

1. Install and configure the `opossum` circuit breaker library
2. Wrap AI service calls with circuit breakers
3. Add fail-open behavior to rate limiting
4. Extend health endpoints with service status
5. Document degraded-mode behaviors
## Key Files

- `src/utils/circuitBreaker.ts` - Circuit breaker configurations (to create)
- `src/services/cacheService.server.ts` - Already has graceful fallbacks
- `src/routes/health.routes.ts` - Health check endpoints (to extend)
- `src/services/aiService.server.ts` - AI service wrapper (to wrap)