integration test fixes - claude for the win? try 4 - i have a good feeling

2026-01-09 05:55:55 -08:00
parent 1814469eb4
commit 4a04e478c4
20 changed files with 3006 additions and 570 deletions

@@ -0,0 +1,150 @@
# ADR-030: Graceful Degradation and Circuit Breaker Pattern
**Date**: 2026-01-09
**Status**: Proposed
## Context
The application depends on several external services:
1. **AI Services** (Google Gemini) - For flyer item extraction
2. **Redis** - For caching, rate limiting, and job queues
3. **PostgreSQL** - Primary data store
4. **Geocoding APIs** - For location services
Currently, when these services fail:
- AI failures may cause the entire upload to fail
- Redis unavailability could crash the application or bypass rate limiting
- No circuit breakers prevent repeated calls to failing services
- No fallback behaviors are defined
This creates fragility where a single service outage can cascade into application-wide failures.
## Decision
We will implement a graceful degradation strategy with circuit breakers for external service dependencies.
### 1. Circuit Breaker Pattern
Implement circuit breakers for external service calls using a library like `opossum`:
```typescript
import CircuitBreaker from 'opossum';

// callAiService is the existing AI wrapper in aiService.server.ts;
// logger is the application-wide logger.
const aiCircuitBreaker = new CircuitBreaker(callAiService, {
  timeout: 30000, // 30-second timeout per call
  errorThresholdPercentage: 50, // Open circuit at 50% failures
  resetTimeout: 30000, // Try again after 30 seconds
  volumeThreshold: 5, // Minimum calls before calculating error %
});

aiCircuitBreaker.on('open', () => {
  logger.warn('AI service circuit breaker opened');
});

aiCircuitBreaker.on('halfOpen', () => {
  logger.info('AI service circuit breaker half-open, testing...');
});
```
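Call sites then go through the breaker instead of invoking the service directly. A minimal sketch (`extractItems` and the `flyerUrl` parameter are illustrative, not the existing API):
```typescript
// fire() invokes callAiService through the breaker; while the circuit is
// open it rejects immediately instead of hammering the failing service.
async function extractItems(flyerUrl: string) {
  return aiCircuitBreaker.fire(flyerUrl);
}
```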
### 2. Fallback Behaviors by Service
| Service | Fallback Behavior |
| ---------------------- | ---------------------------------------- |
| **Redis (Cache)** | Skip cache, query database directly |
| **Redis (Rate Limit)** | Log warning, allow request (fail-open) |
| **Redis (Queues)** | Queue to memory, process synchronously |
| **AI Service** | Return partial results, queue for retry |
| **Geocoding** | Return null location, allow manual entry |
| **PostgreSQL** | No fallback - critical dependency |
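As a concrete example, the Redis cache row amounts to a fail-open read. A minimal sketch, assuming a node-redis style `redis` client and the app logger; `getCachedFlyer` and `fetchFlyerFromDb` are illustrative names, not the actual cacheService API:
```typescript
// Fail-open cache read: any Redis error degrades to a direct database
// query instead of failing the request.
async function getCachedFlyer(id: string): Promise<unknown> {
  try {
    const cached = await redis.get(`flyer:${id}`);
    if (cached) return JSON.parse(cached);
  } catch (err) {
    logger.warn({ err }, 'Redis unavailable, skipping cache');
  }
  return fetchFlyerFromDb(id); // hypothetical repository call
}
```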
### 3. Health Status Aggregation
Extend health checks (ADR-020) to report service-level health:
```typescript
// GET /api/health/ready response
{
  "status": "degraded", // healthy | degraded | unhealthy
  "services": {
    "database": { "status": "healthy", "latency": 5 },
    "redis": { "status": "healthy", "latency": 2 },
    "ai": { "status": "degraded", "circuitState": "half-open" },
    "geocoding": { "status": "healthy", "latency": 150 }
  }
}
```
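The aggregate status can be computed as the worst per-service status. A minimal sketch of that roll-up (the `ServiceStatus` type and `aggregate` helper are illustrative):
```typescript
type ServiceStatus = 'healthy' | 'degraded' | 'unhealthy';

// Overall status is the worst status reported by any service.
function aggregate(services: Record<string, { status: ServiceStatus }>): ServiceStatus {
  const statuses = Object.values(services).map((s) => s.status);
  if (statuses.includes('unhealthy')) return 'unhealthy';
  if (statuses.includes('degraded')) return 'degraded';
  return 'healthy';
}
```
In practice a non-critical dependency (e.g. geocoding) being unhealthy would likely map to an overall `degraded` rather than `unhealthy`, since only PostgreSQL is a hard dependency.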
### 4. Retry Strategies
Define retry policies for transient failures:
```typescript
const retryConfig = {
  ai: { maxRetries: 3, backoff: 'exponential', initialDelay: 1000 },
  geocoding: { maxRetries: 2, backoff: 'linear', initialDelay: 500 },
  database: { maxRetries: 3, backoff: 'exponential', initialDelay: 100 },
};
```
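A minimal sketch of a retry helper driven by these policies (`withRetry` is an illustrative name, not an existing utility):
```typescript
async function withRetry<T>(
  fn: () => Promise<T>,
  cfg: { maxRetries: number; backoff: 'exponential' | 'linear'; initialDelay: number },
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= cfg.maxRetries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt === cfg.maxRetries) break; // out of retries
      // Exponential backoff doubles each attempt; linear grows by initialDelay.
      const delay = cfg.backoff === 'exponential'
        ? cfg.initialDelay * 2 ** attempt
        : cfg.initialDelay * (attempt + 1);
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw lastError;
}
```
Usage would look like `withRetry(() => geocode(address), retryConfig.geocoding)`.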
## Implementation Approach
### Phase 1: Redis Fallbacks
- Wrap cache operations with try-catch (already partially done in cacheService)
- Add fail-open for rate limiting when Redis is down (sketched after this list)
- Log degraded state
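A minimal sketch of the fail-open rate-limit check, assuming a hypothetical `limiter.consume(key)` that resolves to a boolean and throws when Redis is unreachable:
```typescript
async function checkRateLimit(key: string): Promise<boolean> {
  try {
    return await limiter.consume(key);
  } catch (err) {
    // Fail-open: a broken limiter should not block all traffic.
    logger.warn({ err }, 'Rate limiter unavailable, failing open');
    return true;
  }
}
```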
### Phase 2: AI Circuit Breaker
- Wrap AI service calls with circuit breaker
- Implement queue-for-retry on circuit open (see the sketch after this list)
- Add manual fallback UI for failed extractions
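A sketch of queue-for-retry using opossum's fallback hook; the BullMQ queue name and payload shape are placeholders:
```typescript
import { Queue } from 'bullmq';

// Hypothetical retry queue; connection details come from the app config.
const retryQueue = new Queue('ai-extraction-retry');

// The fallback runs whenever a call fails or the circuit is open, so the
// failed extraction is queued instead of being lost.
aiCircuitBreaker.fallback(async (flyerUrl: string) => {
  await retryQueue.add('extract', { flyerUrl });
  return { items: [], queuedForRetry: true };
});
```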
### Phase 3: Health Aggregation
- Update health endpoints with service status
- Add Prometheus-compatible metrics (sketched after this list)
- Create dashboard for service health
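For the metrics item, a sketch using `prom-client` (the metric name and health encoding are illustrative):
```typescript
import client from 'prom-client';

// 1 = healthy, 0.5 = degraded, 0 = unhealthy, labeled per service.
const serviceHealth = new client.Gauge({
  name: 'external_service_health',
  help: 'Health of external dependencies (1 healthy, 0.5 degraded, 0 unhealthy)',
  labelNames: ['service'],
});

serviceHealth.set({ service: 'ai' }, 0.5);
```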
## Consequences
### Positive
- **Resilience**: Application continues functioning during partial outages
- **User Experience**: Degraded but functional is better than complete failure
- **Observability**: Clear visibility into service health
- **Protection**: Circuit breakers prevent cascading failures
### Negative
- **Complexity**: Additional code for fallback logic
- **Testing**: Requires testing failure scenarios
- **Consistency**: Some operations may have different results during degradation
## Implementation Status
### Current State
- ✅ Cache operations fail gracefully (cacheService.server.ts)
- ❌ Circuit breakers for AI services
- ❌ Rate limit fail-open behavior
- ❌ Health aggregation endpoint
- ❌ Retry strategies with backoff
### What Needs To Be Done
1. Install and configure `opossum` circuit breaker library
2. Wrap AI service calls with circuit breaker
3. Add fail-open to rate limiting
4. Extend health endpoints with service status
5. Document degraded mode behaviors
## Key Files
- `src/utils/circuitBreaker.ts` - Circuit breaker configurations (to create)
- `src/services/cacheService.server.ts` - Already has graceful fallbacks
- `src/routes/health.routes.ts` - Health check endpoints (to extend)
- `src/services/aiService.server.ts` - AI service calls (to be wrapped with the circuit breaker)