# ADR-030: Graceful Degradation and Circuit Breaker Pattern

**Date**: 2026-01-09

**Status**: Proposed

## Context

The application depends on several external services:
1. **AI Services** (Google Gemini) - For flyer item extraction
2. **Redis** - For caching, rate limiting, and job queues
3. **PostgreSQL** - Primary data store
4. **Geocoding APIs** - For location services

Currently, when these services fail:

- AI failures may cause the entire upload to fail
- Redis unavailability could crash the application or bypass rate limiting
- No circuit breakers prevent repeated calls to failing services
- No fallback behaviors are defined

This is fragile: a single service outage can cascade into application-wide failure.
## Decision

We will implement a graceful degradation strategy with circuit breakers for external service dependencies.

### 1. Circuit Breaker Pattern

Implement circuit breakers for external service calls using a library like `opossum`:
```typescript
import CircuitBreaker from 'opossum';

const aiCircuitBreaker = new CircuitBreaker(callAiService, {
  timeout: 30000, // 30 second timeout
  errorThresholdPercentage: 50, // Open circuit at 50% failures
  resetTimeout: 30000, // Try again after 30 seconds
  volumeThreshold: 5, // Minimum calls before calculating error %
});

aiCircuitBreaker.on('open', () => {
  logger.warn('AI service circuit breaker opened');
});

aiCircuitBreaker.on('halfOpen', () => {
  logger.info('AI service circuit breaker half-open, testing...');
});
```
### 2. Fallback Behaviors by Service

| Service                | Fallback Behavior                        |
| ---------------------- | ---------------------------------------- |
| **Redis (Cache)**      | Skip cache, query database directly      |
| **Redis (Rate Limit)** | Log warning, allow request (fail-open)   |
| **Redis (Queues)**     | Queue to memory, process synchronously   |
| **AI Service**         | Return partial results, queue for retry  |
| **Geocoding**          | Return null location, allow manual entry |
| **PostgreSQL**         | No fallback - critical dependency        |
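The **Redis (Cache)** row can be sketched as a wrapper that treats any cache error as a miss. The `CacheClient` interface and the `getWithCacheFallback` name here are illustrative, not the actual cacheService API:

```typescript
// Hypothetical minimal cache interface; the real cacheService may differ.
interface CacheClient {
  get(key: string): Promise<string | null>;
  set(key: string, value: string): Promise<void>;
}

// Try the cache first; on any cache error, degrade to querying the
// database directly so a Redis outage never fails the request.
async function getWithCacheFallback<T>(
  cache: CacheClient,
  key: string,
  loadFromDb: () => Promise<T>,
): Promise<T> {
  try {
    const hit = await cache.get(key);
    if (hit !== null) return JSON.parse(hit) as T;
  } catch (err) {
    // Redis is down or misbehaving: log and fall through to the database.
    console.warn(`cache read failed for ${key}, querying database`, err);
  }
  const value = await loadFromDb();
  try {
    await cache.set(key, JSON.stringify(value));
  } catch {
    // A failed write-back must not fail the request either.
  }
  return value;
}
```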
### 3. Health Status Aggregation

Extend health checks (ADR-020) to report service-level health:
```typescript
// GET /api/health/ready response
{
  "status": "degraded", // healthy | degraded | unhealthy
  "services": {
    "database": { "status": "healthy", "latency": 5 },
    "redis": { "status": "healthy", "latency": 2 },
    "ai": { "status": "degraded", "circuitState": "half-open" },
    "geocoding": { "status": "healthy", "latency": 150 }
  }
}
```
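The top-level `status` field could be derived from the per-service results along these lines; the `aggregateStatus` helper and the choice of which services count as critical are assumptions for illustration:

```typescript
type ServiceStatus = 'healthy' | 'degraded' | 'unhealthy';

// Roll per-service checks up into one top-level status. Assumption: only
// an unhealthy *critical* service (PostgreSQL, per the fallback table)
// makes the whole app unhealthy; anything else lands on "degraded".
function aggregateStatus(
  services: Record<string, { status: ServiceStatus }>,
  critical: string[] = ['database'],
): ServiceStatus {
  const entries = Object.entries(services);
  if (entries.some(([name, s]) => s.status === 'unhealthy' && critical.includes(name))) {
    return 'unhealthy';
  }
  // Any other problem means degraded but still serving traffic.
  if (entries.some(([, s]) => s.status !== 'healthy')) return 'degraded';
  return 'healthy';
}
```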
### 4. Retry Strategies

Define retry policies for transient failures:
```typescript
const retryConfig = {
  ai: { maxRetries: 3, backoff: 'exponential', initialDelay: 1000 },
  geocoding: { maxRetries: 2, backoff: 'linear', initialDelay: 500 },
  database: { maxRetries: 3, backoff: 'exponential', initialDelay: 100 },
};
```
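A helper consuming these policies might look like the following sketch; `withRetry` and the `RetryPolicy` shape are assumptions, and the project may adopt a retry library instead:

```typescript
interface RetryPolicy {
  maxRetries: number;
  backoff: 'exponential' | 'linear';
  initialDelay: number; // milliseconds
}

const delay = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

// Run fn, retrying transient failures per the policy: up to maxRetries
// retries, waiting d, 2d, 4d... (exponential) or d, 2d, 3d... (linear).
async function withRetry<T>(fn: () => Promise<T>, policy: RetryPolicy): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= policy.maxRetries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt === policy.maxRetries) break; // out of retries
      const wait =
        policy.backoff === 'exponential'
          ? policy.initialDelay * 2 ** attempt
          : policy.initialDelay * (attempt + 1);
      await delay(wait);
    }
  }
  throw lastError;
}
```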
## Implementation Approach

### Phase 1: Redis Fallbacks

- Wrap cache operations with try-catch (already partially done in cacheService)
- Add fail-open for rate limiting when Redis is down
- Log the degraded state
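The fail-open behavior from the list above can be sketched as follows; the `RateLimiter` interface and `isAllowed` are hypothetical, not the project's actual rate-limiting middleware:

```typescript
// Hypothetical counter-style limiter backed by Redis; the real one may
// use a sliding window or token bucket instead.
interface RateLimiter {
  // Returns how many requests this key has made in the current window.
  increment(key: string): Promise<number>;
}

async function isAllowed(
  limiter: RateLimiter,
  key: string,
  limit: number,
): Promise<boolean> {
  try {
    const count = await limiter.increment(key);
    return count <= limit;
  } catch (err) {
    // Redis is unreachable: log loudly, but let the request through
    // (fail-open) rather than taking the whole API down with it.
    console.warn(`rate limiter unavailable, allowing ${key}`, err);
    return true;
  }
}
```

Fail-open trades abuse protection for availability during an outage, which matches the fallback table's choice for Redis rate limiting; a security-sensitive endpoint might prefer fail-closed.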
### Phase 2: AI Circuit Breaker

- Wrap AI service calls with circuit breaker
- Implement queue-for-retry on circuit open
- Add manual fallback UI for failed extractions
### Phase 3: Health Aggregation

- Update health endpoints with service status
- Add Prometheus-compatible metrics
- Create dashboard for service health
## Consequences

### Positive

- **Resilience**: Application continues functioning during partial outages
- **User Experience**: Degraded but functional is better than complete failure
- **Observability**: Clear visibility into service health
- **Protection**: Circuit breakers prevent cascading failures

### Negative

- **Complexity**: Additional code for fallback logic
- **Testing**: Requires testing failure scenarios
- **Consistency**: Some operations may have different results during degradation
## Implementation Status

### What's Implemented

- ✅ Cache operations fail gracefully (cacheService.server.ts)
- ❌ Circuit breakers for AI services
- ❌ Rate limit fail-open behavior
- ❌ Health aggregation endpoint
- ❌ Retry strategies with backoff
### What Needs To Be Done

1. Install and configure the `opossum` circuit breaker library
2. Wrap AI service calls with circuit breakers
3. Add fail-open behavior to rate limiting
4. Extend health endpoints with service status
5. Document degraded-mode behaviors
## Key Files

- `src/utils/circuitBreaker.ts` - Circuit breaker configurations (to create)
- `src/services/cacheService.server.ts` - Already has graceful fallbacks
- `src/routes/health.routes.ts` - Health check endpoints (to extend)
- `src/services/aiService.server.ts` - AI service wrapper (to wrap)