# ADR-046: Image Processing Pipeline
**Date**: 2026-01-09
**Status**: Accepted
**Implemented**: 2026-01-09
## Context
The application handles significant image processing for flyer uploads:
1. **Privacy Protection**: Strip EXIF metadata (location, device info).
2. **Optimization**: Resize, compress, and convert images for web delivery.
3. **Icon Generation**: Create thumbnails for listing views.
4. **Format Support**: Handle JPEG, PNG, WebP, AVIF, and PDF inputs.
5. **Storage Management**: Organize processed images on disk.
These operations must be:
- **Performant**: Large images should not block the request.
- **Secure**: Prevent malicious file uploads.
- **Consistent**: Produce predictable output quality.
- **Testable**: Support unit testing without real files.
## Decision
We will implement a modular image processing pipeline using:
1. **Sharp**: For image resizing, compression, and format conversion.
2. **EXIF Parsing**: For metadata extraction (audit logging) before Sharp strips it from the output.
3. **UUID Naming**: For unique, non-guessable file names.
4. **Directory Structure**: Organized storage for originals and derivatives.
### Design Principles
- **Pipeline Pattern**: Chain processing steps in a predictable order.
- **Fail-Fast Validation**: Reject invalid files before processing.
- **Idempotent Operations**: Same input produces equivalent image output; only the generated UUID file name differs between runs.
- **Resource Cleanup**: Delete temp files on error.
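
The principles above can be illustrated with a minimal sketch. The `Step` type, `runPipeline`, and the `cleanup` callback are hypothetical names that do not appear in the project; the sketch only shows steps running in a fixed order, stopping at the first failure, and cleanup running before the error is rethrown.

```typescript
// Hypothetical sketch: these names are illustrative, not part of the codebase.
type Step<T> = (input: T) => Promise<T>;

async function runPipeline<T>(
  input: T,
  steps: Step<T>[],
  cleanup: () => Promise<void>,
): Promise<T> {
  let current = input;
  try {
    // Run steps strictly in order; the first rejection skips the remaining steps.
    for (const step of steps) {
      current = await step(current);
    }
    return current;
  } catch (error) {
    // Resource cleanup: remove temp files before surfacing the error.
    await cleanup();
    throw error;
  }
}
```

Placing validation as the first step gives the fail-fast behavior: an invalid file rejects before any expensive step runs.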
## Implementation Details
### Image Processor Module
Located in `src/utils/imageProcessor.ts`:
```typescript
import sharp from 'sharp';
import path from 'path';
import { v4 as uuidv4 } from 'uuid';
import fs from 'fs/promises';
import type { Logger } from 'pino';
// ============================================
// CONFIGURATION
// ============================================
const IMAGE_CONFIG = {
  maxWidth: 2048,
  maxHeight: 2048,
  quality: 85,
  iconSize: 200,
  allowedFormats: ['jpeg', 'png', 'webp', 'avif'],
  outputFormat: 'webp' as const,
};
// ============================================
// MAIN PROCESSING FUNCTION
// ============================================
export async function processAndSaveImage(
  inputPath: string,
  outputDir: string,
  originalFileName: string,
  logger: Logger,
): Promise<string> {
  const outputFileName = `${uuidv4()}.${IMAGE_CONFIG.outputFormat}`;
  const outputPath = path.join(outputDir, outputFileName);
  logger.info({ inputPath, outputPath }, 'Processing image');
  try {
    // Ensure the output directory exists before writing
    await fs.mkdir(outputDir, { recursive: true });
    // Sharp omits input metadata (EXIF, GPS) from the output unless
    // withMetadata() is called, so no explicit strip step is needed.
    await sharp(inputPath)
      .rotate() // Auto-rotate based on EXIF orientation
      .resize(IMAGE_CONFIG.maxWidth, IMAGE_CONFIG.maxHeight, {
        fit: 'inside',
        withoutEnlargement: true,
      })
      .webp({ quality: IMAGE_CONFIG.quality })
      .toFile(outputPath);
    logger.info({ outputPath }, 'Image processed successfully');
    return outputFileName;
  } catch (error) {
    logger.error({ error, inputPath }, 'Image processing failed');
    // Remove any partially written output before rethrowing
    await fs.rm(outputPath, { force: true });
    throw error;
  }
}
```
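
To make the effect of `fit: 'inside'` combined with `withoutEnlargement: true` concrete, the following sketch approximates the size arithmetic. `fitInside` and `Size` are hypothetical helpers written for this illustration; Sharp performs the real calculation internally and its rounding may differ by a pixel.

```typescript
// Illustrative approximation only; Sharp computes the real output size.
interface Size {
  width: number;
  height: number;
}

function fitInside(source: Size, maxWidth: number, maxHeight: number): Size {
  // Pick the scale that makes both dimensions fit within the bounds.
  // Capping at 1 mirrors withoutEnlargement: small images are never upscaled.
  const scale = Math.min(maxWidth / source.width, maxHeight / source.height, 1);
  return {
    width: Math.round(source.width * scale),
    height: Math.round(source.height * scale),
  };
}

console.log(fitInside({ width: 4096, height: 2048 }, 2048, 2048)); // { width: 2048, height: 1024 }
console.log(fitInside({ width: 800, height: 600 }, 2048, 2048)); // { width: 800, height: 600 }
```

The aspect ratio is preserved in both cases, which is why a 2048×2048 limit does not produce square output.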
### Icon Generation
```typescript
export async function generateFlyerIcon(
  inputPath: string,
  iconsDir: string,
  logger: Logger,
): Promise<string> {
  // Ensure icons directory exists
  await fs.mkdir(iconsDir, { recursive: true });
  const iconFileName = `${uuidv4()}-icon.webp`;
  const iconPath = path.join(iconsDir, iconFileName);
  logger.info({ inputPath, iconPath }, 'Generating icon');
  await sharp(inputPath)
    .resize(IMAGE_CONFIG.iconSize, IMAGE_CONFIG.iconSize, {
      fit: 'cover',
      position: 'top', // Flyers usually have store name at top
    })
    .webp({ quality: 80 })
    .toFile(iconPath);
  logger.info({ iconPath }, 'Icon generated successfully');
  return iconFileName;
}
```
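
As a rough illustration of what `fit: 'cover'` with `position: 'top'` keeps, the sketch below computes the region of the source image that survives the crop. `coverRegionFromTop` and `Region` are hypothetical names for this example, and the arithmetic approximates Sharp's behavior rather than reproducing it exactly.

```typescript
// Illustrative approximation of the crop region; not Sharp's implementation.
interface Region {
  left: number;
  top: number;
  width: number;
  height: number;
}

function coverRegionFromTop(
  srcWidth: number,
  srcHeight: number,
  targetWidth: number,
  targetHeight: number,
): Region {
  // Cover scales the image until it fills the target in both dimensions.
  const scale = Math.max(targetWidth / srcWidth, targetHeight / srcHeight);
  // Size of the surviving area, expressed in source pixels.
  const width = Math.min(srcWidth, Math.round(targetWidth / scale));
  const height = Math.min(srcHeight, Math.round(targetHeight / scale));
  // 'top' anchors the crop to the top edge and centers it horizontally.
  return { left: Math.floor((srcWidth - width) / 2), top: 0, width, height };
}

// A portrait flyer keeps its top square, where the store name usually sits.
console.log(coverRegionFromTop(1000, 1500, 200, 200)); // { left: 0, top: 0, width: 1000, height: 1000 }
```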
### EXIF Metadata Extraction
For audit/logging purposes before stripping:
```typescript
import ExifParser from 'exif-parser';
// Field types are assumptions based on the tags read below.
interface ExifMetadata {
  make?: string;
  model?: string;
  dateTime?: number;
  gpsLatitude?: number;
  gpsLongitude?: number;
  orientation?: number;
}
export async function extractExifMetadata(
  filePath: string,
  logger: Logger,
): Promise<ExifMetadata | null> {
  try {
    const buffer = await fs.readFile(filePath);
    const parser = ExifParser.create(buffer);
    const result = parser.parse();
    const metadata: ExifMetadata = {
      make: result.tags?.Make,
      model: result.tags?.Model,
      dateTime: result.tags?.DateTimeOriginal,
      gpsLatitude: result.tags?.GPSLatitude,
      gpsLongitude: result.tags?.GPSLongitude,
      orientation: result.tags?.Orientation,
    };
    // Log if GPS data was present (privacy concern).
    // Compare against undefined/null because 0 is a valid coordinate.
    if (metadata.gpsLatitude != null || metadata.gpsLongitude != null) {
      logger.info({ filePath }, 'GPS data found in image, will be stripped during processing');
    }
    return metadata;
  } catch (error) {
    logger.debug({ error, filePath }, 'No EXIF data found or parsing failed');
    return null;
  }
}
```
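
Because the extracted metadata exists for audit and logging, it may be worth reducing it to non-sensitive flags before it reaches a log line. The sketch below is one possible approach, not the project's implementation: `summarizeExifForAudit` and `ExifAuditSummary` are hypothetical, and the `ExifMetadata` shape is assumed from the fields read in `extractExifMetadata`.

```typescript
// Shape assumed from the fields read in extractExifMetadata.
interface ExifMetadata {
  make?: string;
  model?: string;
  dateTime?: number;
  gpsLatitude?: number;
  gpsLongitude?: number;
  orientation?: number;
}

// Hypothetical audit record: records that data was present, never the values.
interface ExifAuditSummary {
  hasGps: boolean;
  hasDeviceInfo: boolean;
  orientation?: number;
}

function summarizeExifForAudit(metadata: ExifMetadata): ExifAuditSummary {
  return {
    // Compare against undefined: 0 is a valid latitude or longitude.
    hasGps: metadata.gpsLatitude !== undefined || metadata.gpsLongitude !== undefined,
    hasDeviceInfo: metadata.make !== undefined || metadata.model !== undefined,
    orientation: metadata.orientation,
  };
}
```

Logging the summary instead of the raw object keeps coordinates and device identifiers out of log storage while still recording that they were present.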
### PDF to Image Conversion
```typescript
import * as pdfjs from 'pdfjs-dist';
import { createCanvas } from 'canvas'; // node-canvas provides the render target
export async function convertPdfToImages(
  pdfPath: string,
  outputDir: string,
  logger: Logger,
): Promise<string[]> {
  const pdfData = await fs.readFile(pdfPath);
  // pdf.js expects a plain Uint8Array rather than a Node Buffer
  const pdf = await pdfjs.getDocument({ data: new Uint8Array(pdfData) }).promise;
  const outputPaths: string[] = [];
  for (let i = 1; i <= pdf.numPages; i++) {
    const page = await pdf.getPage(i);
    const viewport = page.getViewport({ scale: 2.0 }); // 2x for quality
    // Create canvas and render
    const canvas = createCanvas(viewport.width, viewport.height);
    const context = canvas.getContext('2d');
    await page.render({
      canvasContext: context,
      viewport: viewport,
    }).promise;
    // Save as image
    const outputFileName = `${uuidv4()}-page-${i}.png`;
    const outputPath = path.join(outputDir, outputFileName);
    const buffer = canvas.toBuffer('image/png');
    await fs.writeFile(outputPath, buffer);
    outputPaths.push(outputPath);
    logger.info({ page: i, outputPath }, 'PDF page converted to image');
  }
  return outputPaths;
}
```
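
The `scale: 2.0` viewport can be read in terms of resolution: PDF page dimensions are expressed in points of 1/72 inch, so a scale of 2.0 renders at roughly 144 DPI. The helpers below are hypothetical and only sketch that arithmetic, along with a rough estimate of canvas memory assuming 4 bytes per pixel.

```typescript
// PDF user space units are 1/72 inch.
const PDF_POINTS_PER_INCH = 72;

// Viewport scale needed to render at a given DPI.
function scaleForDpi(dpi: number): number {
  return dpi / PDF_POINTS_PER_INCH;
}

// Approximate rendered pixel size and RGBA buffer size for one page.
function renderedPageSize(widthPt: number, heightPt: number, scale: number) {
  const width = Math.ceil(widthPt * scale);
  const height = Math.ceil(heightPt * scale);
  return { width, height, approxBytes: width * height * 4 };
}

// US Letter (612 x 792 pt) at scale 2.0
console.log(renderedPageSize(612, 792, scaleForDpi(144)));
// { width: 1224, height: 1584, approxBytes: 7755264 }
```

An estimate like this can help decide on a page-count or scale limit, since each page is rendered into its own canvas.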
### File Validation
```typescript
import { fileTypeFromBuffer } from 'file-type';
export async function validateImageFile(
  filePath: string,
  logger: Logger,
): Promise<{ valid: boolean; mimeType: string | null; error?: string }> {
  try {
    // Read the header only. fs.readFile has no length option,
    // so read a fixed number of bytes through a file handle.
    const handle = await fs.open(filePath, 'r');
    let header: Buffer;
    try {
      const { buffer, bytesRead } = await handle.read(Buffer.alloc(4100), 0, 4100, 0);
      header = buffer.subarray(0, bytesRead);
    } finally {
      await handle.close();
    }
    const type = await fileTypeFromBuffer(header);
    if (!type) {
      return { valid: false, mimeType: null, error: 'Unknown file type' };
    }
    const allowedMimes = ['image/jpeg', 'image/png', 'image/webp', 'image/avif', 'application/pdf'];
    if (!allowedMimes.includes(type.mime)) {
      return {
        valid: false,
        mimeType: type.mime,
        error: `File type ${type.mime} not allowed`,
      };
    }
    return { valid: true, mimeType: type.mime };
  } catch (error) {
    logger.error({ error, filePath }, 'File validation failed');
    return { valid: false, mimeType: null, error: 'Validation error' };
  }
}
```
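
Content-based validation works by inspecting the leading "magic bytes" of a file instead of trusting its extension. The sketch below shows the idea for four of the allowed formats. `sniffMimeType` is a hypothetical, deliberately simplified stand-in; the `file-type` package covers far more formats and edge cases and remains the better choice in practice.

```typescript
// Simplified illustration of magic-byte detection; not a replacement for file-type.
function sniffMimeType(header: Uint8Array): string | null {
  // True when the header contains the given bytes at the given offset.
  const matches = (bytes: number[], offset = 0): boolean =>
    header.length >= offset + bytes.length &&
    bytes.every((byte, i) => header[offset + i] === byte);

  if (matches([0xff, 0xd8, 0xff])) return 'image/jpeg';
  if (matches([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a])) return 'image/png';
  // WebP: "RIFF", a 4-byte size field, then "WEBP"
  if (matches([0x52, 0x49, 0x46, 0x46]) && matches([0x57, 0x45, 0x42, 0x50], 8)) {
    return 'image/webp';
  }
  // PDF: "%PDF-"
  if (matches([0x25, 0x50, 0x44, 0x46, 0x2d])) return 'application/pdf';
  return null;
}
```

A script renamed to `flyer.jpg` fails this check because its first bytes do not match any known signature, regardless of its name.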
### Storage Organization
```
flyer-images/
├── originals/          # Uploaded files (if kept)
│   └── {uuid}.{ext}
├── processed/          # Optimized images (or root level)
│   └── {uuid}.webp
├── icons/              # Thumbnails
│   └── {uuid}-icon.webp
└── temp/               # Temporary processing files
    └── {uuid}.tmp
```
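
The layout above could be captured in a single helper so that path conventions live in one place. This is a sketch under assumptions: `buildStoragePaths` is hypothetical, and it uses one shared identifier for all derivatives, whereas the functions in this ADR generate a separate UUID per output file.

```typescript
import * as path from 'node:path';

// Hypothetical helper mirroring the directory layout shown above.
function buildStoragePaths(rootDir: string, id: string, originalExt: string) {
  return {
    original: path.join(rootDir, 'originals', `${id}.${originalExt}`),
    processed: path.join(rootDir, 'processed', `${id}.webp`),
    icon: path.join(rootDir, 'icons', `${id}-icon.webp`),
    temp: path.join(rootDir, 'temp', `${id}.tmp`),
  };
}

const paths = buildStoragePaths('flyer-images', 'example-id', 'jpg');
console.log(paths.icon);
```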
### Cleanup Utilities
```typescript
export async function cleanupTempFiles(
  tempDir: string,
  maxAgeMs: number,
  logger: Logger,
): Promise<number> {
  const files = await fs.readdir(tempDir);
  const now = Date.now();
  let deletedCount = 0;
  for (const file of files) {
    const filePath = path.join(tempDir, file);
    const stats = await fs.stat(filePath);
    // Skip subdirectories: unlink only works on files
    if (!stats.isFile()) continue;
    const age = now - stats.mtimeMs;
    if (age > maxAgeMs) {
      await fs.unlink(filePath);
      deletedCount++;
    }
  }
  logger.info({ deletedCount, tempDir }, 'Cleaned up temp files');
  return deletedCount;
}
```
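
In line with the goal of unit testing without real files, the age check can be separated from the file system calls. The sketch below is one way to do that; `TempEntry` and `selectExpired` are hypothetical names, and the caller is assumed to gather the entries with `fs.readdir` and `fs.stat`.

```typescript
// Hypothetical, file-system-free description of a temp directory entry.
interface TempEntry {
  name: string;
  mtimeMs: number;
  isFile: boolean;
}

// Returns the names of regular files older than maxAgeMs.
function selectExpired(entries: TempEntry[], nowMs: number, maxAgeMs: number): string[] {
  return entries
    .filter((entry) => entry.isFile && nowMs - entry.mtimeMs > maxAgeMs)
    .map((entry) => entry.name);
}

const ONE_HOUR_MS = 60 * 60 * 1000;
const now = 10 * ONE_HOUR_MS;
const expired = selectExpired(
  [
    { name: 'old.tmp', mtimeMs: now - 2 * ONE_HOUR_MS, isFile: true },
    { name: 'fresh.tmp', mtimeMs: now - 1000, isFile: true },
    { name: 'subdir', mtimeMs: 0, isFile: false },
  ],
  now,
  ONE_HOUR_MS,
);
console.log(expired); // [ 'old.tmp' ]
```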
### Integration with Flyer Processing
```typescript
// In flyerProcessingService.ts
export async function processUploadedFlyer(
  file: Express.Multer.File,
  logger: Logger,
): Promise<{ imageUrl: string; iconUrl: string }> {
  const flyerImageDir = 'flyer-images';
  const iconsDir = path.join(flyerImageDir, 'icons');
  try {
    // 1. Validate file
    const validation = await validateImageFile(file.path, logger);
    if (!validation.valid) {
      throw new ValidationError([{ path: 'file', message: validation.error ?? 'Invalid file' }]);
    }
    // 2. Extract and log EXIF before stripping
    await extractExifMetadata(file.path, logger);
    // 3. Process and optimize image
    const processedFileName = await processAndSaveImage(
      file.path,
      flyerImageDir,
      file.originalname,
      logger,
    );
    // 4. Generate icon
    const processedImagePath = path.join(flyerImageDir, processedFileName);
    const iconFileName = await generateFlyerIcon(processedImagePath, iconsDir, logger);
    // 5. Construct URLs
    const baseUrl = process.env.BACKEND_URL || 'http://localhost:3001';
    const imageUrl = `${baseUrl}/flyer-images/${processedFileName}`;
    const iconUrl = `${baseUrl}/flyer-images/icons/${iconFileName}`;
    return { imageUrl, iconUrl };
  } finally {
    // 6. Delete original upload (privacy), whether processing succeeded or failed
    await fs.rm(file.path, { force: true });
  }
}
```
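
URL construction by string interpolation produces a double slash if `BACKEND_URL` is configured with a trailing slash. A small helper can normalize this. `buildPublicUrl` is a hypothetical function sketched for illustration, not part of the service.

```typescript
// Hypothetical helper: joins a base URL and path segments with single slashes.
function buildPublicUrl(baseUrl: string, ...segments: string[]): string {
  // Drop trailing slashes so the join never yields "//".
  const base = baseUrl.replace(/\/+$/, '');
  // Encode each segment so unexpected characters cannot alter the path.
  const pathPart = segments.map((segment) => encodeURIComponent(segment)).join('/');
  return `${base}/${pathPart}`;
}

console.log(buildPublicUrl('http://localhost:3001/', 'flyer-images', 'icons', 'abc-icon.webp'));
// http://localhost:3001/flyer-images/icons/abc-icon.webp
```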
## Consequences
### Positive
- **Privacy**: EXIF metadata (including GPS) is stripped automatically.
- **Performance**: WebP output is typically 25-35% smaller than JPEG at comparable quality.
- **Consistency**: All images processed to standard format and dimensions.
- **Security**: Content-based file type validation rejects disguised or unsupported uploads before processing.
- **Organization**: Clear directory structure for storage management.
### Negative
- **CPU Intensive**: Image processing can be slow for large files.
- **Storage**: Keeping originals doubles storage requirements.
- **Dependency**: Sharp requires native binaries.
### Mitigation
- Process images in background jobs (BullMQ queue).
- Configure whether to keep originals based on requirements.
- Use pre-built Sharp binaries via npm.
## Key Files
- `src/utils/imageProcessor.ts` - Core image processing functions
- `src/services/flyer/flyerProcessingService.ts` - Integration with flyer workflow
- `src/middleware/fileUpload.middleware.ts` - Multer configuration
## Related ADRs
- [ADR-033](./0033-file-upload-and-storage-strategy.md) - File Upload Strategy
- [ADR-006](./0006-background-job-processing-and-task-queues.md) - Background Jobs
- [ADR-041](./0041-ai-gemini-integration-architecture.md) - AI Integration (uses processed images)