Catalog Weaver
Windows desktop application for enriching automotive aftermarket Excel catalogs with product images, vehicle compatibility, and AI-assisted quality control, built around polite browser discovery, caching, and export back to Excel.
Overview
Catalog Weaver is a Windows desktop application designed for structured enrichment of auto-parts Excel catalogs.
The application focuses on controlled, resumable workflows, allowing users to:
- import supplier catalogs from Excel;
- search product images across EU aftermarket sources via browser and HTTP discovery;
- infer vehicle compatibility (passenger cars and commercial vehicles);
- run an optional AI consensus pipeline to rank candidates and filter poor-quality images;
- cache results by brand + article and resume interrupted batches;
- export enriched data back to Excel with embedded images.
The system was designed with modular search providers, profile-based configuration, and headless CLI support for batch demos and CI-safe testing.
Context
The project originated from the need to accelerate catalog preparation for aftermarket parts distributors: hundreds or thousands of rows with brand, article, and description, but missing reliable product photos and fitment lists.
The target workflow required:
- non-invasive crawling (throttling, single-part concurrency, Google session handling);
- Ukrainian and Latin column mapping for real supplier files;
- confidence scoring and manual review for uncertain matches;
- vehicle compatibility without manual TecDoc lookup for every row;
- resumable long runs (cache, skip-processed rows, periodic auto-save);
- export that keeps original columns and adds only image URL, embedded image, and compatibility.
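The column-mapping requirement can be illustrated with a small header-detection sketch. It is written in Python for brevity (the project itself is C#), and the alias table, field names, and function names are assumptions made for illustration, not the application's actual mapping rules.

```python
import unicodedata

# Hypothetical alias table: canonical field -> header spellings seen in supplier files.
HEADER_ALIASES = {
    "brand": {"brand", "бренд", "виробник"},
    "article": {"article", "sku", "артикул", "код"},
    "description": {"description", "опис", "назва", "найменування"},
}


def normalize_header(text: str) -> str:
    """Trim, lowercase, and Unicode-normalize a header cell."""
    return unicodedata.normalize("NFKC", text).strip().lower()


def map_columns(headers):
    """Return {canonical_field: column_index} for headers matching a known alias."""
    mapping = {}
    for index, header in enumerate(headers):
        key = normalize_header(header)
        for field, aliases in HEADER_ALIASES.items():
            # The first matching column wins; later duplicates are ignored.
            if key in aliases and field not in mapping:
                mapping[field] = index
    return mapping


print(map_columns(["Бренд", " Артикул ", "Опис", "Ціна"]))
```

Unrecognized headers (here the price column) are simply left unmapped, which is where a manual mapping step in the UI would take over.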
A major design requirement was combining traditional scraping with LLM-assisted decision-making without losing auditability or blowing through API budgets on every candidate image.
Responsibilities
Responsibilities included:
- overall solution architecture and phased delivery plan;
- WPF desktop UI and settings workflow;
- search orchestration (Playwright, DuckDuckGo/Bing HTTP, site-scoped providers);
- AI consensus pipeline (Gemini pre-rank → final model → rule + vision image quality);
- compatibility layer (Google AI overview, OpenAI API fitment);
- SQLite and file-based consensus caching;
- Excel import/export (ClosedXML);
- throttling, failsafe batch stop, logging (Serilog);
- CLI for headless batch runs and a CI test slice.
Solution
The solution was implemented as a .NET 8 WPF application with clear layers: Core, Search, Consensus, Compatibility, Cache, Export, Excel, Images, Infrastructure, App, and Cli.
Users can:
- open an Excel catalog and map columns (Brand, Article, Description, etc.);
- choose a search profile (e.g. EU aftermarket suppliers, catalog-only mode);
- start a batch search with live progress and row-level status;
- review selected images, compatibility summaries, and watermark warnings;
- export {catalog}_results.xlsx with source columns preserved plus three new fields;
- clear cache and re-run selected rows after tuning profiles or API keys.
AI consensus mode (optional) chains Playwright discovery, Gemini pre-rank, final model selection, image quality screening, and API-based compatibility when browser scraping is skipped or empty.
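The cost-saving idea behind this chain (score every candidate cheaply, then spend the expensive model only on the best few) can be sketched as follows. This is a Python illustration under stated assumptions, not the project's C# code: `Candidate`, `two_stage_select`, and both stub scorers are hypothetical stand-ins for the Gemini pre-rank and the final model or vision check.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    url: str
    title: str


def two_stage_select(candidates, pre_rank, final_check, top_k=3):
    """Pre-rank all candidates cheaply, then run the costly check on the top_k only."""
    ranked = sorted(candidates, key=pre_rank, reverse=True)
    for candidate in ranked[:top_k]:
        if final_check(candidate):
            return candidate
    return None  # nothing passed; a caller could fall back to rule-based scoring


expensive_calls = []


def stub_pre_rank(candidate):
    # Stand-in for the cheap stage: does the title mention the article?
    return 1.0 if "AB-123" in candidate.title else 0.0


def stub_final_check(candidate):
    # Stand-in for the expensive stage: reject obviously watermarked images.
    expensive_calls.append(candidate.url)
    return "watermark" not in candidate.url


candidates = [
    Candidate("https://example.com/category", "Brake pads"),
    Candidate("https://example.com/wm/watermark.jpg", "Pad AB-123"),
    Candidate("https://example.com/p/ab-123.jpg", "Brake pad AB-123"),
]
best = two_stage_select(candidates, stub_pre_rank, stub_final_check, top_k=2)
print(best.url, len(expensive_calls))
```

With `top_k=2` the category page never reaches the expensive stage, so the per-row API cost is bounded by `top_k` rather than by the number of discovered candidates.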
Special attention was paid to polite automation: concurrency of one part at a time, configurable delays, early stop on Google CAPTCHA, and failsafe stop after repeated infrastructure failures.
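A minimal sketch of such a polite batch loop is shown below, in Python rather than the project's C#. The names `run_batch` and `CaptchaDetected`, the default delay, and the failure threshold are all assumptions for illustration.

```python
import time


class CaptchaDetected(Exception):
    """Raised by the (hypothetical) search step when a CAPTCHA page is detected."""


def run_batch(parts, search, delay_seconds=2.0, max_consecutive_failures=3,
              sleep=time.sleep):
    """Process one part at a time; return (results, stop_reason)."""
    results = {}
    consecutive_failures = 0
    stop_reason = "completed"
    for index, part in enumerate(parts):
        if index > 0:
            sleep(delay_seconds)  # configurable pause between requests
        try:
            results[part] = search(part)
            consecutive_failures = 0  # any success resets the failsafe counter
        except CaptchaDetected:
            stop_reason = "captcha"  # stop early instead of hammering the site
            break
        except Exception:
            consecutive_failures += 1
            if consecutive_failures >= max_consecutive_failures:
                stop_reason = "failsafe"
                break
    return results, stop_reason


def flaky_search(part):
    if part.startswith("bad"):
        raise RuntimeError("simulated infrastructure failure")
    return part.upper()


print(run_batch(["a", "bad1", "bad2", "bad3", "b"], flaky_search, sleep=lambda s: None))
```

Injecting `sleep` keeps the loop testable without real delays; the stop reason gives the caller something concrete to log and show to the user.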
Technical Details
Stack
- C# / .NET 8
- WPF (desktop UI)
- Playwright (Chrome persistent profile, organic SERP navigation)
- AngleSharp / HTML parsers (product page extraction)
- ClosedXML (Excel read/write, embedded images)
- SQLite (search result cache)
- JSON file cache (consensus results)
- Serilog (rolling file logs)
- HttpClient (Bing, DuckDuckGo, OpenAI-compatible APIs)
- Gemini + OpenAI-compatible chat/vision endpoints
Architecture
Core layers:
- Search — provider pipeline, discovery service, confidence scoring, visual verification;
- Consensus — candidate collection, two-stage evaluator, image quality integrator;
- Compatibility — orchestrator, AI client, Google AI overview parser;
- Cache — SQLite repository + file consensus cache;
- Export — slim Excel export, CSV batch summary, manual review workbook;
- App — MainViewModel, settings, profiles, progress and preview UI;
- Cli — headless batch for demos and automation.
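As an illustration of the Cache layer, the sketch below keys results by brand + article in SQLite and uses cache hits to decide which rows still need processing. It is written in Python for brevity; the table name, columns, and helper functions are hypothetical and not the project's actual schema.

```python
import sqlite3


def open_cache(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS search_cache (
               brand TEXT NOT NULL,
               article TEXT NOT NULL,
               image_url TEXT,
               compatibility TEXT,
               PRIMARY KEY (brand, article)
           )"""
    )
    return conn


def cache_key(brand, article):
    # Case- and whitespace-insensitive key so re-imports hit the same row.
    return brand.strip().upper(), article.strip().upper()


def put(conn, brand, article, image_url, compatibility):
    conn.execute(
        "INSERT OR REPLACE INTO search_cache VALUES (?, ?, ?, ?)",
        (*cache_key(brand, article), image_url, compatibility),
    )
    conn.commit()


def get(conn, brand, article):
    """Return (image_url, compatibility), or None on a cache miss."""
    return conn.execute(
        "SELECT image_url, compatibility FROM search_cache"
        " WHERE brand = ? AND article = ?",
        cache_key(brand, article),
    ).fetchone()


def rows_to_process(conn, rows):
    """Keep only the (brand, article) rows that are not cached yet."""
    return [row for row in rows if get(conn, *row) is None]


conn = open_cache()
put(conn, "ACME", "AB-123", "https://example.com/ab-123.jpg", "Example fitment")
print(rows_to_process(conn, [("acme", "ab-123"), ("ACME", "ZZ-9")]))
```

Committing after each row trades some throughput for durability, which suits an interrupted long run better than batching writes.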
Search profiles (JSON in work folder) define allowed domains, catalog-only mode, compatibility limits, and provider enablement without recompiling.
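For illustration, a profile of this kind might look like the JSON embedded below, together with a small loader and a domain check. The field names (`allowedDomains`, `catalogOnly`, `maxCompatibilityEntries`, `providers`) and the domains are assumptions, not the application's real schema, and the sketch uses Python instead of the project's C#.

```python
import json
from dataclasses import dataclass
from urllib.parse import urlparse

# Hypothetical profile file content; every key name here is an assumption.
PROFILE_JSON = """
{
  "name": "eu-aftermarket",
  "allowedDomains": ["example-parts.eu", "example-catalog.de"],
  "catalogOnly": false,
  "maxCompatibilityEntries": 20,
  "providers": {"playwright": true, "bingHttp": true, "duckDuckGoHttp": false}
}
"""


@dataclass
class SearchProfile:
    name: str
    allowed_domains: list
    catalog_only: bool
    max_compatibility_entries: int
    enabled_providers: list


def load_profile(text):
    raw = json.loads(text)
    return SearchProfile(
        name=raw["name"],
        allowed_domains=[d.lower() for d in raw.get("allowedDomains", [])],
        catalog_only=bool(raw.get("catalogOnly", False)),
        max_compatibility_entries=int(raw.get("maxCompatibilityEntries", 0)),
        enabled_providers=[p for p, on in raw.get("providers", {}).items() if on],
    )


def is_allowed(profile, url):
    """True if the URL host is an allowed domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in profile.allowed_domains)


profile = load_profile(PROFILE_JSON)
print(profile.enabled_providers, is_allowed(profile, "https://www.example-parts.eu/p/1"))
```

Matching on the exact host or a dot-prefixed suffix avoids accepting look-alike hosts that merely contain an allowed domain as a substring.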
Functionality
Implemented functionality includes:
- Excel import with column mapping and Ukrainian header detection;
- image search via site-scoped web search, Playwright-driven catalog browsing, and image search APIs;
- vehicle compatibility (browser + API paths);
- AI consensus with two-stage ranking and image quality scoring;
- local image download (PNG/JPEG) or URL-only mode;
- cache hydration on file reopen (resume without re-scraping);
- skip successful rows on re-run;
- periodic auto-save (interval + row count);
- auto-export after job completion;
- consecutive-failure failsafe with log + user notification;
- job CSV logs and optional manual-review export;
- work-folder portability (database, images, profiles, logs, settings).
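The periodic auto-save item above combines two triggers, elapsed time and processed row count. One way to express that is a small policy object, sketched here in Python (the project is C#); the class name, method name, and default thresholds are hypothetical.

```python
import time


class AutoSavePolicy:
    """Signals a save when enough rows or enough time have passed since the last one."""

    def __init__(self, every_rows=25, every_seconds=300.0, clock=time.monotonic):
        self.every_rows = every_rows
        self.every_seconds = every_seconds
        self._clock = clock
        self._rows_since_save = 0
        self._last_save = clock()

    def row_completed(self):
        """Call once per finished row; returns True when the caller should save now."""
        self._rows_since_save += 1
        now = self._clock()
        due = (
            self._rows_since_save >= self.every_rows
            or now - self._last_save >= self.every_seconds
        )
        if due:
            # Reset both triggers so the next save is measured from this point.
            self._rows_since_save = 0
            self._last_save = now
        return due


fake_time = [0.0]
policy = AutoSavePolicy(every_rows=3, every_seconds=60.0, clock=lambda: fake_time[0])
print([policy.row_completed() for _ in range(3)])
```

The injectable clock is only there to make the time-based trigger testable without waiting.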
Challenges
Main challenges included:
- anti-bot friction — Google CAPTCHA, Cloudflare on supplier sites, DuckDuckGo HTTP blocks;
- article format mismatch — dots, dashes, parentheses in SKU vs page HTML;
- image false positives — marketplace watermarks, generic category pages ranked above product pages;
- LLM cost vs quality — pre-rank on all candidates, vision only on top images;
- long-run reliability — cache before disk write, auto-save during 100+ row batches;
- commercial + passenger fitment — prompt and parse scope for trucks vs universal consumables;
- protected GitLab workflow — feature branches and merge requests for main.
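The article format mismatch in particular lends itself to a short example. A common approach, sketched here in Python as an assumption about how such matching could work rather than the project's actual rule set, is to strip separators from both the SKU and the page text before comparing.

```python
import re

# Separators that commonly differ between catalog SKUs and page HTML.
_SEPARATORS = re.compile(r"[\s.\-_/()]+")


def normalize_article(article: str) -> str:
    """Drop dots, dashes, spaces, slashes, underscores, parentheses; uppercase the rest."""
    return _SEPARATORS.sub("", article).upper()


def page_mentions_article(page_text: str, article: str) -> bool:
    """True if the normalized article appears in the normalized page text."""
    needle = normalize_article(article)
    return bool(needle) and needle in normalize_article(page_text)


print(normalize_article("12.345-67(A)"))
print(page_mentions_article("Art. No. 12345-67 (A) front axle", "12.345-67(A)"))
```

Because separators are removed from the whole page text, a match can occasionally span two unrelated tokens, so a result like this is better treated as one signal feeding the confidence score than as proof of a match.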
Another important challenge was keeping the export contract simple for downstream ERP/marketplace tools: original columns untouched, only three enrichment columns added.
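The export contract can be sketched as a pure function over rows: original headers and cells pass through unchanged, and exactly three columns are appended. This Python sketch works on plain lists rather than a real workbook; the column titles, the key columns, and the `enrichment` dictionary shape are assumptions, and the embedded image is represented only by a local file path placeholder.

```python
# Titles of the three appended columns are assumptions for illustration.
ENRICHMENT_COLUMNS = ["Image URL", "Image", "Compatibility"]


def build_export(headers, rows, enrichment, key_columns=("Brand", "Article")):
    """Return (headers, rows) with the source columns intact plus three new ones."""
    key_indexes = [headers.index(name) for name in key_columns]
    out_headers = list(headers) + ENRICHMENT_COLUMNS
    out_rows = []
    for row in rows:
        key = tuple(row[i] for i in key_indexes)
        extra = enrichment.get(key, {})  # rows without results get empty cells
        out_rows.append(
            list(row)
            + [
                extra.get("image_url", ""),
                extra.get("image_path", ""),
                extra.get("compatibility", ""),
            ]
        )
    return out_headers, out_rows


headers = ["Brand", "Article", "Description", "Price"]
rows = [["ACME", "AB-123", "Brake pad", "19.90"], ["ACME", "ZZ-9", "Filter", "4.50"]]
enrichment = {
    ("ACME", "AB-123"): {
        "image_url": "https://example.com/ab-123.jpg",
        "image_path": "images/ACME_AB-123.jpg",
        "compatibility": "Example fitment",
    }
}
out_headers, out_rows = build_export(headers, rows, enrichment)
print(out_headers)
```

Copying each row before appending keeps the input untouched, which makes it easy to assert in a test that the first N columns of the output equal the source.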
Result
The project successfully demonstrated:
- end-to-end catalog enrichment from Excel and back to Excel;
- reusable search provider and profile architecture;
- production-oriented AI consensus with fallback to rule-based scoring;
- resumable batches via SQLite + consensus cache + skip-processed rows;
- failsafe and auto-save for unattended long runs;
- modular codebase suitable for CLI, CI tests, and future supplier-specific adapters.
The architecture can serve as a foundation for aftermarket catalog automation, PIM prep, and marketplace listing workflows.
Notes
- Client catalog samples and API keys are not published.
- AI model IDs and pricing evolve; stored defaults are migrated when settings are loaded.
- Built for Windows 10 and later; a persistent Playwright browser profile is used for Google sign-in.
- Focused on polite crawling and human-in-the-loop review for low-confidence rows.