Catalog Weaver

Windows desktop application that enriches automotive aftermarket Excel catalogs with product images, vehicle compatibility, and AI-assisted quality control. It combines polite browser discovery, caching, and export back to Excel.

Supplier-specific catalog data and API credentials are omitted from this public case study.

Overview

Catalog Weaver is a Windows desktop application designed for structured enrichment of auto-parts Excel catalogs.

The application focuses on controlled, resumable workflows, allowing users to:

import supplier catalogs from Excel;
search product images across EU aftermarket sources via browser and HTTP discovery;
infer vehicle compatibility (passenger cars and commercial vehicles);
run an optional AI consensus pipeline to rank candidates and filter poor-quality images;
cache results by brand + article and resume interrupted batches;
export enriched data back to Excel with embedded images.

The system was designed with modular search providers, profile-based configuration, and headless CLI support for batch demos and CI-safe testing.

Context

The project originated from the need to accelerate catalog preparation for aftermarket parts distributors: hundreds or thousands of rows with brand, article, and description, but missing reliable product photos and fitment lists.

The target workflow required:

non-invasive crawling (throttling, single-part concurrency, Google session handling);
Ukrainian and Latin column mapping for real supplier files;
confidence scoring and manual review for uncertain matches;
vehicle compatibility without manual TecDoc lookup for every row;
resumable long runs (cache, skip-processed rows, periodic auto-save);
export that keeps original columns and adds only image URL, embedded image, and compatibility.

A major design requirement was combining traditional scraping with LLM-assisted decision-making without losing auditability or blowing through API budgets on every candidate image.

Responsibilities

Responsibilities included:

overall solution architecture and phased delivery plan;
WPF desktop UI and settings workflow;
search orchestration (Playwright, DuckDuckGo/Bing HTTP, site-scoped providers);
AI consensus pipeline (Gemini pre-rank → final model → rule + vision image quality);
compatibility layer (Google AI overview, OpenAI API fitment);
SQLite and file-based consensus caching;
Excel import/export (ClosedXML);
throttling, failsafe batch stop, logging (Serilog);
CLI for headless batch runs and a CI test slice.

Solution

The solution was implemented as a .NET 8 WPF application with clear layers: Core, Search, Consensus, Compatibility, Cache, Export, Excel, Images, Infrastructure, App, and Cli.

Users can:

open an Excel catalog and map columns (Brand, Article, Description, etc.);
choose a search profile (e.g. EU aftermarket suppliers, catalog-only mode);
start a batch search with live progress and row-level status;
review selected images, compatibility summaries, and watermark warnings;
export {catalog}_results.xlsx with source columns preserved plus three new fields;
clear cache and re-run selected rows after tuning profiles or API keys.
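
The column-mapping step can be illustrated with a minimal sketch, written in Python for brevity (the application itself is C#). The alias table, `normalize_header`, and `map_columns` are hypothetical names; the real header list used for Ukrainian and Latin supplier files is not published.

```python
# Hypothetical alias table: each logical field lists Latin and Ukrainian
# header spellings that a supplier file might use.
HEADER_ALIASES = {
    "brand": {"brand", "бренд", "виробник"},
    "article": {"article", "sku", "артикул"},
    "description": {"description", "опис", "найменування"},
}


def normalize_header(text: str) -> str:
    """Case-fold and collapse whitespace so ' Артикул ' matches 'артикул'."""
    return " ".join(text.casefold().split())


def map_columns(headers: list[str]) -> dict[str, int]:
    """Return {logical field: column index} for headers that match an alias.

    The first matching column wins; unmatched headers are ignored so the
    user can map them by hand.
    """
    mapping: dict[str, int] = {}
    for index, header in enumerate(headers):
        key = normalize_header(header)
        for field, aliases in HEADER_ALIASES.items():
            if key in aliases and field not in mapping:
                mapping[field] = index
    return mapping
```

Columns that the sketch does not recognize, such as a price column, simply stay unmapped, which leaves room for a manual mapping step in the UI.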

AI consensus mode (optional) chains Playwright discovery, Gemini pre-rank, final model selection, and image quality screening. When browser scraping is skipped or returns nothing, it falls back to API-based compatibility.
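
The cost-saving idea behind the two-stage ranking can be sketched as follows, in Python rather than the project's C#. `Candidate`, `select_image`, `top_k`, and `min_quality` are illustrative assumptions, and `vision_check` stands in for whatever vision call the pipeline makes; none of these names come from the actual codebase.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Candidate:
    url: str
    text_score: float  # cheap pre-rank score computed for every candidate


def select_image(
    candidates: list[Candidate],
    vision_check: Callable[[Candidate], float],
    top_k: int = 3,
    min_quality: float = 0.5,
) -> Optional[Candidate]:
    """Run the expensive vision check only on the top_k pre-ranked candidates.

    Returns the shortlisted candidate with the highest quality score, or
    None when nothing passes min_quality (a row for manual review).
    """
    shortlist = sorted(candidates, key=lambda c: c.text_score, reverse=True)[:top_k]
    scored = [(vision_check(c), c) for c in shortlist]
    passing = [(quality, c) for quality, c in scored if quality >= min_quality]
    if not passing:
        return None
    return max(passing, key=lambda pair: pair[0])[1]
```

With this shape, the number of vision calls per row is bounded by `top_k` regardless of how many candidates discovery returns.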

Special attention was paid to polite automation: concurrency of one part at a time, configurable delays, early stop on Google CAPTCHA, and failsafe stop after repeated infrastructure failures.
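
A minimal sketch of this throttling and failsafe behavior, in Python for brevity, might look like the following. `InfrastructureError`, `run_batch`, and the default values are assumptions made for illustration; the real thresholds and error types are not described in this case study.

```python
import time
from typing import Callable, Iterable


class InfrastructureError(Exception):
    """Hypothetical error type for network, browser, or CAPTCHA failures."""


def run_batch(
    parts: Iterable[str],
    process: Callable[[str], str],
    delay_seconds: float = 2.0,
    max_consecutive_failures: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[dict[str, str], bool]:
    """Process parts strictly one at a time with a fixed delay between them.

    Returns (results, stopped). stopped is True when the failsafe tripped
    after max_consecutive_failures infrastructure errors in a row.
    """
    results: dict[str, str] = {}
    consecutive_failures = 0
    for index, part in enumerate(parts):
        if index > 0:
            sleep(delay_seconds)  # polite pause before every request after the first
        try:
            results[part] = process(part)
            consecutive_failures = 0  # any success resets the counter
        except InfrastructureError:
            consecutive_failures += 1
            if consecutive_failures >= max_consecutive_failures:
                return results, True  # stop the batch, keep what was collected
    return results, False
```

Passing `sleep` as a parameter keeps the loop testable without real waiting, which fits the CI-safe testing goal mentioned earlier.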

Technical Details

Stack

C# / .NET 8
WPF (desktop UI)
Playwright (Chrome persistent profile, organic SERP navigation)
AngleSharp / HTML parsers (product page extraction)
ClosedXML (Excel read/write, embedded images)
SQLite (search result cache)
JSON file cache (consensus results)
Serilog (rolling file logs)
HttpClient (Bing, DuckDuckGo, OpenAI-compatible APIs)
Gemini + OpenAI-compatible chat/vision endpoints

Architecture

Core layers:

Search — provider pipeline, discovery service, confidence scoring, visual verification;
Consensus — candidate collection, two-stage evaluator, image quality integrator;
Compatibility — orchestrator, AI client, Google AI overview parser;
Cache — SQLite repository + file consensus cache;
Export — slim Excel export, CSV batch summary, manual review workbook;
App — MainViewModel, settings, profiles, progress and preview UI;
Cli — headless batch for demos and automation.
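
As an illustration of the Cache layer, the sketch below keys search results by brand and article in SQLite. It uses Python's standard `sqlite3` module rather than the project's C# repository, and the table name, columns, and upper-casing rule in `cache_key` are assumptions.

```python
import sqlite3
from typing import Optional


def open_cache(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the cache; the schema here is a hypothetical minimum."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS search_cache ("
        " brand TEXT NOT NULL,"
        " article TEXT NOT NULL,"
        " image_url TEXT,"
        " PRIMARY KEY (brand, article))"
    )
    return conn


def cache_key(brand: str, article: str) -> tuple[str, str]:
    """Trim and upper-case both parts so lookups ignore case and padding."""
    return brand.strip().upper(), article.strip().upper()


def put(conn: sqlite3.Connection, brand: str, article: str, image_url: str) -> None:
    """Insert or overwrite the cached result for one brand + article pair."""
    conn.execute(
        "INSERT OR REPLACE INTO search_cache (brand, article, image_url) VALUES (?, ?, ?)",
        (*cache_key(brand, article), image_url),
    )
    conn.commit()


def get(conn: sqlite3.Connection, brand: str, article: str) -> Optional[str]:
    """Return the cached image URL, or None on a cache miss."""
    row = conn.execute(
        "SELECT image_url FROM search_cache WHERE brand = ? AND article = ?",
        cache_key(brand, article),
    ).fetchone()
    return row[0] if row else None
```

Committing after each write means an interrupted batch loses at most the row in flight, which is the property a resumable run depends on.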

Search profiles (JSON in work folder) define allowed domains, catalog-only mode, compatibility limits, and provider enablement without recompiling.
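
A profile of this kind could look roughly like the example below. The JSON field names (`allowedDomains`, `catalogOnly`, `maxCompatibilityEntries`, `providers`), the placeholder domains, and the Python loader are all hypothetical; the actual profile schema is not part of this case study.

```python
import json
from dataclasses import dataclass
from urllib.parse import urlsplit

# Hypothetical profile file contents; domains use the reserved .example TLD.
EXAMPLE_PROFILE = """
{
  "name": "eu-aftermarket",
  "allowedDomains": ["parts.example", "catalog.example"],
  "catalogOnly": false,
  "maxCompatibilityEntries": 20,
  "providers": {"playwright": true, "bing": true, "duckduckgo": false}
}
"""


@dataclass(frozen=True)
class SearchProfile:
    name: str
    allowed_domains: tuple[str, ...]
    catalog_only: bool
    max_compatibility_entries: int
    enabled_providers: tuple[str, ...]


def load_profile(text: str) -> SearchProfile:
    """Parse profile JSON, applying defaults for optional fields."""
    raw = json.loads(text)
    return SearchProfile(
        name=raw["name"],
        allowed_domains=tuple(d.lower() for d in raw["allowedDomains"]),
        catalog_only=bool(raw.get("catalogOnly", False)),
        max_compatibility_entries=int(raw.get("maxCompatibilityEntries", 20)),
        enabled_providers=tuple(
            name for name, enabled in raw.get("providers", {}).items() if enabled
        ),
    )


def is_allowed(profile: SearchProfile, url: str) -> bool:
    """True when the URL's host is an allowed domain or one of its subdomains."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in profile.allowed_domains)
```

Matching on the full host or a dot-prefixed suffix avoids accepting look-alike hosts that merely contain an allowed domain as a substring.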

Functionality

Implemented functionality includes:

Excel import with column mapping and Ukrainian header detection;
image search via site-scoped web search, Playwright catalog navigation, and image search APIs;
vehicle compatibility (browser + API paths);
AI consensus with two-stage ranking and image quality scoring;
local image download (PNG/JPEG) or URL-only mode;
cache hydration on file reopen (resume without re-scraping);
skip successful rows on re-run;
periodic auto-save (interval + row count);
auto-export after job completion;
consecutive-failure failsafe with log + user notification;
job CSV logs and optional manual-review export;
work-folder portability (database, images, profiles, logs, settings).
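
Two of the items above, periodic auto-save and skipping successful rows, can be sketched in a few lines of Python. The sketch assumes a save is triggered when either the row count or the time interval is reached; the source only says both settings exist, so that rule, the status strings, and the names `AutoSaver` and `pending_rows` are assumptions.

```python
from dataclasses import dataclass


@dataclass
class AutoSaver:
    every_rows: int = 25          # save after this many finished rows
    every_seconds: float = 300.0  # or after this much time since the last save
    rows_since_save: int = 0
    last_save_at: float = 0.0

    def row_finished(self, now: float) -> bool:
        """Record one finished row and report whether a save is due."""
        self.rows_since_save += 1
        due = (
            self.rows_since_save >= self.every_rows
            or now - self.last_save_at >= self.every_seconds
        )
        if due:
            self.rows_since_save = 0
            self.last_save_at = now
        return due


def pending_rows(statuses: dict[str, str]) -> list[str]:
    """Return row ids that still need work; rows marked 'ok' are skipped."""
    return [row for row, status in statuses.items() if status != "ok"]
```

Taking `now` as an argument instead of reading the clock inside the method keeps the policy deterministic under test.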

Challenges

Main challenges included:

anti-bot friction — Google CAPTCHA, Cloudflare on supplier sites, DuckDuckGo HTTP blocks;
article format mismatch — dots, dashes, parentheses in SKU vs page HTML;
image false positives — marketplace watermarks, generic category pages ranked above product pages;
LLM cost vs quality — pre-rank on all candidates, vision only on top images;
long-run reliability — cache before disk write, auto-save during 100+ row batches;
commercial + passenger fitment — prompt and parse scope for trucks vs universal consumables;
protected GitLab workflow — feature branches and merge requests for main.
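
The article format mismatch is commonly handled by comparing normalized forms of the article number. The Python sketch below shows one such approach; the separator set and the function names are assumptions, not the project's actual matching rules, and the part numbers used with it are made up.

```python
import re

# Characters treated as formatting noise inside an article number.
_SEPARATORS = re.compile(r"[\s.\-/()]+")


def normalize_article(article: str) -> str:
    """Strip spaces, dots, dashes, slashes, and parentheses, then upper-case."""
    return _SEPARATORS.sub("", article).upper()


def article_in_text(article: str, page_text: str) -> bool:
    """True when the normalized article occurs in the normalized page text."""
    needle = normalize_article(article)
    return bool(needle) and needle in normalize_article(page_text)
```

Because the page text is normalized as one string, adjacent tokens are glued together, so very short article numbers can match by accident. A sketch like this would need to be paired with a confidence score rather than trusted on its own.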

Another important challenge was keeping the export contract simple for downstream ERP/marketplace tools: original columns untouched, only three enrichment columns added.
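
The export contract can be stated as a small invariant: the output header is the source header plus exactly three columns. The Python sketch below illustrates it with plain lists; the header names in `ENRICHMENT_COLUMNS` are hypothetical, and the embedded-image cell is left empty here on the assumption that the picture itself is attached by the Excel library rather than stored as a cell value.

```python
from typing import Optional

# Hypothetical header names for the three enrichment columns.
ENRICHMENT_COLUMNS = ("Image URL", "Image", "Compatibility")


def build_export_header(source_headers: list[str]) -> list[str]:
    """Append the three enrichment columns without touching source columns."""
    clash = set(source_headers) & set(ENRICHMENT_COLUMNS)
    if clash:
        raise ValueError(f"source already contains enrichment columns: {sorted(clash)}")
    return list(source_headers) + list(ENRICHMENT_COLUMNS)


def build_export_row(
    source_row: list[str],
    image_url: Optional[str],
    compatibility: Optional[str],
) -> list[str]:
    """Copy the source cells unchanged and append the enrichment values."""
    return list(source_row) + [image_url or "", "", compatibility or ""]
```

Rejecting a header clash up front is one way to guarantee that downstream tools never see an original column silently overwritten.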

Result

The project successfully demonstrated:

end-to-end catalog enrichment from Excel and back to Excel;
reusable search provider and profile architecture;
production-oriented AI consensus with fallback to rule-based scoring;
resumable batches via SQLite + consensus cache + skip-processed rows;
failsafe and auto-save for unattended long runs;
modular codebase suitable for CLI, CI tests, and future supplier-specific adapters.

The architecture can serve as a foundation for aftermarket catalog automation, PIM prep, and marketplace listing workflows.

Notes

Client catalog samples and API keys are not published.
AI model IDs and pricing evolve; defaults migrate on settings load.
Built for Windows 10+; a persistent Playwright browser profile is used for Google sign-in.
Focused on polite crawling and human-in-the-loop review for low-confidence rows.