Catalog Weaver

Windows desktop application for enriching automotive aftermarket Excel catalogs with product images, vehicle compatibility, and AI-assisted quality control. It combines polite browser discovery, caching, and export back to Excel.

Supplier-specific catalog data and API credentials are omitted from this public case study.

Category: Desktop Apps
Date: 2026-06
Status: Production-ready MVP
Role: Architecture, backend development, desktop UI, AI pipeline integration
Technologies: C#, .NET 8, WPF, Playwright, ClosedXML, SQLite, Serilog, OpenAI API, Gemini API
Tags: C#, WPF, Automotive, Catalog Enrichment, Web Scraping, AI Consensus, Excel Automation

Overview

Catalog Weaver is a Windows desktop application designed for structured enrichment of auto-parts Excel catalogs.

The application focuses on controlled, resumable workflows, allowing users to:

  • import supplier catalogs from Excel;
  • search product images across EU aftermarket sources via browser and HTTP discovery;
  • infer vehicle compatibility (passenger cars and commercial vehicles);
  • run an optional AI consensus pipeline to rank candidates and filter poor-quality images;
  • cache results by brand + article and resume interrupted batches;
  • export enriched data back to Excel with embedded images.

The system was designed with modular search providers, profile-based configuration, and headless CLI support for batch demos and CI-safe testing.


Context

The project originated from the need to accelerate catalog preparation for aftermarket parts distributors: hundreds or thousands of rows with brand, article, and description, but missing reliable product photos and fitment lists.

The target workflow required:

  • non-invasive crawling (throttling, single-part concurrency, Google session handling);
  • Ukrainian and Latin column mapping for real supplier files;
  • confidence scoring and manual review for uncertain matches;
  • vehicle compatibility without manual TecDoc lookup for every row;
  • resumable long runs (cache, skip-processed rows, periodic auto-save);
  • export that keeps original columns and adds only image URL, embedded image, and compatibility.

A major design requirement was combining traditional scraping with LLM-assisted decision-making without losing auditability or blowing through API budgets on every candidate image.


Responsibilities

Responsibilities included:

  • overall solution architecture and phased delivery plan;
  • WPF desktop UI and settings workflow;
  • search orchestration (Playwright, DuckDuckGo/Bing HTTP, site-scoped providers);
  • AI consensus pipeline (Gemini pre-rank → final model → rule + vision image quality);
  • compatibility layer (Google AI overview, OpenAI API fitment);
  • SQLite and file-based consensus caching;
  • Excel import/export (ClosedXML);
  • throttling, failsafe batch stop, logging (Serilog);
  • CLI for headless batch runs and a CI-safe test slice.

Solution

The solution was implemented as a .NET 8 WPF application with clear layers: Core, Search, Consensus, Compatibility, Cache, Export, Excel, Images, Infrastructure, App, and Cli.

Users can:

  • open an Excel catalog and map columns (Brand, Article, Description, etc.);
  • choose a search profile (e.g. EU aftermarket suppliers, catalog-only mode);
  • start a batch search with live progress and row-level status;
  • review selected images, compatibility summaries, and watermark warnings;
  • export {catalog}_results.xlsx with source columns preserved plus three new fields;
  • clear cache and re-run selected rows after tuning profiles or API keys.

AI consensus mode (optional) chains Playwright discovery, Gemini pre-rank, final model selection, and image quality screening, then falls back to API-based compatibility when browser scraping is skipped or returns no results.
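The cost-control idea behind the consensus chain can be sketched as follows. This is an illustrative Python sketch, not the application's C# code: `Candidate`, `select_image`, `top_k`, and `min_quality` are hypothetical names, and the two scoring callables stand in for the pre-rank model and the vision-based quality check.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Candidate:
    url: str
    rule_score: float  # hypothetical cheap heuristic score (no API call needed)


def select_image(
    candidates: list[Candidate],
    pre_rank: Callable[[Candidate], float],
    vision_check: Callable[[Candidate], float],
    top_k: int = 3,
    min_quality: float = 0.5,
) -> Optional[Candidate]:
    """Score every candidate cheaply, then run the costly check on the top_k only."""
    if not candidates:
        return None
    try:
        ranked = sorted(candidates, key=pre_rank, reverse=True)
    except Exception:
        # Assumed fallback: if the pre-rank model fails, use rule-based scores.
        ranked = sorted(candidates, key=lambda c: c.rule_score, reverse=True)
    for candidate in ranked[:top_k]:
        if vision_check(candidate) >= min_quality:
            return candidate
    return None  # nothing passed screening; the row goes to manual review
```

Because the expensive check runs on at most `top_k` images per part, its cost stays bounded regardless of how many candidates discovery returns.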

Special attention was paid to polite automation: concurrency of one part at a time, configurable delays, early stop on Google CAPTCHA, and failsafe stop after repeated infrastructure failures.
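A minimal sketch of such a batch loop is shown below, in Python for brevity (the application itself is C#). The exception types, status strings, and default values are assumptions made for illustration; only the behaviors (sequential processing, a delay between parts, stop on CAPTCHA, stop after repeated infrastructure failures) come from the description above.

```python
import time
from typing import Callable, Iterable


class CaptchaDetected(Exception):
    """Hypothetical signal that the search engine served a CAPTCHA."""


class InfrastructureError(Exception):
    """Hypothetical signal for network or browser failures (not 'no result')."""


def run_batch(
    parts: Iterable,
    process_part: Callable[[object], None],
    delay_seconds: float = 2.0,
    max_consecutive_failures: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Process parts strictly one at a time and return the reason the run ended."""
    consecutive_failures = 0
    first = True
    for part in parts:
        if not first:
            sleep(delay_seconds)  # throttle between parts
        first = False
        try:
            process_part(part)
            consecutive_failures = 0  # any success resets the failsafe counter
        except CaptchaDetected:
            return "stopped: captcha"  # stop early instead of retrying
        except InfrastructureError:
            consecutive_failures += 1
            if consecutive_failures >= max_consecutive_failures:
                return "stopped: failsafe"
    return "completed"
```

Counting only consecutive failures means an occasional timeout does not abort a long run, while a dead network or crashed browser stops it quickly.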


Technical Details

Stack

  • C# / .NET 8
  • WPF (desktop UI)
  • Playwright (Chrome persistent profile, organic SERP navigation)
  • AngleSharp / HTML parsers (product page extraction)
  • ClosedXML (Excel read/write, embedded images)
  • SQLite (search result cache)
  • JSON file cache (consensus results)
  • Serilog (rolling file logs)
  • HttpClient (Bing, DuckDuckGo, OpenAI-compatible APIs)
  • Gemini + OpenAI-compatible chat/vision endpoints

Architecture

Core layers:

  • Search — provider pipeline, discovery service, confidence scoring, visual verification;
  • Consensus — candidate collection, two-stage evaluator, image quality integrator;
  • Compatibility — orchestrator, AI client, Google AI overview parser;
  • Cache — SQLite repository + file consensus cache;
  • Export — slim Excel export, CSV batch summary, manual review workbook;
  • App — MainViewModel, settings, profiles, progress and preview UI;
  • Cli — headless batch for demos and automation.

Search profiles (JSON files in the work folder) define allowed domains, catalog-only mode, compatibility limits, and provider enablement, so search behavior can be changed without recompiling.
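To make the profile idea concrete, here is a hypothetical profile and a small loader, sketched in Python rather than the application's C#. Every field name (`allowedDomains`, `catalogOnly`, `maxCompatibilityEntries`, `providers`) and both domains are invented for illustration; the real schema is not published.

```python
import json
from urllib.parse import urlparse

# Hypothetical profile content; the field names are assumptions.
PROFILE_JSON = """
{
  "name": "eu-aftermarket",
  "allowedDomains": ["example-parts.eu", "example-catalog.de"],
  "catalogOnly": false,
  "maxCompatibilityEntries": 20,
  "providers": {"playwright": true, "bing": true, "duckduckgo": false}
}
"""


def load_profile(text: str) -> dict:
    """Parse a profile document from JSON text."""
    return json.loads(text)


def is_allowed(url: str, profile: dict) -> bool:
    """True if the URL's host is an allowed domain or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(
        host == domain or host.endswith("." + domain)
        for domain in profile["allowedDomains"]
    )


def enabled_providers(profile: dict) -> list[str]:
    """Names of providers switched on in the profile, in file order."""
    return [name for name, enabled in profile["providers"].items() if enabled]
```

Matching on the exact host or a dot-prefixed suffix avoids accepting look-alike hosts that merely end with an allowed domain's text.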

Functionality

Implemented functionality includes:

  • Excel import with column mapping and Ukrainian header detection;
  • image search via site-scoped web, Playwright catalogs, image search APIs;
  • vehicle compatibility (browser + API paths);
  • AI consensus with two-stage ranking and image quality scoring;
  • local image download (PNG/JPEG) or URL-only mode;
  • cache hydration on file reopen (resume without re-scraping);
  • skip successful rows on re-run;
  • periodic auto-save (interval + row count);
  • auto-export after job completion;
  • consecutive-failure failsafe with log + user notification;
  • job CSV logs and optional manual-review export;
  • work-folder portability (database, images, profiles, logs, settings).
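Two of the resumability items above, skipping successful rows and periodic auto-save by interval and row count, can be sketched like this. The sketch is in Python for brevity (the application is C#), and `AutoSaver`, `pending_rows`, the `status` field, and the default thresholds are all hypothetical.

```python
import time
from typing import Callable


class AutoSaver:
    """Calls `save` when enough rows or enough time has passed since the last save."""

    def __init__(
        self,
        save: Callable[[], None],
        every_rows: int = 25,
        every_seconds: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._save = save
        self._every_rows = every_rows
        self._every_seconds = every_seconds
        self._clock = clock
        self._rows_since_save = 0
        self._last_save = clock()

    def row_done(self) -> bool:
        """Record one finished row; return True if a save was triggered."""
        self._rows_since_save += 1
        elapsed = self._clock() - self._last_save
        if self._rows_since_save >= self._every_rows or elapsed >= self._every_seconds:
            self._save()
            self._rows_since_save = 0
            self._last_save = self._clock()
            return True
        return False


def pending_rows(rows: list[dict]) -> list[dict]:
    """Rows that still need processing on a re-run (anything not marked success)."""
    return [row for row in rows if row.get("status") != "success"]
```

Combining both triggers covers fast batches (the row count fires first) and slow ones where a single part can take a long time (the interval fires first).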

Challenges

Main challenges included:

  • anti-bot friction — Google CAPTCHA, Cloudflare on supplier sites, DuckDuckGo HTTP blocks;
  • article format mismatch — dots, dashes, parentheses in SKU vs page HTML;
  • image false positives — marketplace watermarks, generic category pages ranked above product pages;
  • LLM cost vs quality — pre-rank on all candidates, vision only on top images;
  • long-run reliability — cache before disk write, auto-save during 100+ row batches;
  • commercial + passenger fitment — prompt and parse scope for trucks vs universal consumables;
  • protected GitLab workflow — feature branches and merge requests for main.
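The article format mismatch in the list above is typically handled by comparing normalized forms. The following is a minimal Python sketch under that assumption (the application itself is C#, and its actual normalization rules are not published). The function names, the separator set, and the `BRAND|ARTICLE` key format are hypothetical.

```python
import re

# Separators commonly found in part numbers: whitespace, dots, dashes,
# underscores, slashes, and parentheses.
_SEPARATORS = re.compile(r"[\s.\-_/()]+")


def normalize_article(article: str) -> str:
    """Strip separators and upper-case so differently formatted SKUs compare equal."""
    return _SEPARATORS.sub("", article).upper()


def cache_key(brand: str, article: str) -> str:
    """Hypothetical brand + article key for cache lookups."""
    return f"{brand.strip().upper()}|{normalize_article(article)}"


def page_mentions_article(article: str, page_text: str) -> bool:
    """Check whether the normalized article occurs in the normalized page text.

    Removing spaces from the page text can join neighboring tokens, so a hit
    here is a hint for confidence scoring rather than proof of a match.
    """
    return normalize_article(article) in normalize_article(page_text)
```

For example, `normalize_article("0 986.494-104 (A)")` yields `"0986494104A"`, so a catalog row and a product page that format the same SKU differently resolve to one cache key.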

Another important challenge was keeping the export contract simple for downstream ERP/marketplace tools: original columns untouched, only three enrichment columns added.


Result

The project successfully demonstrated:

  • end-to-end catalog enrichment from Excel and back to Excel;
  • reusable search provider and profile architecture;
  • production-oriented AI consensus with fallback to rule-based scoring;
  • resumable batches via SQLite + consensus cache + skip-processed rows;
  • failsafe and auto-save for unattended long runs;
  • modular codebase suitable for CLI, CI tests, and future supplier-specific adapters.

The architecture can serve as a foundation for aftermarket catalog automation, PIM prep, and marketplace listing workflows.


Notes

  • Client catalog samples and API keys are not published.
  • AI model IDs and pricing evolve; defaults migrate on settings load.
  • Built for Windows 10+; a persistent Playwright browser profile is used for Google sign-in.
  • Focused on polite crawling and human-in-the-loop review for low-confidence rows.
