
Case study · 2025 to present

Ferremundo AI: fine-tuned retrieval, image embeddings, and LangGraph ordering for B2B hardware distribution

An AI product-matching and ordering layer for a Latin American B2B hardware distributor. Fine-tuned multilingual E5 text embeddings, contrastively fine-tuned SigLIP-2 image search, a LangGraph agent with Postgres or Redis checkpointing, and a production stack on AWS RDS, ECS, and Datadog APM.

Forward Deployed Engineer · Direct client engagement

Stack

Python 3.13 · FastAPI · LangGraph · PostgreSQL (AWS RDS) · Alembic · sentence-transformers · fine-tuned multilingual-E5-base · fine-tuned SigLIP-2 (ONNX) · BGE reranker (ONNX) · rank-bm25 · Polars · RapidFuzz · Hugging Face Hub · boto3 · Datadog APM · AWS ECS · AWS CDK · Docker

Outcomes

  • Production matching pipeline reaches 64.7% top-1 / 83.7% top-5 on a labeled 374-query benchmark; the agent path holds at 64.2% / 83.2% with about 2.1s extra latency per turn.
  • Multi-turn chat fixtures (53 conversations) reach 96.2% follow-up accuracy after synonym redirection and routing fixes.
  • Switching to embedding-first retrieval with the fine-tuned E5 raises top-1 from 60.7% to 64.2% and top-5 from 75.7% to 77.3%, and cuts mean latency from 724 ms to 534 ms.
  • Fine-tuned multilingual E5-base (published to the Hub) trained with CachedMultipleNegativesRankingLoss on ~46k GPT-4o-mini synthetic Spanish queries, ~1.2k benchmark positives, ~1.2k synonym pairs, and hard negatives mined from catalog embeddings.
  • SigLIP-2 image encoder contrastively fine-tuned on 6,251 product images, exported to ONNX at ~99 ms per query on CPU; INT8 quantization collapsed Recall@5 from 64% to 13% for only a 1.3x speedup, so it was rejected and documented.
  • LangGraph ReAct agent with JsonPlusSerializer checkpointing across InMemory, AsyncPostgres, and AsyncRedis backends; deterministic bypass for short and SKU-shaped first turns.
  • Production worker fits in ~3.2 GB RAM with sentence-transformers, ONNX reranker, ONNX image encoder, and both text and image embedding matrices loaded in process.

What I own

The retrieval and matching stack end to end: catalog modelling on Postgres, the embedding strategy (text and image), the FastAPI matching surface, the LangGraph ordering agent and its checkpointing model, the incremental warehouse sync that keeps the catalog and embedding matrices fresh, and the production posture (Datadog APM, structured JSON logs, AWS CDK, ECS rolling deploys). On top of matching, I also own the agentic ordering flows that take a reseller from natural-language intent to a structured order without leaving the conversation.

The retrieval pipeline

A hybrid pipeline with a fine-tuned multilingual E5-base as the first-class encoder. BM25 anchors rare drug-name-style tokens, the dense encoder handles paraphrase and Spanish slang, a BGE cross-encoder reranks the merged pool, and a synonym and alias table corrects long-tail miscalls. Embedding-first routing skips the heavier fuzzy and reranker work for queries that confidently resolve at the encoder layer, which is what produced the +3.5 pp top-1 and -190 ms latency move. A separate Doc2query enrichment pass, which concatenates synthetic queries into catalog text, gave another lift to 81.5% top-5.
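As a sketch of that routing decision, with illustrative thresholds, pool sizes, and a hypothetical `rerank` callable standing in for the production values:

```python
# Embedding-first routing: answer from the dense encoder alone when it
# is confident, else fall back to the full hybrid pipeline.
import numpy as np
from rank_bm25 import BM25Okapi

DENSE_CONFIDENCE = 0.82   # illustrative cosine threshold
DENSE_MARGIN = 0.05       # illustrative gap over the runner-up

def match(query: str, encoder, emb_matrix: np.ndarray, skus: list[str],
          bm25: BM25Okapi, rerank) -> list[str]:
    # E5 expects a "query: " prefix; embeddings are L2-normalized, so
    # the dot product below is cosine similarity.
    q = encoder.encode(f"query: {query}", normalize_embeddings=True)
    scores = emb_matrix @ q
    order = np.argsort(scores)[::-1]

    # Confident dense hit: skip fuzzy matching and the cross-encoder.
    if (scores[order[0]] >= DENSE_CONFIDENCE
            and scores[order[0]] - scores[order[1]] >= DENSE_MARGIN):
        return [skus[i] for i in order[:5]]

    # Otherwise merge the dense and BM25 candidate pools and let the
    # BGE cross-encoder rerank the union.
    bm25_scores = bm25.get_scores(query.split())
    pool = set(order[:50].tolist()) | set(np.argsort(bm25_scores)[::-1][:50].tolist())
    return rerank(query, [skus[i] for i in pool])[:5]
```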

Embedding finetune (text)

Backbone intfloat/multilingual-e5-base, fine-tuned with CachedMultipleNegativesRankingLoss on a four-source training set: GPT-4o-mini synthetic Spanish queries (~46k), labeled benchmark positives, synonym-derived positive pairs, and hard negatives mined by nearest neighbors over catalog embeddings in Postgres. Best checkpoint chosen by cosine NDCG@10 on a held-out slice. The frozen baseline lands at Acc@1 45.3%, Acc@5 64.0%, NDCG@10 55.0%; the fine-tuned model is what unlocks the production accuracy and latency posture.
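A minimal sketch of that training setup with sentence-transformers, assuming toy data and illustrative hyperparameters (the real run mixes the four sources above and selects the checkpoint by cosine NDCG@10):

```python
# Fine-tuning multilingual-e5-base with cached MNRL (illustrative values).
from datasets import Dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import CachedMultipleNegativesRankingLoss

model = SentenceTransformer("intfloat/multilingual-e5-base")

# (anchor, positive, negative) triples: synthetic Spanish queries and
# synonym pairs on the query side, catalog-mined hard negatives.
# E5 convention: "query: " prefix on queries, "passage: " on catalog text.
train = Dataset.from_dict({
    "anchor":   ["query: tornillo autoperforante 1/2"],
    "positive": ["passage: Tornillo autoperforante punta broca 1/2 pulg"],
    "negative": ["passage: Tornillo para madera cabeza plana 1/2 pulg"],
})

# Cached MNRL uses gradient caching so the in-batch-negative pool can be
# much larger than what fits in GPU memory in one forward pass.
loss = CachedMultipleNegativesRankingLoss(model, mini_batch_size=32)

args = SentenceTransformerTrainingArguments(
    output_dir="e5-ferremundo",
    num_train_epochs=2,
    per_device_train_batch_size=256,  # bigger batch = more in-batch negatives
    learning_rate=2e-5,
)
SentenceTransformerTrainer(
    model=model, args=args, train_dataset=train, loss=loss
).train()
```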

Image search (vision)

A separate pipeline trains a contrastive image-text encoder on top of google/siglip2-base-patch16-256 using SigLIP’s sigmoid loss, evaluated by image-to-text Recall@1/5/10 on a deterministic held-out 10% split. Training data is 6,251 active-SKU thumbnails downloaded from the staging API and aligned with synthetic-query text. The fine-tuned weights are exported to ONNX and serve at about 99 ms per query on CPU. INT8 quantization was tried, recorded, and rejected: Recall@5 went from 64% to 13% for only a 1.3x speedup. This is the canonical example I now reach for when someone proposes “just quantize the encoder.”
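For reference, SigLIP’s pairwise sigmoid loss in a short PyTorch sketch; the batch wiring and parameter handling are illustrative, but the loss itself follows the published formulation:

```python
import torch
import torch.nn.functional as F

def siglip_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                log_t: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """img_emb, txt_emb: (N, D), L2-normalized; log_t, b: learned scalars."""
    logits = img_emb @ txt_emb.T * log_t.exp() + b  # (N, N) pairwise scores
    # +1 on the diagonal (matched image-text pairs), -1 everywhere else.
    labels = 2 * torch.eye(logits.size(0), device=logits.device) - 1
    # Each pair is an independent binary decision; no batch-wide softmax,
    # which is what distinguishes this loss from CLIP's.
    return -F.logsigmoid(labels * logits).sum() / logits.size(0)
```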

Agentic ordering (LangGraph)

A LangGraph ReAct loop (llm_call → conditional tool edges → tool_node → back) with two tools: catalog_search, which calls the real matching pipeline, and follow_up_resolve, which uses a FollowUpResolver over the last set of candidates. JsonPlusSerializer whitelists ProductCard, AgentSearchResult, and ChatResponse for checkpoint serialization. The checkpointer is selectable at deploy time across InMemorySaver, AsyncPostgresSaver, and AsyncRedisSaver. Trivial first turns (one token, a SKU-shaped string) bypass the LLM entirely; non-trivial first turns inject the top-2 catalog vocabulary hints (about 20 ms) before the model is allowed to run. The agentic path costs about 2.1s extra latency for a 0.5 pp accuracy delta, which is the right trade for an ordering surface but the wrong trade for a search-only one, and the architecture lets the same service expose both.
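A stripped-down sketch of that wiring; the tool bodies and the model choice are placeholders (the real tools call the matching pipeline and the FollowUpResolver), but the graph shape follows LangGraph’s standard ReAct pattern:

```python
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI
from langgraph.graph import StateGraph, MessagesState, START
from langgraph.prebuilt import ToolNode, tools_condition
from langgraph.checkpoint.memory import InMemorySaver

@tool
def catalog_search(query: str) -> str:
    """Match free-text intent against the product catalog."""
    return "..."  # placeholder: production calls the matching pipeline

@tool
def follow_up_resolve(reference: str) -> str:
    """Resolve 'the second one'-style references against the last candidates."""
    return "..."  # placeholder: production uses the FollowUpResolver

tools = [catalog_search, follow_up_resolve]
llm = ChatOpenAI(model="gpt-4o-mini").bind_tools(tools)  # illustrative model

def llm_call(state: MessagesState):
    return {"messages": [llm.invoke(state["messages"])]}

graph = StateGraph(MessagesState)
graph.add_node("llm_call", llm_call)
graph.add_node("tools", ToolNode(tools))
graph.add_edge(START, "llm_call")
graph.add_conditional_edges("llm_call", tools_condition)  # tools or END
graph.add_edge("tools", "llm_call")

# Checkpointer is selectable at deploy time; InMemorySaver shown here,
# AsyncPostgresSaver / AsyncRedisSaver in production. Each conversation
# is invoked with config={"configurable": {"thread_id": ...}}.
agent = graph.compile(checkpointer=InMemorySaver())
```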

In-memory caching

At startup, the match service hydrates catalog rows, signals, the alias map, the BM25 index (pickled from Postgres or rebuilt), the text embedding matrix and SKU ordering, optional vocabulary and synonyms, and the ONNX reranker. The image-search service similarly loads the image embedding matrix and the SigLIP ONNX model. Total resident set lands around 3.2 GB per worker. A nightly sync job upserts warehouse data, embeddings, image embeddings, and BM25 artifacts; operators roll or restart API tasks after a sync so workers reload from the new state. Treating “the cache is reloading fat matrices after batch sync” as a first-class concern (not a side effect) is what keeps the agent honest in production.
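A minimal sketch of that hydration pattern through FastAPI’s lifespan hook; the loader here is a placeholder for what are really Postgres and S3 reads:

```python
from contextlib import asynccontextmanager
import numpy as np
from fastapi import FastAPI

def load_text_embedding_matrix() -> tuple[np.ndarray, list[str]]:
    # Placeholder: production loads the fine-tuned E5 matrix and the
    # matching SKU order from synced artifacts.
    return np.zeros((1, 768), dtype=np.float32), ["SKU-0001"]

state: dict = {}

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Hydrate once per worker; request handlers only read from RAM.
    # Production also loads catalog rows, signals, the alias map, the
    # BM25 index, vocabulary/synonyms, and the ONNX reranker here.
    state["emb"], state["skus"] = load_text_embedding_matrix()
    yield
    # No hot reload: tasks are rolled after the nightly sync so workers
    # come back up against the fresh matrices.
    state.clear()

app = FastAPI(lifespan=lifespan)
```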

Production posture

Multi-stage Docker images, ddtrace-run wrapping the entrypoint when Datadog is enabled, AWS RDS Postgres, Alembic migrations on startup with a SKIP_MIGRATIONS escape hatch, S3 model sync from a ferremundo-models bucket, CircleCI for black/pytest plus ECR build and ECS rolling deploy to a dev cluster, and structured JSON logs with Datadog trace correlation. The health endpoint exposes embedding coverage and image index size, which is what I look at first when something has shifted.
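A hypothetical shape for that health payload; the field names and values are illustrative, not the production contract:

```python
from fastapi import FastAPI

app = FastAPI()

# Filled in by the startup hydration step; placeholder values here.
EMBEDDING_COVERAGE = 0.997  # fraction of active SKUs with a text vector
IMAGE_INDEX_SIZE = 6251     # rows in the image embedding matrix

@app.get("/health")
def health() -> dict:
    # Coverage drift after a bad sync shows up here before it shows up
    # in accuracy dashboards.
    return {
        "status": "ok",
        "embedding_coverage": EMBEDDING_COVERAGE,
        "image_index_size": IMAGE_INDEX_SIZE,
    }
```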

Lessons

B2B hardware search fails in ways leaderboard models never measure. Customers type one word or a slang synonym, and the right SKU is often buried behind a popular default, so you spend more time on disambiguation signals and alias learning than on raw retrieval tricks. Fine-tuned E5 only pays off once you stop treating embeddings as a side channel: embedding-first routing was the largest accuracy move and it lowered latency by skipping fuzzy work most of the time. Hard negatives from your own catalog matter as much as synthetic positives; otherwise the cross-encoder and the alias table fight you when expansion changes surface forms across query and title. Agents look like a small wrapper, but they need checkpointed state and deterministic bypasses for short inputs, or you burn seconds on LLM routing for every SKU lookup. Image search is not “quantize and forget”; INT8 wrecked recall with almost no speed win, which is the kind of result you only get from measuring on real product photos. The “cache” is mostly reloading fat matrices after batch sync, which is simple, but it means deployments and cron are part of the correctness story, not an afterthought.

This engagement is ongoing; this page will be updated as the system evolves.
