
Why a librarian beats a vector black box

Field notes · Engineering · ~7 min read

Most "memory for agents" products are a vector database with a nice name. You embed everything, you do a similarity search, you hand the top hits to the model. It works until it doesn't, and when it doesn't, you have no idea why. That "no idea why" is the whole problem. Let me explain it with a bug we actually shipped, caught, and fixed — because the bug is the argument.

The promise and the silence

A retrieval system has exactly one job: when the thing you need is in there, give it back. Recall. If you ask for what you stored and get nothing, the system didn't just underperform — it broke its only promise. And the cruel part of pure vector search is that it breaks this promise silently. You get back three plausible-looking results. You don't get back the one that mattered. Nothing tells you it was missed. The black box returned an answer, so it looks like it worked.

We hit exactly this. A note was in the store — correct project, not deleted, plainly relevant. A reasonable query returned other notes and not that one. Zero indication anything was wrong. If we'd trusted the system the way you're supposed to trust a search box, we'd never have known the most relevant note was sitting right there, invisible.

"Sometimes it misses" is not a diagnosis

The first instinct — and I'll own that it was mine — was to file it under edge cases, tune later. Retrieval is fuzzy, ranking is approximate, sometimes the right thing ranks fourth. Ship it, refine in the background.

That instinct is wrong, and it's worth saying why. A retrieval system is deterministic. Same query, same corpus, same result, every time. "Sometimes it misses" is not a property of the system; it's a confession that you don't yet know the rule that governs the miss. There is always a rule. The note doesn't flip a coin about whether to show up. If it's missing, something specific is excluding it, and "sometimes" is just the name you give a cause you haven't found yet.

So we stopped tuning and started asking the only useful question: what exactly is different about the query that fails?

Following the thread

Here's where having a librarian instead of a black box pays for itself.

A vector black box can't answer "why did you miss this." There's no why to inspect — just cosine distances in a space no human reads. But a system that can show its work — what terms it parsed your query into, what it matched on, what it scored and why — lets you run the failure down to its root.
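To make "show its work" concrete, here is a minimal sketch of a search that returns an explanation alongside every score. Everything in it is an assumption for illustration: `parse_terms`, `Explanation`, and `explain_search` are hypothetical names, and the scoring is a toy fraction of matched terms, not how Retia or any real engine ranks.

```python
import re
from dataclasses import dataclass


def parse_terms(text: str) -> list[str]:
    # Hypothetical tokenizer: lowercase runs of letters and digits.
    return re.findall(r"[a-z0-9]+", text.lower())


@dataclass
class Explanation:
    note_id: str
    query_terms: list[str]    # what the query was parsed into
    matched_terms: list[str]  # which of those terms the note contains
    score: float              # toy score: fraction of query terms matched


def explain_search(query: str, notes: dict[str, str]) -> list[Explanation]:
    query_terms = parse_terms(query)
    results = []
    for note_id, body in notes.items():
        note_terms = set(parse_terms(body))
        matched = [t for t in query_terms if t in note_terms]
        score = len(matched) / len(query_terms) if query_terms else 0.0
        # Zero-score notes are kept on purpose, so a miss can be inspected too.
        results.append(Explanation(note_id, query_terms, matched, score))
    return sorted(results, key=lambda e: e.score, reverse=True)


if __name__ == "__main__":
    notes = {"a": "retry budget for the ingest worker", "b": "lunch menu"}
    for hit in explain_search("ingest retry", notes):
        print(hit)
```

The design choice that matters is that misses are reported with the same detail as hits. When a note you expected has an empty `matched_terms` list, you can see which parsed term failed to line up.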

We built a query matrix. Same target note, systematically varied queries, watch which ones find it and which don't:

Query with the plain content words → found.
Query with a particular identifier in it → missed.
Query with that identifier removed → found again.

The pattern wasn't "multi-word queries are flaky." It was specific and reproducible: a certain shape of token broke the match every single time. Not sometimes. Every time. The "sometimes" dissolved the moment we looked, exactly as it always does.
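A query matrix like this needs very little machinery. The sketch below is one way to write the harness, under stated assumptions: `query_matrix` and `toy_search` are hypothetical names, the notes and the identifier `REF-12/34` are made up, and the stand-in search simply requires every whitespace-separated query term to appear in the note.

```python
from typing import Callable

SearchFn = Callable[[str], list[str]]  # query -> ids of the notes returned


def query_matrix(
    search: SearchFn, target_id: str, queries: list[str]
) -> list[tuple[str, bool]]:
    """Run each query and record whether the target note came back."""
    return [(q, target_id in search(q)) for q in queries]


# Made-up corpus for illustration only.
NOTES = {
    "note-1": "refund policy dispute REF-12/34 escalated",
    "note-2": "refund policy overview",
}


def toy_search(query: str) -> list[str]:
    # Stand-in search: a note matches only if it contains every query term
    # as a whole whitespace-separated token.
    terms = query.lower().split()
    return [
        note_id
        for note_id, body in NOTES.items()
        if all(t in body.lower().split() for t in terms)
    ]


matrix = query_matrix(
    toy_search,
    "note-1",
    ["refund dispute", "refund dispute 34", "refund dispute"],
)

if __name__ == "__main__":
    for query, found in matrix:
        print(f"{query!r:<22} -> {'found' if found else 'MISSED'}")
```

Because the harness only takes a callable, the same matrix can be rerun unchanged against a fixed search layer, which is what turns a one-off investigation into a regression check.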

The actual culprit was a tokenizer detail — the full-text layer was treating a slashed identifier (CASE-006/007) as a single path-like token, so the 007 inside it never became a searchable term on its own. Query for 007, and the note that obviously contained it was unreachable. A boring, mechanical, completely deterministic cause. Once named, a one-line normalization fixed it, and the query matrix went all-green.
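The shape of that failure can be reproduced in a few lines. This is a sketch, not our production tokenizer: `tokenize_pathlike` and `tokenize_normalized` are hypothetical names, whitespace splitting stands in for the full-text layer's behavior, and splitting on the slash is only one plausible form the one-line normalization could take.

```python
def tokenize_pathlike(text: str) -> set[str]:
    # Assumed buggy behavior: split on whitespace only, so a slashed
    # identifier such as "CASE-006/007" survives as one path-like token.
    return {token.lower() for token in text.split()}


def tokenize_normalized(text: str) -> set[str]:
    # Assumed fix: also index the text with slashes turned into spaces,
    # keeping the original full token so exact identifier lookups still work.
    return tokenize_pathlike(text) | tokenize_pathlike(text.replace("/", " "))


note = "Escalated CASE-006/007 to billing"

if __name__ == "__main__":
    print("007" in tokenize_pathlike(note))    # the term is unreachable
    print("007" in tokenize_normalized(note))  # the term is searchable
```

Keeping the unsplit token in the index is a deliberate choice in this sketch, so a query for the whole identifier keeps matching after the fix.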

The point isn't the bug. It's that we could see the bug.

Sit with the counterfactual. In a pure vector system, the same class of failure exists: tokenization, normalization, embedding staleness, any of them can quietly drop the right document. But there's no query matrix to run, because there are no terms to inspect, no match to explain. You'd have shrugged, called it "semantic search being semantic search," and shipped a product that drops the occasional critical note with a perfectly straight face. Your users would hit it, lose the thing they stored, and conclude, correctly, that they can't trust it.

This is why we don't run a vector black box. We run a librarian. A librarian can be wrong, but a librarian can be asked why — it can point at the catalog, show you what it searched, tell you the rule it followed. That auditability isn't a debugging nicety. For a retrieval system, it's the difference between a bug you can fix and a bug you can only apologize for.

To be clear, this isn't "vectors bad." Semantic similarity is genuinely useful and it's part of our stack. The argument is against the black box — against a retrieval layer you cannot interrogate. Embeddings as one ranked signal among several, with the reasoning surfaced: good. Embeddings as an opaque oracle whose misses are unaccountable: that's the thing that quietly loses your work.
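Here is roughly what "one ranked signal among several, with the reasoning surfaced" can look like. Treat it as a sketch under assumptions: `ScoredHit`, `cosine`, and `hybrid_rank` are hypothetical names, the equal weights are arbitrary, and the two-dimensional vectors are hand-written stand-ins for real embeddings.

```python
import math
from dataclasses import dataclass


@dataclass
class ScoredHit:
    note_id: str
    lexical: float   # fraction of query terms found in the note
    semantic: float  # cosine similarity between query and note vectors
    total: float     # weighted sum of the two signals


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_rank(
    query_terms: set[str],
    query_vec: list[float],
    notes: dict[str, tuple[set[str], list[float]]],
    w_lex: float = 0.5,  # arbitrary illustrative weights
    w_sem: float = 0.5,
) -> list[ScoredHit]:
    hits = []
    for note_id, (terms, vec) in notes.items():
        lexical = len(query_terms & terms) / len(query_terms) if query_terms else 0.0
        semantic = cosine(query_vec, vec)
        total = w_lex * lexical + w_sem * semantic
        # Each signal is kept on the hit, so the ranking can be explained.
        hits.append(ScoredHit(note_id, lexical, semantic, total))
    return sorted(hits, key=lambda h: h.total, reverse=True)


if __name__ == "__main__":
    notes = {
        "a": ({"case-006", "007"}, [1.0, 0.0]),
        "b": ({"lunch"}, [0.0, 1.0]),
    }
    for hit in hybrid_rank({"007"}, [1.0, 0.0], notes):
        print(hit)
```

The per-signal fields are the point. A note that ranks low with `lexical` at zero and `semantic` high tells a different story from one where both are low, and only the first suggests a tokenization problem.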

What this buys you

The discipline that came out of this is small and it generalizes:

If retrieval ever fails to return something you know is in there, that's not a tuning issue. It's a root-cause issue. Find the rule.

And the architectural commitment underneath it: a retrieval system has to be able to explain its misses. If it can't, you're not running search. You're running a slot machine that occasionally swallows the one note you needed and never tells you it did.

A black box returns an answer. A librarian returns an answer and the reason, and when the reason is wrong, you get to fix it instead of guessing.

Retia's retrieval layer is a librarian, not a black box: hybrid full-text and semantic search that surfaces what it matched and why, so a missed note is a bug you can chase to its root — not a silence you have to trust.
