Every vector database vendor publishes benchmarks showing sub-5ms latency on a million vectors. Unfiltered. Single client. Static dataset. Then you deploy the thing, add a WHERE tenant_id = 'acme' AND status = 'active' clause, point 100 users at it, and watch P99 latency climb past 800ms.

Benchmarks Test the Wrong Thing

The standard ANN-Benchmarks suite uses SIFT and GIST datasets at 128 dimensions. Modern LLM embeddings sit at 1536 to 3072 dimensions. That alone makes historical comparisons useless — but dimensionality isn't even the real gap.

The real gap: benchmarks don't filter. In production, almost nobody runs bare similarity search. You filter by tenant, by date range, by document type, by access permissions. Reddit's engineering team learned this the hard way when they scaled to 340 million vectors. As concurrent users grew, their database spent more time resolving metadata filters than computing similarity distances. The CPU sat idle, blocked on disk I/O shuttling data between the vector graph and the relational metadata store, and P99 latency spiked 10x over the published averages.

It gets worse. Actian's 2026 evaluation guide surfaced an uncomfortable detail: roughly 30% of vector databases include DeWitt Clauses in their terms of service — legal language that prohibits publishing independent benchmark results without vendor permission. If you can't reproduce a vendor's numbers on your data, with your filters, at your concurrency level, what exactly did the benchmark prove?

Five things production benchmarks should measure but almost never do: metadata filtering under concurrent load, continuous ingestion impact on query latency, P99 tail latency (not averages), index degradation over weeks of operation, and total cost of ownership at 100GB+. Miss any one of these and you're evaluating a demo, not a database.

How Filters Fragment the Graph

HNSW — Hierarchical Navigable Small World — is the index behind most vector databases. It builds a multi-layer graph where each vector connects to its approximate nearest neighbors. Search starts at the top layer, descends through increasingly dense layers, following greedy hops until it converges on an answer. Elegant. Fast. Completely unprepared for boolean predicates.

Say you filter for brand = 'luxury'. The database masks all non-luxury vectors. Connections that served as critical shortcuts now point to hidden nodes. The traversal hits dead ends. What was a tight graph search degrades into restarts, backtracking, and at its worst, a linear scan of whatever subset survived the filter.
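
A toy sketch makes the failure concrete. Everything below (the chain-shaped graph, the node ids, the `allowed` set standing in for the filter) is invented for illustration; real HNSW layers are far denser, but the dead-end mechanic is the same:

```python
def greedy_search(graph, vectors, query, entry, allowed=None):
    """Greedy best-first walk over one proximity-graph layer.
    `allowed` is the set of node ids that survive the metadata filter
    (None means unfiltered). Masked nodes are invisible, so every edge
    through them becomes a dead end. Toy sketch, not real HNSW."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    current, best = entry, dist(vectors[entry], query)
    visited = {entry}
    while True:
        improved = False
        for nbr in graph[current]:
            if nbr in visited:
                continue
            visited.add(nbr)
            if allowed is not None and nbr not in allowed:
                continue  # edge points at a masked node: dead end
            d = dist(vectors[nbr], query)
            if d < best:
                current, best, improved = nbr, d, True
        if not improved:
            return current, best

# A chain 0-1-2-3; node 2 is the only bridge toward the true answer.
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
vecs = {i: (float(i),) for i in range(4)}

best_unfiltered, _ = greedy_search(chain, vecs, (3.0,), entry=0)
# Masking node 2 strands the walk at node 1, even though node 3
# passes the filter and is the true nearest match.
best_filtered, _ = greedy_search(chain, vecs, (3.0,), entry=0,
                                 allowed={0, 1, 3})
```

The filtered walk never reaches node 3: the bridge node failed the predicate, so the path through it vanished. That is the graph-island effect in four nodes.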

The numbers: systems that bolt filtering on top of HNSW — LanceDB with IVF_PQ, OpenSearch, pgvector — show tail latencies spiking to 200–300ms when filters are active. At 10% selectivity (only 10% of vectors match), recall drops because the graph can't route around the 90% that's been removed. You asked for the ten nearest luxury items, but the graph only found six before it ran out of connected nodes.

This is structural, not configurable. B-tree indexes handle boolean predicates natively — removing rows doesn't fragment the tree. HNSW graphs lose their navigability when you carve out nodes. No amount of parameter tuning changes that.

Three Strategies, Three Trade-offs

| Approach | Mechanics | Win | Cost |
| --- | --- | --- | --- |
| Pre-filter | Apply the metadata filter via an inverted index, then run ANN on the surviving subset | Correct results guaranteed | Large subsets degrade to a linear scan; filter evaluation runs on millions of rows before search even starts |
| Post-filter | Run ANN on the full index, discard non-matching results | Fast search, dead-simple implementation | Restrictive filters gut your result set — request top-10, get back 2 |
| In-algorithm | Modify the HNSW graph itself to respect metadata during traversal | Recall and speed preserved under filtering | Vendor-specific, harder to debug, limited to databases that invested in this |
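
The first two rows can be sketched in a few lines. The brute-force scan below stands in for the ANN step, and `pred`, `overfetch`, and the toy data are all assumptions for illustration:

```python
import numpy as np

def brute_knn(index_vecs, query, k):
    # Exact scan stands in for the ANN step of either strategy.
    dists = np.linalg.norm(index_vecs - query, axis=1)
    return np.argsort(dists)[:k]

def pre_filter(vecs, meta, pred, query, k):
    """Filter first, then search only survivors: results always match,
    but the scan cost scales with the surviving subset."""
    ids = np.array([i for i, m in enumerate(meta) if pred(m)])
    return ids[brute_knn(vecs[ids], query, k)].tolist()

def post_filter(vecs, meta, pred, query, k, overfetch=3):
    """Search the full index, then discard non-matches: cheap, but a
    restrictive filter can leave you with fewer than k results."""
    cand = brute_knn(vecs, query, k * overfetch)
    return [int(i) for i in cand if pred(meta[i])][:k]

# 20 one-dimensional vectors; only every fifth passes the filter.
vecs = np.arange(20.0).reshape(20, 1)
meta = list(range(20))
is_match = lambda m: m % 5 == 0
query = np.array([0.0])

pre = pre_filter(vecs, meta, is_match, query, k=3)    # three real hits
post = post_filter(vecs, meta, is_match, query, k=3)  # comes up short
```

Even with 3x over-fetch, the post-filter path returns two results where three were requested: the restrictive-filter failure from the table, reproduced in miniature.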

Qdrant adds intra-category links so the graph stays connected within filtered subsets. Weaviate's ACORN algorithm does two-hop expansions to skip masked nodes without severing traversal paths. Pinecone merges its vector and metadata indexes into a single-stage lookup — on highly selective 1% filters, their search actually runs 35% faster than unfiltered queries because the reduced candidate set shrinks the work.

The most radical approach skips separate filtering entirely. You encode metadata as weighted one-hot dimensions appended to the embedding vector itself, with a scaling factor (α=10) that pushes non-matching vectors far enough away in the embedding space that they naturally fall outside top-k results. Clever for low-cardinality fields. Impractical once you have hundreds of attribute values.
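
A minimal sketch of that encoding, assuming a two-category field and Euclidean distance (the `augment` helper and `cats` list are made up for illustration):

```python
import numpy as np

ALPHA = 10.0  # the article's scaling factor (α=10)

def augment(vec, category, categories):
    """Append a weighted one-hot slot for `category` to the embedding.
    `categories` is the full list of values for one low-cardinality field."""
    onehot = np.zeros(len(categories))
    onehot[categories.index(category)] = ALPHA
    return np.concatenate([vec, onehot])

cats = ["luxury", "budget"]
query = augment(np.array([0.0, 0.0]), "luxury", cats)
lux = augment(np.array([2.0, 0.0]), "luxury", cats)   # same category, 2.0 away
budg = augment(np.array([0.0, 0.0]), "budget", cats)  # identical embedding, wrong category
```

A category mismatch adds sqrt(2) x ALPHA (about 14.1 here) to the distance, so `budg` falls behind `lux` despite having the identical embedding. The cost is also visible: each attribute value consumes a dimension, which is why this breaks down past a handful of values.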

What This Means for Your Stack

If you're standing up a RAG pipeline or recommendation engine that will grow beyond a few million vectors, here's the uncomfortable pre-purchase checklist:

Profile your queries. What fraction include metadata filters? If the answer is "all of them" — tenant isolation, RBAC, date ranges — you're shopping for a filtered search engine, not a similarity engine. Benchmark accordingly.
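
Profiling can start from whatever your gateway already logs. A minimal sketch, assuming each log entry is a dict with an optional `filter` key (adapt the field name to your own schema):

```python
def filtered_fraction(query_log):
    """Share of logged queries that carry a metadata predicate.
    The 'filter' key is a placeholder for whatever your gateway records."""
    return sum(1 for q in query_log if q.get("filter")) / len(query_log)

log = [
    {"vector": "...", "filter": {"tenant_id": "acme"}},
    {"vector": "...", "filter": {"status": "active"}},
    {"vector": "..."},
]
share = filtered_fraction(log)
```

If that number is close to 1.0, weight your evaluation entirely toward filtered benchmarks.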

Measure P99 under sustained load. Average latency masks the 10x spikes that happen when a filter creates a graph island. Run your evaluation for a week under continuous ingestion with concurrent readers, not an hour on a frozen dataset.
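
A skeleton for that measurement, assuming `query_fn` wraps your real filtered search call (the worker counts and the stub are illustrative):

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

def p99(samples):
    """P99 from raw latencies; tail quantiles need the samples, not the mean."""
    xs = sorted(samples)
    return xs[min(len(xs) - 1, int(0.99 * len(xs)))]

def hammer(query_fn, clients=8, per_client=200):
    """Fire `query_fn` from `clients` concurrent workers and return
    (mean, P99) latency in seconds. Plug your real client call in
    as `query_fn`."""
    def worker(_):
        out = []
        for _ in range(per_client):
            t0 = time.perf_counter()
            query_fn()
            out.append(time.perf_counter() - t0)
        return out
    lats = []
    with ThreadPoolExecutor(max_workers=clients) as ex:
        for chunk in ex.map(worker, range(clients)):
            lats.extend(chunk)
    return statistics.mean(lats), p99(lats)
```

Run it against the live system for days, not minutes, with ingestion going in parallel; the gap between mean and P99 under that load is exactly the number vendor benchmarks leave out.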

Question whether you need a dedicated vector store. pgvectorscale on PostgreSQL hit 471 QPS in one evaluation versus 41 QPS from a specialized vector database on identical hardware. If your vectors already live beside relational data you're filtering on, colocating them kills the metadata sync problem and the network hop in one move.

The Awkward Conclusion

The vector database market raised hundreds of millions of dollars on the thesis that similarity search is a specialized problem requiring specialized infrastructure. For pure, unfiltered ANN at billion-vector scale, that thesis holds. But the moment you add a WHERE clause — and in production, you always do — you're back in territory where relational databases have had three decades of optimization.

Sometimes the right architecture is just SELECT * FROM docs WHERE tenant = $t ORDER BY embedding <=> $query LIMIT 10. No separate vector store. No metadata sync pipeline. No 3 AM page because the filter index drifted.