Everyone picks their vector database based on latency benchmarks and API ergonomics. Nobody talks about what happens to the invoice three months into production when your embedding count crosses 10 million and your CFO starts asking questions.
The Sticker Price Lie
Managed vector database pricing looks deceptively simple. Pinecone's serverless model charges $16 per million Read Units on Standard, $24 on Enterprise. Qdrant Cloud runs $0.014/hour per node with no per-query fees. Weaviate Cloud charges roughly $0.095 per million vector dimensions per month.
Comparing these numbers directly is meaningless — they measure completely different things. The real cost drivers hide behind three multipliers that don't show up on any pricing page.
Multiplier #1: HNSW Storage Overhead
Every database using HNSW indexing — which is almost all of them — stores a graph structure alongside your raw vectors. That graph adds a 1.5x storage multiplier on top of your vector data. If you're storing 10 million 1536-dimension vectors, you're not paying for 10M × 1536 × 4 bytes = ~57 GB. You're paying for ~86 GB.
It gets worse with higher dimensions. Moving from text-embedding-3-small (1536 dims) to text-embedding-3-large (3072 dims) doubles your raw vector bytes — and because the 1.5x graph overhead rides on top of those bytes, your index RAM and storage bill double right along with them, while per-query compute also climbs with dimensionality. Most teams discover this after they've already committed to a model.
What makes this particularly painful is that the overhead isn't linear across providers. Pinecone abstracts storage behind Read Units, so you won't see the graph cost itemized — it's baked into the per-query price. Qdrant and Weaviate expose it more directly because you're paying for node resources, and you'll notice your RAM usage is way higher than your raw vector math predicted. The practical upshot: when you're capacity planning, take your back-of-envelope vector storage estimate and multiply by 1.5 before you even start comparing pricing tiers. Teams that skip this step end up hitting node memory limits months earlier than expected, triggering an unplanned upgrade that blows the quarterly infra budget.
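The multiply-by-1.5 rule is a one-liner worth keeping in a notebook. The 1.5x factor is the heuristic above, not a provider guarantee — actual overhead depends on HNSW parameters like M and ef_construction:

```python
# Back-of-envelope capacity planner for HNSW-indexed vector stores.
# Assumes float32 vectors and the ~1.5x graph-overhead multiplier
# discussed above; tune the factor once you've measured your own index.

HNSW_OVERHEAD = 1.5  # graph structure on top of raw vector bytes

def planned_storage_gib(num_vectors: int, dims: int, bytes_per_dim: int = 4) -> float:
    """Raw vector bytes times the HNSW multiplier, in GiB."""
    raw_bytes = num_vectors * dims * bytes_per_dim
    return raw_bytes * HNSW_OVERHEAD / 2**30

# 10M x 1536-dim float32 vectors: ~57 GiB raw -> ~86 GiB planned
print(round(planned_storage_gib(10_000_000, 1536), 1))  # → 85.8
```

Run this before you look at a single pricing tier, not after.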
Also worth noting: quantization helps. Both Qdrant and Weaviate support scalar and product quantization that can compress vectors by 4–8x. But quantization introduces recall degradation — typically 1–3% on standard benchmarks — and the HNSW graph itself doesn't shrink. You save on the vector storage portion, not the index overhead. It's a useful lever, not a silver bullet.
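A quick sketch of why quantization only roughly halves the bill rather than quartering it. The graph fraction is modeled here as the same 0.5x-of-raw overhead discussed above — an approximation, since real graph size depends on index settings:

```python
# Effect of scalar (int8, ~4x) quantization on total footprint: the vector
# payload shrinks, but the HNSW graph (modeled as 0.5x of the original
# float32 payload) does not. All ratios are illustrative assumptions.

def footprint_gib(num_vectors: int, dims: int, compression: float = 1.0,
                  graph_fraction: float = 0.5) -> float:
    raw = num_vectors * dims * 4        # float32 baseline, bytes
    vectors = raw / compression         # quantized vector payload
    graph = raw * graph_fraction        # HNSW links, unaffected by quantization
    return (vectors + graph) / 2**30

full = footprint_gib(10_000_000, 1536)                       # ~86 GiB
quantized = footprint_gib(10_000_000, 1536, compression=4)   # ~43 GiB
```

Under these assumptions, a 4x vector compression buys you a 2x total reduction — the lever-not-silver-bullet point in numbers.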
Multiplier #2: The Dimension Tax
The 2026 cost-performance sweet spot for most RAG workloads is text-embedding-3-small at 1536 dimensions. It costs $0.02 per million tokens, gets strong recall on standard retrieval benchmarks, and has native support across Pinecone, Qdrant, Weaviate, and pgvector.
Teams that default to 3072 dimensions "for better quality" rarely measure whether the recall improvement justifies the cost. In my experience, it doesn't — unless you're doing cross-lingual retrieval or your corpus has extreme semantic overlap.
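Using the per-dimension pricing quoted earlier for Weaviate Cloud as the yardstick, the dimension tax is easy to put in dollars:

```python
# Monthly storage cost under per-dimension pricing (the ~$0.095 per
# million stored vector dimensions figure cited above for Weaviate Cloud).

RATE_PER_MILLION_DIMS = 0.095  # USD per month

def monthly_dim_cost(num_vectors: int, dims: int) -> float:
    return num_vectors * dims / 1e6 * RATE_PER_MILLION_DIMS

small = monthly_dim_cost(10_000_000, 1536)  # ~$1,459/mo
large = monthly_dim_cost(10_000_000, 3072)  # ~$2,918/mo
```

That ~$1,450/month gap is the recurring price of "better quality" you never benchmarked.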
Multiplier #3: Index Rebuilds Nobody Budgets For
Change your embedding model? Switch dimensionality? Full index rebuild. At Pinecone, that runs $12–$40 per 10 million vectors. At 100M+ vectors needing migration from ada-002 to something newer, the reindex alone runs into the hundreds of dollars — and once you pay to re-embed the corpus on top, the total bill lands comfortably in four figures.
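A rough migration budget, combining the reindex fee above with re-embedding cost at text-embedding-3-small's $0.02 per million tokens. The 500 tokens-per-chunk figure is an assumption — plug in your own corpus statistics:

```python
# Migration cost = reindex fee (the $12–$40 per 10M vectors range cited
# above, pessimistic end by default) + re-embedding every chunk.

def migration_cost(num_vectors: int,
                   reindex_per_10m: float = 40.0,
                   embed_per_million_tokens: float = 0.02,
                   tokens_per_chunk: int = 500) -> float:
    reindex = num_vectors / 10_000_000 * reindex_per_10m
    embedding = num_vectors * tokens_per_chunk / 1e6 * embed_per_million_tokens
    return reindex + embedding

# 100M vectors: $400 reindex + $1,000 re-embedding
print(round(migration_cost(100_000_000)))  # → 1400
```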
This is the cost that kills experimentation. Teams get locked into their first embedding model because the migration cost creates inertia.
When Self-Hosting Starts Winning
The crossover point is well-documented now: at roughly 60–80 million queries per month against a 10 GB namespace, self-hosted Qdrant on a $96/month DigitalOcean droplet (16 GB RAM, 8 vCPU) becomes 3–10x cheaper than Pinecone Serverless.
Below that threshold, managed services make sense. Above it, you're paying a steep convenience tax. The catch is that self-hosting means owning your own uptime, handling shard rebalancing, and debugging memory pressure at 2 AM. Whether that tradeoff works depends on your team's ops maturity — and honestly, most teams overestimate theirs.

Running a vector database in production isn't like running Postgres, where decades of community knowledge and tooling smooth over the rough edges. Qdrant's documentation is good but the operational playbook is thin. You'll need to figure out your own backup strategy, your own monitoring thresholds for segment merges, and your own approach to rolling upgrades without dropping queries. If you have a platform team that already manages stateful workloads on Kubernetes, you're probably fine. If "self-hosted" means one developer SSH-ing into a droplet, think hard about what happens when that person is on vacation and the OOM killer fires.
For the middle ground — 10M to 50M vectors, moderate query load — Qdrant Cloud's per-node pricing tends to be the cheapest managed option. No per-query billing means your costs don't spike during traffic bursts.
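To find your own crossover point, compare Pinecone's Read Unit math against a fixed node price. The one-RU-per-query default below is an assumption — queries against larger namespaces consume multiple RUs, so measure your actual consumption before trusting this:

```python
# Pinecone Serverless read cost vs a fixed-price self-hosted node,
# as a function of monthly query volume. RU price and droplet price
# are the figures cited above; ru_per_query is an assumption.

RU_PRICE = 16 / 1_000_000   # USD per Read Unit, Standard tier
DROPLET_MONTHLY = 96.0      # 16 GB RAM / 8 vCPU DigitalOcean droplet

def pinecone_read_cost(queries_per_month: float, ru_per_query: float = 1.0) -> float:
    return queries_per_month * ru_per_query * RU_PRICE

for qpm in (1e6, 10e6, 60e6):
    managed = pinecone_read_cost(qpm)
    print(f"{qpm / 1e6:>4.0f}M queries/mo: Pinecone ~${managed:,.0f} vs droplet ${DROPLET_MONTHLY:.0f}")
```

At 60M queries the gap is roughly 10x — consistent with the crossover range above — and it widens fast if your queries burn more than one RU each.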
The Benchmark Numbers, Briefly
Since people always ask: on the standard 1M-vector / 1536-dim benchmark, p50 latencies in 2026 look like this:
| Database | p50 Latency | p99 Latency | Max Scale |
|---|---|---|---|
| Qdrant | 4 ms | 25 ms | Billions |
| Redis | 5 ms | 20 ms | ~100M (RAM-bound) |
| Milvus | 6 ms | 35 ms | Billions |
| Pinecone | 8 ms | 45 ms | Billions |
| Weaviate | 12 ms | 65 ms | Billions |
| pgvector | 18 ms | 90 ms | ~50M |
Redis wins on raw speed but hits a ceiling fast — it's RAM-bound, so you're paying memory prices for every vector. pgvector is slowest but lives inside Postgres, which means zero additional infrastructure. For most RAG applications serving under 100 concurrent users, everything in this table is fast enough.
The pgvector Question
Every team considers pgvector because it means not adding another database. Fair.
But understand the tradeoffs: pgvector maxes out around 50M vectors, indexing throughput is 1,000–5,000 vectors/second versus 8,000–20,000 for Qdrant, and p99 latency at 90 ms means you'll feel it in your RAG pipeline's tail latency. There's also the resource contention angle that people underestimate. Your vector similarity searches are competing for the same connection pool, CPU, and memory as your transactional queries. Run a big batch reindex on pgvector and watch your application's login endpoint slow to a crawl. Purpose-built vector databases don't have this problem because they're isolated by definition.
If your vector count stays under 5M and you're already on Postgres — just use pgvector. Seriously. The operational simplicity of not running a separate database outweighs the performance gap at that scale. Above 20M vectors, or if you need sub-10ms p99, move to a purpose-built solution.
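The rule of thumb above, written down as a function — the thresholds are this article's heuristics, not hard limits:

```python
# Encodes the decision rule discussed above: pgvector under 5M vectors
# if Postgres is already in the stack; purpose-built past 20M vectors
# or for sub-10ms p99 budgets. Thresholds are heuristics, not hard limits.

def pick_vector_store(num_vectors: int, p99_budget_ms: float,
                      already_on_postgres: bool) -> str:
    if num_vectors < 5_000_000 and already_on_postgres:
        return "pgvector"
    if num_vectors > 20_000_000 or p99_budget_ms < 10:
        return "purpose-built (e.g. Qdrant)"
    return "either; benchmark both"

print(pick_vector_store(3_000_000, 200, already_on_postgres=True))  # → pgvector
```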
What I'd Actually Do
New RAG project, March 2026: start with pgvector if you're under 5M vectors. Use text-embedding-3-small at 1536 dims. When you outgrow pgvector, migrate to Qdrant Cloud — the per-node pricing model is predictable and the performance is hard to argue with. Only go self-hosted if you're past 60M monthly queries and have someone who actually wants to operate it.
Budget 1.5x your raw vector storage for HNSW overhead. Budget $0 for index rebuilds by picking your embedding model carefully upfront. And stop comparing p50 latencies as if 4 ms versus 12 ms matters when your LLM call takes 800 ms anyway.