PageIndex hit 98.7% accuracy on FinanceBench. Traditional vector RAG hovers around 50% on the same benchmark. That gap is not a typo; it is a paradigm difference, and it tells you something about how retrieval actually works for structured documents.
What Vectorless RAG Actually Does Differently
The pitch is simple: instead of embedding your documents into high-dimensional vectors and hoping cosine similarity finds what you need, you build a hierarchical tree index that an LLM navigates through reasoning.
Think about how you actually read a 200-page financial filing. You don't scan every sentence for semantic similarity to your question. You check the table of contents, jump to the relevant section, read the surrounding paragraphs, and pull your answer. PageIndex, the open-source framework from VectifyAI, automates exactly that workflow.
The retrieval path looks like this:
```
User Query
→ LLM examines document tree (table-of-contents-style hierarchy)
→ Reasons about which branches are relevant
→ Drills into specific nodes (pages/sections)
→ Extracts context with full structural awareness
→ Generates answer
```
No embedding model. No vector database. No chunking strategy debates. No approximate nearest-neighbor index tuning.
Each node in the tree carries a title, page range, summary, and child nodes. The whole structure is serialized as JSON, and the LLM walks it, deciding at each level whether to go deeper. It's reasoning-based retrieval: the model isn't matching vibes, it's following logic.
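To make the idea concrete, here is a minimal sketch of that traversal. The field names (`title`, `pages`, `summary`, `nodes`) mirror the description above but the exact JSON schema is an assumption, and the keyword check stands in for the LLM's relevance reasoning, which is the part PageIndex actually delegates to a model.

```python
# Toy PageIndex-style tree. Field names are assumptions based on the
# description above, not the framework's actual schema.
tree = {
    "title": "10-K Filing",
    "pages": [1, 200],
    "summary": "Annual report",
    "nodes": [
        {"title": "Risk Factors", "pages": [10, 35],
         "summary": "Material risks to the business", "nodes": []},
        {"title": "Financial Statements", "pages": [80, 140],
         "summary": "Income statement, balance sheet, cash flows", "nodes": []},
    ],
}

def navigate(node, query_terms, path=()):
    """Depth-first walk, collecting branches a reasoner would open.
    Here relevance is mocked as keyword overlap with title + summary;
    in PageIndex an LLM makes this call."""
    hits = []
    text = (node["title"] + " " + node["summary"]).lower()
    if any(term in text for term in query_terms):
        hits.append((*path, node["title"]))
    for child in node.get("nodes", []):
        hits.extend(navigate(child, query_terms, (*path, node["title"])))
    return hits

print(navigate(tree, ["balance sheet"]))
# [('10-K Filing', 'Financial Statements')]
```

Note that the result is a path through the hierarchy, not a bag of chunks, which is what makes the retrieval traceable.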
The 98.7% Number Deserves Scrutiny
FinanceBench is a benchmark of questions about SEC filings — structured, well-organized documents with clear section hierarchies. This is exactly the kind of content where tree-based navigation shines. GPT-4o scores around 31% on the same benchmark when doing naive retrieval. Perplexity hits roughly 45%.
PageIndex's 98.7% is impressive, but context matters. Financial filings are among the most structurally predictable documents on earth. They follow SEC templates. They have standardized section headings. They are, in a word, indexable.
The harder question: what happens when your corpus is a thousand Slack threads, a mix of markdown docs with inconsistent headings, and a pile of PDFs from three different vendors with no table of contents?
That's where the story gets more complicated.
Where Tree Indexing Falls Apart
Vector search has a genuine advantage in messy, multi-document environments. When you're searching across 50,000 loosely related documents with no consistent structure, semantic similarity is actually the right retrieval primitive. You don't have a tree to navigate — you have a swamp.
Specific failure modes for vectorless approaches:
Unstructured content — chat logs, emails, wiki pages with no heading hierarchy give the tree builder nothing to work with
Cross-document queries — "which vendor offered the lowest price?" across 40 proposals requires scanning all of them, not navigating one tree
Real-time corpus updates — rebuilding a tree index on every document change is expensive; vector indexes handle incremental updates more gracefully
Scale — reasoning through a tree with an LLM costs tokens per query. At 10,000 queries per hour, you're burning serious money on inference
The VectifyAI team acknowledges this implicitly. PageIndex is optimized for "professional documents: financial reports, regulatory filings, academic textbooks, legal/technical manuals." These are exactly the documents where structure is guaranteed.
The Practical Middle Ground
If you're running production RAG infrastructure, the move isn't to rip out your vector store. It's to recognize that retrieval is not one problem.
For structured, high-stakes documents — compliance filings, contracts, technical specs — a reasoning-based approach like PageIndex can dramatically improve accuracy. For broad knowledge bases with diverse, loosely structured content, vector retrieval with a good chunking strategy remains the better fit.
The hybrid pattern emerging in production looks something like:
| Document Type | Retrieval Strategy | Why |
|---|---|---|
| SEC filings, contracts | Tree-based (PageIndex) | Predictable structure, high accuracy needed |
| Internal wikis, docs | Hybrid vector + keyword | Mixed structure, broad queries |
| Chat/email archives | Vector similarity | No structure to navigate |
| Code repositories | AST-aware + vector | Structure exists but is domain-specific |
Some teams are already running both: PageIndex for their regulated document corpus, pgvector or Qdrant for everything else, with a routing layer that picks the retrieval strategy based on document type.
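A routing layer like the one just described can be very small. This sketch is illustrative: the document types and strategy names are assumptions, not any real API, and in production the classification step is usually driven by metadata set at ingestion time.

```python
# Minimal retrieval router: map document type to a strategy, defaulting
# to plain vector search for anything unclassified. All names here are
# illustrative assumptions.
ROUTES = {
    "sec_filing": "tree",        # PageIndex-style reasoning retrieval
    "contract": "tree",
    "wiki": "hybrid",            # vector + keyword
    "chat": "vector",
    "code": "ast_vector",        # AST-aware chunking + vector
}

def route(doc_type: str) -> str:
    """Pick a retrieval strategy for a document type."""
    return ROUTES.get(doc_type, "vector")

print(route("sec_filing"))  # tree
print(route("email"))       # vector (fallback)
```

The useful property is that each corpus keeps the retrieval primitive that suits it, behind one interface.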
What This Means If You're Building RAG Pipelines
Three concrete takeaways for anyone maintaining retrieval infrastructure:
First, your chunking strategy matters less than you think — but only for structured documents. If you've been agonizing over chunk size and overlap for well-organized PDFs, PageIndex sidesteps that entire problem. The tree structure preserves natural document boundaries. No more splitting a table across two chunks and watching your retrieval quality collapse.
Second, the cost profile is different. Vector retrieval is cheap at query time — it's a similarity lookup. Tree-based reasoning burns LLM tokens for every query. On the flip side, you don't need an embedding model or a vector database, which cuts your infrastructure bill. The crossover point depends on query volume. Below a few hundred queries per day, PageIndex is probably cheaper. Above that, the inference costs start to bite.
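The crossover logic above is easy to sanity-check with back-of-envelope arithmetic. Every dollar figure below is an assumed placeholder, not a measured price; the point is the shape of the comparison, with tree-based retrieval paying per query and vector retrieval paying mostly fixed infrastructure.

```python
# Assumed, illustrative costs -- substitute your own numbers.
TREE_COST_PER_QUERY = 0.02      # $ of LLM tokens spent navigating the tree
VECTOR_FIXED_MONTHLY = 150.0    # $ for embedding model + vector DB hosting
VECTOR_COST_PER_QUERY = 0.0005  # $ per similarity lookup

def monthly_cost(queries_per_day, per_query, fixed=0.0):
    """Total monthly bill: fixed infrastructure plus per-query spend."""
    return fixed + queries_per_day * 30 * per_query

# Crossover: queries/day at which the two monthly bills are equal.
crossover = VECTOR_FIXED_MONTHLY / (
    30 * (TREE_COST_PER_QUERY - VECTOR_COST_PER_QUERY)
)
print(round(crossover))  # 256 queries/day under these assumed numbers
```

Under these placeholder figures the break-even lands in the low hundreds of queries per day, consistent with the rough guidance above, but the real crossover moves with your model pricing and tree depth.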
Third, explainability just got real. When PageIndex returns an answer, it traces to specific pages and sections through a documented reasoning chain. Try getting that from a vector similarity score of 0.82. For compliance-heavy industries where "the AI said so" doesn't cut it, this traceability is a genuine differentiator.
The Setup Is Dead Simple (Which Is Suspicious)
Getting PageIndex running takes about five minutes. From inside the cloned repository:

```shell
pip3 install --upgrade -r requirements.txt
python3 run_pageindex.py --pdf_path /path/to/filing.pdf --model gpt-4o
```
It generates a JSON tree, which you then query through LLM reasoning. The framework supports multiple LLM providers via LiteLLM, so you're not locked into OpenAI.
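Once you have the JSON tree, the query step amounts to handing it to an LLM as context. This sketch shows one way to build that prompt; the wording and the assumption that the tree is a plain nested dict are mine, and PageIndex's own prompting may differ.

```python
import json

def build_navigation_prompt(tree: dict, question: str) -> str:
    """Assemble a prompt asking an LLM to pick relevant nodes from the
    tree index. Prompt wording is an illustrative assumption."""
    toc = json.dumps(tree, indent=2)
    return (
        "You are navigating a document via its tree index.\n"
        f"Tree:\n{toc}\n\n"
        f"Question: {question}\n"
        "Return the titles of the nodes to open, most relevant first."
    )

# Tiny example tree standing in for run_pageindex.py's output.
tree = {"title": "Filing", "nodes": [{"title": "Revenue", "nodes": []}]}
prompt = build_navigation_prompt(tree, "What was total revenue?")
print("Revenue" in prompt)  # True
```

From there the model's chosen node titles map back to page ranges, which is where the extraction step picks up.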
That simplicity should make you nervous. Production RAG isn't hard because of the retrieval algorithm — it's hard because of document ingestion, format normalization, access control, versioning, monitoring, and the hundred other things that happen before a query hits your index. PageIndex solves retrieval elegantly but says nothing about the rest of your pipeline.
So What
Vectorless RAG is a real improvement for a specific class of documents. It's not a vector database killer — the two approaches solve different problems, and most production systems will run both. The teams that benefit most are those already drowning in structured regulatory or financial documents where vector similarity was never the right tool.
If that's you, try PageIndex on one document corpus this week. If your documents look more like a wiki than a filing cabinet, keep your vectors and watch this space.