The PageIndex Question: Tree-Based Retrieval, Real Trade-Offs, and the Hybrid That Actually Ships

cmdev

It was a screenshot, posted on a Tuesday afternoon, that crossed every engineering feed in a single week.

PageIndex (Vectorless RAG): 98.7% on FinanceBench. GPT-4o Search: 31%.

The replies were mostly the same shape. "The end of chunking." "Vector RAG is over." "Why are we still embedding documents?" A senior research engineer at a hedge fund told us he had four people on his team forwarding the screenshot to him by Wednesday morning. Each had reached the same conclusion. Each was about to run the wrong experiment.

The number is real. The framework is real. The architecture is genuinely new — a reasoning-based retrieval over hierarchical document structure, with no embeddings, no chunking, and no vector database. The GitHub repo has crossed twenty-three thousand stars. The paper is on arXiv. The company behind it, VectifyAI, has published a reproducible evaluation harness. The number is also incomplete in a way that matters if you are paying for the infrastructure.

This piece takes the claim apart on engineering terms. What PageIndex actually does. What the 98.7 per cent includes and excludes. What the marketing leaves out — latency, cost shape, document-class fit, the failure mode at scale. And the hybrid architecture that the honest independent reviewers keep landing on, which is what any team running production retrieval against a real corpus will end up building.

Key takeaways

  • PageIndex segments documents along natural section boundaries and builds a tree; retrieval is an LLM walking the tree node by node. No embeddings, no chunking, no vector store — genuinely a different architecture, not a wrapper.
  • The 98.7% FinanceBench score is real and narrowly defined: single-document QA over SEC filings (a highly structured class). Vector RAG scores ~50% and GPT-4o Search scores 31% on the same benchmark.
  • The marketing leaves out three trade-offs: 2-5 seconds of latency per query (versus milliseconds for vector search), higher per-query cost (every retrieval step is an LLM call), and no graceful path to corpus-scale retrieval across thousands of documents.
  • Tree-based retrieval dominates SEC filings, legal contracts, academic and technical manuals, and audit-defensible single-document workflows. Vector RAG remains right for corpus discovery, latency-sensitive applications, cost-sensitive high-volume workloads, and heterogeneous content.
  • The production architecture is not vector versus tree — it is vector into tree. Vector for corpus discovery, tree for within-document precision, intent classifier routing between them, audit trail recording both paths.

How an analyst opens a 10-K

The architecture is two phases.

The indexing phase parses a PDF or Markdown document. Rather than chunking it into arbitrary 1,000-character spans, it segments the document along natural section boundaries: the table of contents, headings, page ranges. It then builds a tree where each node is a section, holding { title, id, page range, short summary }. No embeddings get computed. No vector store gets touched.
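To make the node shape concrete, here is a minimal sketch of the indexing idea, not PageIndex's actual code. It assumes the headings have already been extracted as (level, title, start page) tuples; `SectionNode`, `build_tree`, and the sample filing outline are hypothetical names and data. Summaries are left blank, since how they get written is outside this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class SectionNode:
    """One section of the document tree: title, id, page range, short summary."""
    node_id: str
    title: str
    start_page: int
    end_page: int
    summary: str = ""
    children: list["SectionNode"] = field(default_factory=list)


def build_tree(headings: list[tuple[int, str, int]], last_page: int) -> SectionNode:
    """Nest (level, title, start_page) headings, given in document order, into a tree."""
    root = SectionNode("0", "Document", 1, last_page)
    stack = [(0, root)]  # (heading level, node) along the current path from the root
    for i, (level, title, page) in enumerate(headings, start=1):
        node = SectionNode(str(i), title, page, last_page)
        # Pop back up until the top of the stack is a shallower heading: that is the parent.
        while stack[-1][0] >= level:
            stack.pop()
        stack[-1][1].children.append(node)
        stack.append((level, node))
    _close_page_ranges(root)
    return root


def _close_page_ranges(node: SectionNode) -> None:
    """End each section on the page where its next sibling starts (pages may be shared)."""
    for current, following in zip(node.children, node.children[1:]):
        current.end_page = following.start_page
    for child in node.children:
        child.end_page = min(child.end_page, node.end_page)
        _close_page_ranges(child)


# Hypothetical outline of a 120-page filing.
headings = [
    (1, "Item 1. Business", 3),
    (1, "Item 7. MD&A", 40),
    (2, "Results of Operations", 42),
    (2, "Liquidity and Capital Resources", 55),
    (1, "Item 8. Financial Statements", 70),
]
tree = build_tree(headings, last_page=120)
```

The stack holds the path from the root to the most recent heading, so each new heading attaches to the nearest shallower one. Nothing here touches an embedding model or a vector store, which is the point of the design.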

The retrieval phase, when a query arrives, sends an LLM to walk the tree. It reads the root summary, decides which child to descend into, repeats at the next level, and stops when it has the leaf node containing the answer. The "search" is sequential reasoning across the document's actual structure, not similarity over embedded chunks.
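The walk itself can be sketched in a few lines. This is an illustration under stated assumptions rather than the library's implementation: `Node`, `walk`, and `keyword_chooser` are hypothetical names, and the chooser is a crude word-overlap heuristic standing in for the LLM call that would make each descent decision.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Node:
    title: str
    summary: str
    pages: tuple[int, int]
    children: list["Node"] = field(default_factory=list)


# A chooser takes the query and the candidate children and returns the index to descend into.
Chooser = Callable[[str, list[Node]], int]


def keyword_chooser(query: str, children: list[Node]) -> int:
    """Hypothetical stand-in for the LLM navigator: pick the child sharing most words with the query."""
    words = set(query.lower().split())

    def overlap(node: Node) -> int:
        return len(words & set(f"{node.title} {node.summary}".lower().split()))

    return max(range(len(children)), key=lambda i: overlap(children[i]))


def walk(root: Node, query: str, choose: Chooser) -> tuple[Node, list[str]]:
    """Descend from the root to a leaf, one decision per level; return the leaf and the path taken."""
    node, path = root, [root.title]
    while node.children:
        node = node.children[choose(query, node.children)]
        path.append(node.title)
    return node, path


# Hypothetical two-level tree for a filing.
root = Node("10-K", "annual report", (1, 120), [
    Node("Business", "products markets and competition", (3, 40)),
    Node("Management Discussion", "operating results liquidity and cash flow", (40, 70), [
        Node("Results of Operations", "revenue and margin by segment", (42, 55)),
        Node("Liquidity", "cash flow debt and capital resources", (55, 70)),
    ]),
])

leaf, path = walk(root, "how much cash flow did operations generate", keyword_chooser)
```

Each iteration of the loop depends on the node chosen in the previous one. The returned path doubles as a record of which sections were visited on the way to the leaf.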

This is the way a financial analyst opens a 10-K filing. You do not memorise every paragraph of the document. You look at the table of contents, jump to the section that looks right, scan its headers, jump to the page that has the number. PageIndex is automating that exact behaviour with an LLM as the navigator. It is the difference between a librarian who walks you to the right shelf and one who tries to remember every page of every book on every shelf.

It is genuinely a different architecture, not a wrapper over vector search.

The 98.7 per cent is real — and narrowly defined

The benchmark is FinanceBench: a standard test of open-book question-answering over SEC filings — 10-K, 10-Q, 8-K. The evaluation runs the system over a curated set of financial questions, each with a known correct answer and the exact page it should have come from.

VectifyAI's evaluation harness is public (VectifyAI/Mafin2.5-FinanceBench). The 98.7 per cent is reproducible. Traditional vector RAG on the same benchmark scores around 50 per cent. GPT-4o with web search scores 31 per cent.

Three things to hold steady about that number before extrapolating from it.

The benchmark is single-document question answering. Each query is paired with a specific filing and the system answers from that filing. There is no corpus retrieval — no "find the right document among 50,000 contracts" — only "find the right page inside this one document."

The document class is highly structured. SEC filings have stable, deep tables of contents, predictable section numbering, regulator-driven taxonomy. Tree traversal works extraordinarily well on these. It also works on legal contracts, academic textbooks, technical manuals, and anything with a real structural hierarchy.

The benchmark scores accuracy. It does not score latency, cost per query, or token consumption. None of those numbers appear in the leaderboard. The leaderboard is a stopwatch for accuracy. There is no stopwatch for the wallet.

The 98.7 per cent is a real win for a real class of problem. The mistake is reading it as "vector RAG is over." The benchmark does not measure the problem vector RAG is actually solving in most production deployments.

What the post did not show

Three trade-offs that any team planning to deploy this needs to model honestly are absent from the LinkedIn post.

The waiter who has to ask the head chef every time

The independent tests that exist all land in the same range: two to five seconds per query for PageIndex, depending on document complexity. Vector similarity search runs in milliseconds.

The reason is structural, not implementation-level. Every step of the tree traversal is an LLM call. A query that needs to descend three levels through a 10-K is making three serial LLM calls before the system can even read the answer page, and another to synthesise the response. None of those calls are parallelisable in the obvious way — the descent decisions depend on each prior decision. Picture a waiter who has to ask the head chef before bringing each course, then must wait for the previous answer before asking the next question. The food is excellent. Dinner takes four hours.
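The arithmetic behind that can be written down directly. In this sketch the per-call and lookup latencies are illustrative assumptions, not measurements, and the function names are hypothetical. Only the call counts come from the description above: one call per level of descent plus one to synthesise.

```python
def tree_query_latency_ms(depth: int, ms_per_llm_call: int) -> tuple[int, int]:
    """Serial LLM calls: one per level of descent, plus one to synthesise the answer."""
    calls = depth + 1
    return calls, calls * ms_per_llm_call


def vector_query_latency_ms(lookup_ms: int, ms_per_llm_call: int) -> tuple[int, int]:
    """One index lookup followed by a single answer-generation call."""
    return 1, lookup_ms + ms_per_llm_call


# Assumed figures for illustration only: 800 ms per LLM call, 20 ms per vector lookup.
tree_calls, tree_ms = tree_query_latency_ms(depth=3, ms_per_llm_call=800)
vector_calls, vector_ms = vector_query_latency_ms(lookup_ms=20, ms_per_llm_call=800)
```

Under these assumed figures the three-level descent costs four serial calls and lands inside the two-to-five-second range that the independent tests report. A deeper tree adds one more full call per level.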

For a chatbot, two to five seconds is acceptable. For a real-time customer-facing application — a banking portal answering balance enquiries, a support widget retrieving article snippets — it is not. The latency budget has to be designed around the retrieval architecture, not retrofitted to it.

Renting versus owning the bookshelf

Vector RAG amortises its cost. You pay once to embed the corpus — a one-shot job — and then the per-query cost is essentially the vector store lookup (cheap) plus one LLM call to generate the answer. Embedding storage is small. The unit economics work. The bookshelf is built once and stays built.

PageIndex inverts this. There is no embedding step, so no upfront cost. But every retrieval is multiple LLM calls. On a high-traffic deployment, you are running inference for retrieval itself, not just answer generation. Independent tests consistently report that PageIndex costs more per query than vector RAG by a meaningful multiple, though VectifyAI has not published token-consumption numbers and the public repository's README does not include cost analysis. You are not buying a bookshelf. You are paying a researcher to walk into the library and find the page for every question, every time.
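A toy cost model shows the shape of the difference. Every number below is a made-up placeholder, since no token-consumption figures have been published. It also makes two simplifications: navigation calls are assumed to cost the same as a synthesis call, and vector lookup and storage are treated as free.

```python
def total_cost_cents(queries: int, llm_calls_per_query: int,
                     cents_per_call: int, upfront_cents: int = 0) -> int:
    """Total spend: optional one-off indexing cost plus per-query LLM calls."""
    return upfront_cents + queries * llm_calls_per_query * cents_per_call


def first_query_where_tree_costs_more(upfront_cents: int, extra_calls: int,
                                      cents_per_call: int) -> int:
    """Smallest query count at which tree retrieval's total strictly exceeds vector RAG's."""
    return upfront_cents // (extra_calls * cents_per_call) + 1


# Placeholder assumptions, not measured prices.
CENTS_PER_CALL = 1
EMBEDDING_UPFRONT_CENTS = 5_000   # one-shot embedding job for the corpus
VECTOR_CALLS = 1                  # answer generation only
TREE_CALLS = 4                    # three descent steps plus synthesis
QUERIES = 100_000

vector_total = total_cost_cents(QUERIES, VECTOR_CALLS, CENTS_PER_CALL, EMBEDDING_UPFRONT_CENTS)
tree_total = total_cost_cents(QUERIES, TREE_CALLS, CENTS_PER_CALL)
crossover = first_query_where_tree_costs_more(
    EMBEDDING_UPFRONT_CENTS, TREE_CALLS - VECTOR_CALLS, CENTS_PER_CALL
)
```

With these placeholders, tree retrieval is cheaper only for the first few thousand queries, before the upfront embedding cost has been amortised. After the crossover the gap grows linearly with traffic.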

The cost shape we wrote about for multimodal RAG pipelines applies here too: the architecture's true cost only becomes visible at production volume, and by then it is structural.

A library, not a book

The benchmark is one-document QA. The architecture does not extend gracefully to thousands of documents.

If you have a corpus of fifty thousand support articles, an enterprise wiki, a legal-document repository, or any large unstructured collection, PageIndex is not the right tool. Vector search excels at this exact problem — which document, among many, is relevant — because it is doing approximate nearest-neighbour over embeddings, which scales to billions of items with sub-100ms latency.
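For contrast, corpus discovery reduces to "score every document against the query and keep the best few". The sketch below is a deliberately tiny stand-in. Word-count vectors replace a learned embedding model, and an exhaustive scan replaces an approximate nearest-neighbour index. The corpus and function names are hypothetical.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: word counts. A real system would call an embedding model here."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_k(query: str, corpus: dict[str, str], k: int = 2) -> list[str]:
    """Return the ids of the k documents most similar to the query (exhaustive scan)."""
    q = embed(query)
    ranked = sorted(corpus, key=lambda doc_id: cosine(q, embed(corpus[doc_id])), reverse=True)
    return ranked[:k]


# Hypothetical three-article support corpus.
corpus = {
    "reset-password": "how to reset your account password",
    "refund-policy": "refund policy for annual subscriptions",
    "api-limits": "rate limits for the public api",
}
best = top_k("reset password", corpus, k=1)
```

The query never needs to know anything about a document's internal structure, which is why this step works on support articles, wikis, and other unstructured collections. In production the document vectors would be computed once and stored rather than rebuilt per query as they are here.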

Tree traversal cannot do this. You would have to first identify the relevant document by some other mechanism (in practice, vector search) and only then hand the document to PageIndex for the within-document retrieval. That is the hybrid pattern the independent reviewers keep arriving at, and the one the marketing does not surface.

When the PDF is not a clean filing

The open-source PageIndex parser takes the standard PDF-library path. It handles clean, well-structured filings well. It does not handle scanned documents, complex multi-column layouts, embedded tables that span pages, or charts with critical numeric data.

The repository's README acknowledges this and points to VectifyAI's paid cloud service for enhanced OCR and tree-building. That is a fair business model, but it is a closed surface — the marketing claim "no vector database" implicitly becomes "no vector database, plus a paid hosted parser for any document that is not a clean 10-K."

Where tree-based retrieval is the right answer

Subtract the corpus-scale problem and the latency-sensitive problem, and a specific class of workload remains where PageIndex genuinely dominates.

SEC filing analysis. Equity research, compliance review, financial due diligence. Long, structured documents. Accuracy matters more than latency. Cost per query is acceptable because the alternative is an analyst doing it by hand.

Legal contract review. Same structural properties — table of contents, defined sections, predictable hierarchy. The "find the indemnification clause" or "what does this say about termination" pattern is exactly what tree traversal handles.

Academic and technical manual QA. Engineering reference texts, regulatory handbooks, internal SOPs at depth. Stable structure. High accuracy bar.

Audit-defensible single-document retrieval. Because every retrieval is traceable back to specific page and section references, the answer is verifiable against the source. For regulated workflows where the auditor has to be able to trace any conclusion back to a specific paragraph, this is a strong fit. We laid out the verifiability pattern in Designing Strict RBAC for Enterprise Knowledge Bases — tree-based retrieval is a structural cousin of that approach.

Where vector RAG remains the right answer

Corpus retrieval at scale. Anything where the first question is "which document," not "where in this document."

Latency-sensitive interactive applications. Customer-facing chat, real-time support, anything with a sub-second response budget.

Cost-sensitive high-volume workloads. Where the per-query cost difference between vector and tree-traversal compounds into a budget problem inside the first quarter of production.

Heterogeneous content. Knowledge bases mixing structured documents, ad-hoc memos, transcripts, tickets, code snippets, and emails. None of these have the kind of clean hierarchical structure tree-based retrieval depends on.

Vector into tree, not vector versus tree

The honest independent reviewers — including the engineers who built and ran their own A/B tests against real corpora — keep arriving at the same conclusion. The production architecture is not vector versus tree. It is vector into tree.

The pattern, in five steps:

  1. The user query hits an intent classifier (cheap, fast, Haiku-tier) that decides whether the question is a corpus-discovery question or a within-document question
  2. Corpus-discovery questions route to vector search over the embedded corpus — fast, cheap, returns the top one or two documents
  3. Within-document questions, or the documents identified by step two, route to tree-based traversal for the deep retrieval inside the document
  4. The synthesis layer combines the retrieved evidence and generates the answer, with citations back to the page and section that tree traversal provides natively
  5. The audit trail records both the corpus-discovery path and the tree-traversal path — so every claim in the answer is traceable to a specific source location

This architecture inherits the strengths of both — vector for scale, tree for precision — and lets the cost and latency budget be governed by the type of question, not the limitations of a single retrieval mechanism. The librarian walks you to the right shelf; the researcher then finds the right page on that shelf. Neither tries to do the other's job.

It is also the only architecture we would put in front of an enterprise client who has to defend the system to a regulator. The marketing line "no vector layer" is correct only for the within-document phase. Production-grade retrieval needs both.

The screenshot is a forcing function, not an answer

A 98.7 per cent benchmark is a real engineering achievement, and PageIndex deserves the attention it is getting. The marketing reading — that this is "the end of chunking" or that vector RAG is obsolete — is not what the benchmark shows. The benchmark shows that for a specific class of single-document question answering over structured filings, reasoning-based traversal outperforms similarity-based retrieval. That is a real result. It does not mean every retrieval problem has the same shape.

The market for enterprise RAG is bifurcating along workload class, not along architecture preference. Teams shipping production AI into regulated environments are going to need to articulate which retrieval mechanism they are using for which kind of question, and why, and what the cost and latency shape of each looks like at production volume. Anyone who shows up with "we use [single architecture] for everything" is going to lose to the team that picked the right mechanism for each problem and documented the trade-offs explicitly.

The screenshot was useful. Not as an answer, but as a question — which is whether your retrieval architecture was ever really one architecture in the first place.

FAQs

Is PageIndex really faster or slower than vector RAG?

Slower at the per-query level, by a meaningful amount. Independent tests land in the 2-5 second range for PageIndex versus milliseconds for vector similarity search. The reason is structural: every level of tree descent is a serial LLM call, and the decisions are not parallelisable because each one depends on the previous one. For a chatbot, 2-5 seconds is acceptable; for a real-time customer-facing application, it is not.

Does the 98.7% number mean vector RAG is obsolete?

No. The benchmark is single-document QA over highly structured SEC filings — a narrow case where tree traversal genuinely dominates. It does not measure corpus retrieval ("which document among 50,000"), latency-sensitive workloads, or cost shape at production volume. Vector RAG remains the right answer for the problems the benchmark does not measure.

Where does tree-based retrieval clearly outperform vector RAG?

Document classes with stable hierarchical structure and an accuracy bar above the latency bar: SEC filing analysis, legal contract review, academic textbooks and technical manuals, audit-defensible single-document workflows. The common property is a real table of contents and predictable section taxonomy — the conditions under which tree traversal is reading the document the way a human analyst would.

What does the hybrid architecture actually look like?

An intent classifier (Haiku-tier, cheap and fast) routes the query. Corpus-discovery questions go to vector search for the top one or two documents. Within-document questions, or the documents identified by the corpus step, route to tree-based traversal. The synthesis layer combines the evidence with page and section citations. The audit trail records both paths so every claim is traceable.

What happens when the PDF is not a clean structured filing?

The open-source PageIndex parser handles clean filings well — it does not handle scanned documents, complex multi-column layouts, embedded tables spanning pages, or charts with critical numeric data. The README points to VectifyAI's paid cloud parser for these cases, which means the "no vector database" claim implicitly becomes "no vector database, plus a paid hosted parser for anything that is not a clean 10-K."

How to engage

We design and ship retrieval architectures for regulated enterprises — vector, tree-based, hybrid, and the audit-grade observability that makes any of them defensible. If you are evaluating tree-based retrieval against your specific corpus and workload, we will help you do the honest comparison before you commit to the architecture. Talk to us at creativeminds.dev/contact.
