Naive RAG was a great demo. It was a terrible production system. We've watched dozens of teams ship "RAG-powered chat", then lose users within two weeks because the answers were wrong, citations were broken, and queries that needed multiple lookups returned garbage.
Agentic RAG fixes this. It's not a buzzword - it's a different architecture. Here's what vanilla RAG gets wrong:
- Single retrieval pass: Real questions need 2-5 lookups. Vanilla RAG does one.
- No query rewriting: User typos, jargon, and ambiguity destroy embedding similarity. Need a planner to rewrite.
- No verification: The model hallucinates 'based on the docs' even when the docs don't say that. No check, no truth.
- No structured output for follow-ups: The next question needs the previous answer's structure. Vanilla RAG returns prose only.
What agentic RAG looks like in production
1. Planner step (cheap model)
First call: take the user query, decompose it into 1-N sub-queries, and decide what tools to call (search, SQL, calculator, web). Use Haiku or Gemini Flash here - this is a routing decision, not a reasoning task.
2. Parallel retrieval with reranking
Run sub-queries in parallel against vector + BM25 hybrid search. Rerank with Cohere Rerank 3.5 or Jina Reranker. Top-K=20 candidates per sub-query, kept tight to control context window.
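The parallel fan-out above can be sketched in a few lines. This is a minimal illustration, not a fixed API: the `Chunk` shape is an assumption, and `hybridRetrieve` stands in for whatever vector + BM25 search you run per sub-query.

```typescript
// Sketch: run sub-queries in parallel, then dedupe by chunk id,
// keeping the best score when the same chunk matches several queries.
type Chunk = { id: string; text: string; score: number };

async function retrieveAll(
  subQueries: string[],
  hybridRetrieve: (q: string) => Promise<Chunk[]>, // your search backend
  topK = 20,
): Promise<Chunk[]> {
  const perQuery = await Promise.all(subQueries.map(hybridRetrieve));
  const seen = new Map<string, Chunk>();
  for (const chunk of perQuery.flat()) {
    const prev = seen.get(chunk.id);
    if (!prev || chunk.score > prev.score) seen.set(chunk.id, chunk);
  }
  return [...seen.values()]
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}
```

Deduping before the reranker matters: sub-queries overlap, and duplicate chunks waste both reranker calls and context window.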
3. Synthesis with citation gates
Now call the frontier model (Opus or GPT-5) with retrieved chunks. Prompt enforces: every claim must include a citation. Response is parsed for [src:N] tokens before returning to user.
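The citation gate is a pure parsing step, so it can sit outside any LLM call. A minimal sketch, assuming the `[src:N]` convention described above with 1-indexed chunk references:

```typescript
// Sketch: extract [src:N] tokens and reject responses whose citations
// are missing or point outside the retrieved chunk list.
function checkCitations(
  answer: string,
  numChunks: number,
): { ok: boolean; cited: number[] } {
  const cited = [...answer.matchAll(/\[src:(\d+)\]/g)].map(m => Number(m[1]));
  const ok = cited.length > 0 && cited.every(n => n >= 1 && n <= numChunks);
  return { ok, cited };
}
```

If `ok` is false, regenerate with a stricter prompt or fall back; never return an uncited answer to the user.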
4. Verifier pass
Second LLM call (cheap) checks: does every cited chunk actually support the claim? If not, the response is regenerated or escalated. This is the step that kills hallucinations.
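One way to structure that verifier, as a sketch: the claims and the `judge` function are assumptions, where `judge` stands in for the cheap LLM call that answers "does this chunk support this claim?".

```typescript
// Sketch: check every claim against its cited chunk in parallel.
// `judge` is a placeholder for a cheap LLM call returning true/false.
type Claim = { text: string; sourceIndex: number };

async function verify(
  claims: Claim[],
  chunks: string[],
  judge: (claim: string, chunk: string) => Promise<boolean>,
): Promise<{ passed: boolean; failed: Claim[] }> {
  const verdicts = await Promise.all(
    claims.map(c => judge(c.text, chunks[c.sourceIndex] ?? "")),
  );
  const failed = claims.filter((_, i) => !verdicts[i]);
  return { passed: failed.length === 0, failed };
}
```

Returning the failed claims (not just a boolean) lets the regeneration prompt target exactly what was unsupported.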
5. Memory + follow-up
Cache the planner's decomposition and retrieved chunks for the next turn. Agentic RAG handles follow-ups in 1-2 LLM calls instead of starting over.
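The per-conversation cache can be as simple as a keyed map. A minimal sketch; the `TurnState` shape and function names are assumptions, and a production version would live in Redis or your workflow engine's state rather than process memory:

```typescript
// Sketch: keep the last turn's decomposition and chunks per conversation
// so a follow-up can skip re-planning and re-retrieval.
type TurnState = { subQueries: string[]; chunks: string[] };

const turnCache = new Map<string, TurnState>();

function saveTurn(conversationId: string, state: TurnState): void {
  turnCache.set(conversationId, state);
}

function priorTurn(conversationId: string): TurnState | undefined {
  return turnCache.get(conversationId);
}
```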
How to ship this without losing your mind
Start with the planner only
Adding query decomposition + parallel retrieval is 80% of the win for 20% of the work. Ship that first. Add the verifier in v2.
Use Inngest or Temporal for the agent loop
Don't write your own state machine. Workflow engines handle retries, timeouts, and observability for free.
Log every step to Langfuse or Braintrust
Without per-step traces you cannot debug agentic systems. Period. This is non-negotiable.
Keep a kill switch
Feature-flag the agentic path so you can fall back to vanilla RAG if the planner misbehaves. We've used this 3 times in 18 months.
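The kill switch is just a flag check plus a catch-all fallback. A sketch under stated assumptions: `flags`, `agentic`, and `vanilla` are placeholders for your flag provider and the two pipelines.

```typescript
// Sketch: feature-flagged agentic path with automatic fallback.
// If the flag is off, or the agentic pipeline throws, serve vanilla RAG.
async function answer(
  query: string,
  flags: { agenticRag: boolean },
  agentic: (q: string) => Promise<string>,
  vanilla: (q: string) => Promise<string>,
): Promise<string> {
  if (!flags.agenticRag) return vanilla(query);
  try {
    return await agentic(query);
  } catch {
    // Planner misbehaving: degrade gracefully instead of failing the request.
    return vanilla(query);
  }
}
```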
Production agentic RAG checklist
1. Planner with sub-query decomposition
2. Hybrid search (vector + BM25)
3. Reranker (Cohere or Jina)
4. Citation enforcement in prompts
5. Verifier pass on outputs
6. Inngest / Temporal for orchestration
7. Langfuse / Braintrust for traces
8. Per-step latency budgets + circuit breakers
9. Fallback to vanilla RAG behind feature flag
10. Eval harness with 100+ golden questions
"Vanilla RAG was a pattern. Agentic RAG is a system."
Vanilla RAG vs agentic RAG
| Stage | Vanilla RAG | Agentic RAG |
|---|---|---|
| Query handling | Single embedding lookup | Planner decomposes into 1-N sub-queries |
| Retrieval | Top-K vector search only | Hybrid (vector + BM25 + filters) with reranking |
| Synthesis | Stuff context, prompt LLM | Citation-enforced prompt + structured output |
| Verification | None | Verifier pass checks claim-citation alignment |
| Follow-ups | Re-runs from scratch | Reuses planner + cached retrieval |
| Hallucination rate | 30-60% | <10% (verifier-gated) |
| Multi-hop accuracy | ~50% | 85-95% |
The planner step (cheap model, big impact)
```typescript
// Planner runs on a cheap model (Haiku, ~$0.0001/call).
import { z } from "zod";

const decomposition = await haiku.complete({
  system: PLANNER_PROMPT,
  messages: [{ role: "user", content: query }],
  schema: z.object({
    sub_queries: z.array(z.string()).min(1).max(5),
    tools_needed: z.array(z.enum(["search", "sql", "calc", "web"])),
    requires_recent_data: z.boolean(),
  }),
});

// Fan out the sub-queries in parallel against hybrid retrieval.
const results = await Promise.all(
  decomposition.sub_queries.map(q => hybridRetrieve(q))
);
```
What every agentic RAG system needs
Hybrid retrieval
Vector + BM25 + metadata filters. One alone misses too much. Combine and rerank.
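One common way to combine the two result lists is reciprocal rank fusion (RRF), which merges by rank rather than by raw score, so vector and BM25 scores don't need to be calibrated against each other. A sketch; `k = 60` is the constant from the original RRF paper and worth tuning for your corpus:

```typescript
// Sketch: reciprocal rank fusion. Each input is a ranked list of
// document ids (e.g. one from vector search, one from BM25).
function rrfFuse(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, rank) => {
      // Documents ranked highly in any list get the most credit.
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}
```

The fused list then goes to the reranker, which does the fine-grained ordering.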
Cohere or Jina reranker
Top-K=20 candidates, reranker shrinks to top-5. Single biggest accuracy lift.
Citation enforcement
Prompt requires [src:N] tokens. Parse before returning. Drops hallucination by 4-5x.
Verifier loop
Second LLM call checks claim-source alignment. Failed responses are regenerated or escalated.
We rebuild legacy RAG into agentic systems.
Most rebuilds take 4-8 weeks. Typical accuracy lift: 70% → 95% on multi-hop questions.
Agentic RAG checklist
- Planner decomposes queries into sub-questions (cheap model, huge impact).
- Hybrid retrieval (vector + BM25) beats either alone.
- Reranker is the single biggest precision lift you'll add.
- Verifier pass kills hallucinations on factual claims.
- Cache the planner output for follow-up turns.
The future is agentic
Vanilla RAG was a pattern. Agentic RAG is a system. The teams shipping the best AI products in 2026 stopped treating RAG as 'embed-search-respond' and started treating it as a multi-step program where the LLM is one of several primitives.
If you're getting 70-80% accuracy and your users are frustrated, you've outgrown vanilla RAG. The fix is architectural, not prompt engineering.
The IRPR engineering team ships production software for 50+ countries. Idea → Roadmap → Product → Release. 200+ products live.