There's a reliable tell for RAG tutorials written before 2024: they explain the pipeline as query → embed → top-k vector search → stuff into prompt → generate. That pattern is Naive RAG. It works fine for simple factual questions against a clean, well-chunked knowledge base. It plateaus at around 70–80% precision for anything more complex.
By 2025, the field had moved to Advanced RAG (re-ranking, hybrid search, query decomposition) and then to Agentic RAG, where the LLM itself controls the retrieval workflow rather than being a passive consumer of whatever chunks the vector search returns. In 2026, the state of the art is Adaptive RAG — a query classifier that routes each question to the appropriate retrieval strategy based on its complexity. Most tutorials haven't caught up.
This guide covers the full evolution: where each pattern fits, how to choose between them, which frameworks make each viable in production, and the production gotchas that tutorials consistently skip. The audience is engineers who understand basic RAG and want the architectural upgrade path.
The RAG Evolution: Naive → Advanced → Agentic
It helps to understand each tier not as a replacement for the previous one, but as a response to its failure modes. Each generation was built because the prior one hit a ceiling.
| Paradigm | How it retrieves | Precision | Cost/query | Latency |
|---|---|---|---|---|
| Naive RAG | Single vector search, top-k chunks | ~70-80% | ~$0.001 | <1s |
| Advanced RAG | Hybrid search + re-ranking + query decomposition | ~85-90% | ~$0.005 | 2–3s |
| Agentic RAG | Agent controls retrieval loop, iterates on results | ~90-95% | ~$0.01–0.05 | 5–15s |
| Adaptive RAG | Classifier routes query to appropriate pipeline | ~90%+ avg | Weighted avg | Weighted avg |
Naive RAG gets you running quickly. The ceiling appears on multi-hop questions, ambiguous queries, and anything that requires reasoning across multiple documents. A question like "How did the approach to LLM fine-tuning differ between the 2023 and 2024 papers in this corpus?" requires finding two different documents, comparing them, and synthesizing the comparison — not something a single top-k retrieval handles well.
Advanced RAG closes much of the gap through better retrieval mechanics. Hybrid search (dense vectors + sparse BM25) improves recall on keyword-heavy queries that pure vector search misses. A re-ranker (a cross-encoder that scores query-document pairs directly rather than relying on cosine distance) significantly improves precision in the top results you actually feed to the LLM. Query decomposition breaks compound questions into atomic sub-queries before retrieval. These improvements are worth implementing before reaching for agentic complexity.
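To make the hybrid search step concrete, here is a minimal sketch of one common way to merge a dense ranking and a BM25 ranking: reciprocal rank fusion (RRF). The article does not say which fusion method to use, so RRF is an assumption here, and the document IDs are made up for illustration.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a commonly used default.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical results from the two retrievers for the same query.
dense_hits = ["doc_a", "doc_b", "doc_c"]   # vector search order
sparse_hits = ["doc_c", "doc_a", "doc_d"]  # BM25 order

fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print(fused)
```

Documents that both retrievers rank highly rise to the top, and the fused list is what you would hand to the re-ranker.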
Agentic RAG breaks the single-retrieval assumption entirely. The LLM is no longer downstream of the retrieval step — it drives it. It decides when to retrieve, formulates the retrieval query, evaluates whether the results are sufficient, and retrieves again if they're not. This is what makes it capable of multi-hop reasoning but also what makes it expensive: each retrieval iteration adds tokens and latency.
For a deeper look at the mechanics of Advanced RAG specifically, the RAG architecture guide covers chunking strategies, embedding choices, and hybrid search implementation. The vector databases comparison covers the infrastructure decisions underneath. This article focuses on the agentic layer above both.
How Agentic RAG Works
The core shift is from a pipeline to an agent loop. In a standard RAG pipeline, the retrieval step is fixed: you retrieve once, you use what you get. In Agentic RAG, the agent loop looks like this:
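```python
# Sketch of the agent loop. llm_plan_retrieval, vector_search,
# llm_evaluate_results, and llm_synthesize are placeholders for your own
# LLM and retrieval calls.
MAX_ITERATIONS = 3

def agentic_rag(query: str) -> str:
    retrieved_so_far = []   # evidence accumulated across rounds
    search_hint = query     # refined each round; the original query is kept

    for _ in range(MAX_ITERATIONS):
        # Step 1: Agent reasons about what to retrieve next
        retrieval_query = llm_plan_retrieval(search_hint, retrieved_so_far)

        # Step 2: Execute retrieval as a tool call
        results = vector_search(retrieval_query, top_k=5)

        # Step 3: Agent evaluates results: are they sufficient?
        evaluation = llm_evaluate_results(query, results, retrieved_so_far)
        retrieved_so_far += results
        if evaluation.sufficient:
            break

        # Iterate: reformulate the query and retrieve again
        search_hint = evaluation.refined_query

    # Synthesize from everything gathered (best effort if the cap was hit)
    return llm_synthesize(query, retrieved_so_far)

# Agent receives a query
answer = agentic_rag("What are the tradeoffs between approach A and approach B in doc corpus?")
```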
Each iteration of this loop involves at least one LLM call (the planning/evaluation step) plus one retrieval call. For queries that require three retrieval iterations, you're running 4–7 LLM calls instead of one: three to six for planning and evaluation, plus one for synthesis. That's the cost profile. Latency accumulates the same way, because each iteration has to finish before the next one starts, which is why you need a plan for handling timeouts and latency SLAs before shipping agentic RAG to production users.
The agentic design patterns described in the production agents guide apply here: the agent loop embodies ReAct (reason, act, observe), retrieval is a tool call, and the evaluation step is reflection. Agentic RAG isn't a separate discipline from general agent engineering — it's a specific application of the same patterns, with retrieval as the primary tool.
Four Core Agentic RAG Patterns
Within the broad category of Agentic RAG, four distinct patterns have emerged as production staples. They're not mutually exclusive — production systems often combine two or three.
Self-RAG (introduced in Asai et al., 2023) makes explicit a decision that most RAG systems skip: whether retrieved content is actually useful before it goes into the answer. It does this with four self-reflection checks (the paper's reflection tokens are Retrieve, IsRel, IsSup, and IsUse; the descriptive names below map to them in order):
- IsRetrieve — Does this query even need external retrieval, or can the LLM answer from parametric knowledge? Skipping unnecessary retrieval reduces latency and cost for simple questions.
- IsRelevant — Are the retrieved chunks actually relevant to the query? Irrelevant chunks are discarded rather than included in the prompt, preventing them from confusing the generation step.
- IsSupportive — Is the generated answer actually supported by the retrieved evidence? If not, the model is asked to revise or flag the uncertainty.
- IsUseful — Is the final answer helpful to the user? This gate catches technically correct but incomplete or unhelpfully vague answers.
Production tip: Pure Self-RAG with trained reflection tokens requires a fine-tuned model. With off-the-shelf LLMs, you can approximate it by wrapping each step in an explicit evaluation prompt that asks the same questions and using structured outputs (JSON with is_relevant: true/false) to gate the pipeline. LlamaIndex has a SelfRAGQueryEngine if you would rather start from an existing implementation; check which model it expects before adopting it.
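The structured-output gate from the production tip can be sketched as follows. This is an illustration rather than a reference implementation: `filter_relevant`, the prompt wording, and `fake_judge` are all hypothetical, and the judge stands in for a real LLM call so the example runs on its own.

```python
import json
from typing import Callable

def filter_relevant(query: str, chunks: list[str],
                    judge: Callable[[str], str]) -> list[str]:
    """Keep only chunks the judge marks as relevant (the IsRelevant gate)."""
    kept = []
    for chunk in chunks:
        prompt = (
            "Return JSON of the form {\"is_relevant\": true|false}.\n"
            f"Query: {query}\nChunk: {chunk}"
        )
        try:
            verdict = json.loads(judge(prompt))
        except json.JSONDecodeError:
            continue  # unparseable verdict: treat the chunk as not relevant
        if isinstance(verdict, dict) and verdict.get("is_relevant") is True:
            kept.append(chunk)
    return kept

# Stand-in judge for demonstration only; a real system would call an LLM here.
def fake_judge(prompt: str) -> str:
    chunk = prompt.split("Chunk: ", 1)[1]
    return json.dumps({"is_relevant": "attention" in chunk.lower()})

chunks = [
    "Attention weights are computed with softmax.",
    "The cafeteria opens at 9am.",
]
relevant = filter_relevant("How is attention computed?", chunks, fake_judge)
print(relevant)
```

Failing closed on malformed JSON is a design choice: a chunk the judge could not assess is dropped rather than passed to generation. The IsSupportive and IsUseful checks can follow the same shape with a different prompt and key.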
Vector search finds semantically similar text. Graph RAG finds structurally related knowledge. The distinction matters when the knowledge in your corpus is connected — entities that reference each other, research papers that build on each other, codebases where functions call other functions.
The Graph RAG pipeline looks like this: ingest documents → extract entities and relationships using an LLM → build a knowledge graph (nodes are entities, edges are relationships) → at query time, identify the relevant subgraph → summarize community clusters → generate the answer from graph-retrieved context.
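The query-time half of that pipeline (identify the relevant subgraph) can be sketched with a plain adjacency map and a breadth-first walk. The triples, the `build_graph` and `neighbourhood` helpers, and the two-hop limit are assumptions for illustration; production systems typically use a graph store, and entity extraction and community summarization are not shown.

```python
from collections import deque

# Hypothetical (subject, relation, object) triples an LLM extraction step might emit.
triples = [
    ("Transformer", "uses", "Self-Attention"),
    ("Self-Attention", "variant", "Multi-Head Attention"),
    ("BERT", "based_on", "Transformer"),
    ("ResNet", "uses", "Skip Connections"),
]

def build_graph(triples):
    """Build an undirected adjacency map: entity -> set of neighbouring entities."""
    graph: dict[str, set[str]] = {}
    for subj, _rel, obj in triples:
        graph.setdefault(subj, set()).add(obj)
        graph.setdefault(obj, set()).add(subj)
    return graph

def neighbourhood(graph, seeds, max_hops=2):
    """Return all entities within max_hops of any seed entity (breadth-first)."""
    seen = {s for s in seeds if s in graph}
    frontier = deque((s, 0) for s in seen)
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop limit
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return seen

graph = build_graph(triples)
subgraph = neighbourhood(graph, ["Self-Attention"], max_hops=2)
print(sorted(subgraph))
```

Starting from "Self-Attention", the walk reaches the transformer-related entities and leaves the unrelated ResNet cluster out, which is the behaviour that vector similarity alone does not give you.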
The community summarization step is what makes Graph RAG uniquely powerful for certain query types. For a question like "What are the main themes across all papers in this corpus that discuss transformer attention mechanisms?", vector search returns the most semantically similar individual chunks. Graph RAG can surface the connected cluster of entities and relationships that collectively address the question — a qualitatively different kind of answer.
When Graph RAG isn't worth it: Graph construction is expensive — you're running an LLM over every document to extract entities and relationships, which adds significant cost and time to your ingestion pipeline. For simple, factual QA against a static knowledge base, the overhead rarely pays off. Graph RAG earns its place when queries are inherently relational, when your corpus has strong entity co-occurrence, and when users ask questions that span multiple documents through shared concepts.
Adaptive RAG is the architectural pattern that makes everything else economically viable in production. The core insight: not all queries are equal, and treating them as if they are wastes either money (running agentic RAG on simple questions) or quality (running naive RAG on complex questions).
The classifier is typically a small, fast model (or a prompted call to the LLM with structured output) that categorizes queries into:
- Direct — factual questions the LLM can answer from parametric knowledge without retrieval. "What is attention in transformers?" → direct answer.
- Single-hop — questions that require one retrieval round against a specific document or chunk. "What did the Q4 2025 earnings report say about revenue growth?" → single vector search.
- Multi-hop — questions requiring synthesis across multiple sources or iterative retrieval. "How have the engineering team's stated priorities evolved across the last three annual reports?" → agentic retrieval loop.
The routing logic can be as simple as a few-shot prompted classifier or as sophisticated as a fine-tuned routing model. In practice, a well-crafted few-shot prompt that asks "How many separate sources or reasoning steps are needed to answer this question?" achieves 85%+ routing accuracy on most workloads.
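A minimal sketch of the routing layer, assuming the three categories above. The keyword classifier is a toy stand-in for the few-shot prompt or small model, and the handlers only return labels; `route_query`, `toy_classifier`, and the fallback to single-hop are hypothetical choices, not part of any framework.

```python
from typing import Callable

ROUTES = ("direct", "single_hop", "multi_hop")

def route_query(query: str, classify: Callable[[str], str],
                handlers: dict[str, Callable[[str], str]]) -> str:
    """Classify the query, then dispatch to the matching pipeline."""
    label = classify(query)
    if label not in ROUTES:
        label = "single_hop"  # assumed fallback for unrecognized classifier output
    return handlers[label](query)

# Toy keyword classifier standing in for a few-shot LLM prompt or a small model.
def toy_classifier(query: str) -> str:
    q = query.lower()
    if any(w in q for w in ("evolved", "compare", "across", "differ")):
        return "multi_hop"
    if any(w in q for w in ("report", "document", "say about")):
        return "single_hop"
    return "direct"

handlers = {
    "direct": lambda q: f"[llm only] {q}",
    "single_hop": lambda q: f"[one retrieval round] {q}",
    "multi_hop": lambda q: f"[agentic loop] {q}",
}

print(route_query("What is attention in transformers?", toy_classifier, handlers))
```

Keeping the classifier behind a plain callable means you can swap the toy version for a prompted LLM call or a fine-tuned model without touching the dispatch code.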
Multi-Agent RAG makes sense when different parts of the retrieval task require genuinely different expertise or access to different knowledge sources. A financial research assistant might use one agent to retrieve market data, another to retrieve regulatory filings, and a third to retrieve analyst commentary, with the three running in parallel and an orchestrator synthesizing their results. The parallelism alone can recoup some of the latency overhead compared to sequential agentic retrieval.
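The parallel fan-out can be sketched with `asyncio.gather`. The `retrieve` coroutine is a stand-in for a specialist agent (the sleep simulates I/O), and the source names and delays are invented for the example.

```python
import asyncio

async def retrieve(source: str, query: str, delay: float) -> list[str]:
    """Stand-in for one specialist agent; the sleep simulates I/O latency."""
    await asyncio.sleep(delay)
    return [f"{source}: result for '{query}'"]

async def orchestrate(query: str) -> list[str]:
    # Launch all specialists concurrently; wall-clock time is roughly the
    # slowest agent rather than the sum of all three.
    batches = await asyncio.gather(
        retrieve("market_data", query, 0.03),
        retrieve("regulatory_filings", query, 0.02),
        retrieve("analyst_commentary", query, 0.01),
    )
    # gather preserves argument order, so the merged list is deterministic.
    return [item for batch in batches for item in batch]

evidence = asyncio.run(orchestrate("ACME outlook"))
print(evidence)
```

The orchestrator's synthesis step would then take `evidence` as its context; that call is omitted here.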
The same caution from the production agents guide applies: don't reach for multi-agent architecture unless single-agent genuinely can't do the job. The coordination overhead — shared state, routing logic, inter-agent communication — only pays off when the task benefits from specialization or parallelism that single agents can't provide.
Framework Comparison: LangGraph vs LlamaIndex vs AutoGen vs CrewAI
The framework you choose shapes how easily you can implement each of the patterns above. None is universally best — the right choice depends on whether retrieval or orchestration is the core of your system.
In production in 2026, the LlamaIndex + LangGraph combination is the most commonly deployed stack for sophisticated Agentic RAG: LlamaIndex handles the retrieval infrastructure (indexing, chunking, re-ranking, query engines), LangGraph handles the agent orchestration layer (routing, state management, conditional branching). They interoperate cleanly and the observability story is solid through LangSmith.
For a broader comparison of these frameworks across general agentic workloads (not just RAG), see the AI agent frameworks guide. For the LLM evaluation patterns that should sit above any of these frameworks, the LLM evaluation guide covers the practical eval infrastructure you'll need before shipping any of this to production.
When to Use Which Pattern
The decision isn't just about query complexity — it's a matrix of complexity, cost sensitivity, latency requirements, and hallucination tolerance.
Pattern selection guide
The decision tree collapses to one question in practice: what does your query distribution actually look like? Profile your incoming queries before choosing a pattern. If 70% are simple factual lookups, you don't need Agentic RAG — you need Adaptive RAG so those simple queries stay cheap while the complex ones get the treatment they need.
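To see why the query mix matters, here is a quick back-of-the-envelope calculation using the per-query costs from the comparison table. It assumes the midpoint of the agentic range ($0.03) and that the simple 70% can be served at Naive RAG cost; both are illustrative assumptions, not measurements.

```python
# Per-query costs in USD from the comparison table; the agentic figure is the
# midpoint of the quoted $0.01-0.05 range (an assumption for illustration).
COST = {"naive": 0.001, "advanced": 0.005, "agentic": 0.03}

def blended_cost(mix: dict[str, float]) -> float:
    """Expected cost per query for a routing mix whose shares sum to 1."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("shares must sum to 1")
    return sum(share * COST[pipeline] for pipeline, share in mix.items())

adaptive = blended_cost({"naive": 0.7, "agentic": 0.3})
all_agentic = blended_cost({"agentic": 1.0})
print(f"adaptive: ${adaptive:.4f}/query, all-agentic: ${all_agentic:.4f}/query")
```

Under these assumptions the routed mix comes to about $0.0097 per query against $0.03 for sending everything through the agentic loop, roughly a third of the cost.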
Production Gotchas
The patterns above are the happy path. Here are the failure modes that show up once real users start hitting your system.
Cost management
Agentic RAG cost is highly query-dependent. A simple query that the agent resolves in one retrieval round costs roughly the same as Advanced RAG. A complex query that triggers four retrieval rounds with full re-ranking at each round can cost 20–40x more. You need per-request cost tracking, not just aggregate cost monitoring. Set hard limits on the number of retrieval iterations (3 is usually right; 5 is rarely justified). Build in early-exit logic: if the agent's self-evaluation scores are consistently high after two rounds, stop iterating.
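Per-request tracking with a hard cap and an early exit might look like the sketch below. `RetrievalBudget` is a hypothetical helper; the $0.05 ceiling and the 0.8 confidence threshold are assumed values you would tune for your own workload.

```python
from dataclasses import dataclass, field

@dataclass
class RetrievalBudget:
    """Tracks spend for one request and decides whether another round is allowed."""
    max_rounds: int = 3
    max_cost_usd: float = 0.05      # assumed per-request ceiling
    confidence_exit: float = 0.8    # assumed self-evaluation threshold
    rounds: int = 0
    cost_usd: float = 0.0
    scores: list[float] = field(default_factory=list)

    def record(self, round_cost_usd: float, self_eval_score: float) -> None:
        """Log one completed retrieval round."""
        self.rounds += 1
        self.cost_usd += round_cost_usd
        self.scores.append(self_eval_score)

    def should_continue(self) -> bool:
        if self.rounds >= self.max_rounds or self.cost_usd >= self.max_cost_usd:
            return False  # hard limits
        # Early exit: the last two rounds both scored above the threshold.
        recent = self.scores[-2:]
        if len(recent) == 2 and all(s >= self.confidence_exit for s in recent):
            return False
        return True

budget = RetrievalBudget()
budget.record(round_cost_usd=0.01, self_eval_score=0.85)
budget.record(round_cost_usd=0.01, self_eval_score=0.90)
print(budget.should_continue(), budget.rounds, round(budget.cost_usd, 4))
```

Creating one budget object per request gives you the per-request cost record for free: log `rounds` and `cost_usd` when the request finishes.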
Latency budgets
In agentic pipelines the latency comes from the loop rather than from a single generation call. Each retrieval round adds 200–500ms for the vector search plus another 300–800ms for re-ranking, and each round also waits on an LLM call. Three rounds plus three LLM calls lands you at 5–15 seconds for complex queries. Most applications can't present that to users as a synchronous response. The practical patterns: stream intermediate results as they become available (showing "Searching for relevant sources..." progress), run independent retrievals in parallel where possible, and use async jobs for queries that will clearly need multiple iterations.
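As a worked example of the budget arithmetic, the helper below adds up the vector search and re-ranking figures quoted above for rounds that run one after another. The function name and signature are illustrative.

```python
def retrieval_latency_ms(rounds: int,
                         search_ms: tuple[int, int] = (200, 500),
                         rerank_ms: tuple[int, int] = (300, 800)) -> tuple[int, int]:
    """Best- and worst-case retrieval time when rounds run sequentially."""
    low = rounds * (search_ms[0] + rerank_ms[0])
    high = rounds * (search_ms[1] + rerank_ms[1])
    return low, high

low, high = retrieval_latency_ms(rounds=3)
print(f"retrieval alone: {low}-{high} ms")  # 1500-3900 ms for three rounds
```

Subtracting that range from your end-to-end SLA tells you how much time is left for the LLM calls, which is a useful number to have before deciding between synchronous responses and async jobs.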
Evaluation
RAG evaluation is hard because the right answer depends on both retrieval quality (did we get the right chunks?) and generation quality (did we synthesize them correctly?). You need to evaluate both layers independently. Retrieval metrics: recall@k (are the relevant chunks in the top k?), MRR (mean reciprocal rank), NDCG. Generation metrics: faithfulness (is the answer grounded in the retrieved context?), answer relevance, completeness. Without separate eval for each layer, you can't tell whether quality regressions come from the retrieval side or the generation side.
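The retrieval-side metrics are simple enough to compute by hand. The sketch below uses binary relevance labels (a chunk is either relevant or not); the chunk IDs are made up, and MRR is the mean of `reciprocal_rank` over your evaluation queries.

```python
import math

def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant items that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)

def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """NDCG with binary relevance: DCG of the ranking over DCG of an ideal one."""
    dcg = sum(1.0 / math.log2(i + 1)
              for i, doc_id in enumerate(ranked[:k], start=1)
              if doc_id in relevant)
    ideal = sum(1.0 / math.log2(i + 1)
                for i in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0

ranked = ["c7", "c2", "c9", "c4"]   # retriever output, best first
relevant = {"c2", "c4"}             # ground-truth relevant chunks

print(recall_at_k(ranked, relevant, k=3))
print(reciprocal_rank(ranked, relevant))
print(round(ndcg_at_k(ranked, relevant, k=4), 4))
```

Generation metrics such as faithfulness need an LLM or human judge and are not covered by this sketch.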
Context window management
Each retrieval round adds more chunks to the context window. By iteration three, a 128k-token context window fills up fast if chunks are large and re-ranking is returning five chunks per round. Implement dynamic context compression: summarize earlier retrieval rounds rather than keeping full chunks, prioritize chunks by relevance score, and use a sliding window approach that keeps the most relevant evidence in context while compressing older retrievals.
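The "prioritize chunks by relevance score" part can be sketched as a greedy selection under a token budget. The chunk dictionaries, scores, and the 2,000-token budget are hypothetical; summarizing earlier rounds would be a separate LLM step that is not shown.

```python
def select_chunks(chunks: list[dict], token_budget: int) -> list[dict]:
    """Greedily keep the highest-scoring chunks that fit in the token budget."""
    selected, used = [], 0
    for chunk in sorted(chunks, key=lambda c: c["score"], reverse=True):
        if used + chunk["tokens"] <= token_budget:
            selected.append(chunk)
            used += chunk["tokens"]
    return selected

# Hypothetical chunks gathered over three rounds; "tokens" would come from
# your tokenizer and "score" from the re-ranker.
gathered = [
    {"id": "r1-a", "tokens": 900, "score": 0.62},
    {"id": "r2-a", "tokens": 700, "score": 0.91},
    {"id": "r3-a", "tokens": 800, "score": 0.85},
    {"id": "r3-b", "tokens": 400, "score": 0.40},
]
kept = select_chunks(gathered, token_budget=2000)
print([c["id"] for c in kept])
```

Here the large, middling chunk from round one is the one that gets dropped, while a small low-scoring chunk still fits. Whether to fill leftover budget with weak chunks or leave it empty is a judgment call worth testing against your faithfulness metric.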
Hallucination in synthesis
Agentic RAG reduces hallucination relative to a single-pass LLM call, but synthesis across multiple retrieved sources introduces a new failure mode: the agent combines two chunks that are actually about different things and generates a claim that's supported by neither. Implement citation grounding: require the agent to attribute each claim to a specific chunk by ID, and flag any claim without a citation for human review. This single practice catches a large share of synthesis hallucinations.
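A minimal citation check, assuming the agent is prompted to cite chunks inline as `[chunk_id]` before the sentence's closing punctuation. That citation format, the regex, and the naive sentence splitter are all assumptions made for this sketch.

```python
import re

CITATION = re.compile(r"\[([A-Za-z0-9_-]+)\]")

def ungrounded_sentences(answer: str, chunk_ids: set[str]) -> list[str]:
    """Return sentences that cite no known chunk ID (candidates for review)."""
    flagged = []
    # Naive split on ., !, ? followed by whitespace; adequate for a sketch.
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        if not sentence:
            continue
        cited = set(CITATION.findall(sentence))
        if not (cited & chunk_ids):
            flagged.append(sentence)
    return flagged

answer = (
    "Approach A reduces latency [c1]. "
    "Approach B lowers cost [c9]. "
    "Both are widely used."
)
flagged = ungrounded_sentences(answer, chunk_ids={"c1", "c2"})
print(flagged)
```

The check flags both the uncited sentence and the one citing `c9`, an ID that was never retrieved. It only verifies that a citation exists and points at a real chunk; confirming that the chunk actually supports the claim is the job of the faithfulness check.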
Building Your First Agentic RAG System
If you're starting from scratch in 2026, here's the implementation path that gets you to a production-viable system without over-engineering from day one:
- Start with Advanced RAG. Implement hybrid search (dense + BM25), add a cross-encoder re-ranker on top of your vector search results, and instrument your evaluation pipeline before you add any agentic complexity. Most use cases never need to go further.
- Add a query classifier. Before reaching for a full agentic loop, add a simple routing classifier. Profile your actual query distribution. If most queries are single-hop, Adaptive RAG routing gives you agentic quality on complex queries and Advanced RAG cost on simple ones.
- Implement agentic retrieval for multi-hop queries only. Using LlamaIndex's ReActAgent with a retrieval tool, build the iterative retrieval loop for the queries your classifier flags as multi-hop. Cap iterations at 3. Log every retrieval round.
- Add Self-RAG evaluation gates. Wrap your synthesis step with a relevance check and a faithfulness check using structured LLM outputs. Discard chunks that fail IsRelevant. Flag answers that fail IsSupportive for human review rather than returning them silently.
- Consider Graph RAG only if entity relationships drive your use case. Graph ingestion is expensive to run and to maintain. Don't add it speculatively.
The skill set employers look for in 2026 for engineers building agentic RAG systems in production:
The AI Skills hub has structured learning paths for each of these areas. For the broader context of where RAG fits in AI engineering skill stacks, the top AI/ML skills guide for 2026 covers what hiring managers at AI-first companies are actually screening for. And if you're comparing RAG-based approaches against fine-tuning or prompt engineering for your specific use case, this comparison guide maps out the decision criteria.