Short answer

Start with recursive character chunking at roughly 500–1000 tokens with 10–20% overlap, tuned to your embedding model's preferred input length. Use document-structure-aware chunking for anything that has headings or sections. Reach for semantic, hierarchical, or contextual chunking only after you've measured a retrieval-quality problem that simpler chunking can't fix. The biggest mistake is treating chunking as a "set it and forget it" choice — it's the most under-instrumented part of most RAG pipelines.

Most production RAG failures get blamed on the embedding model or the LLM. Research across enterprise deployments points to retrieval as the primary failure point — poor source data quality and suboptimal chunking are among the leading contributors to those retrieval failures. The LLM didn't hallucinate because the model is dumb; it hallucinated because the relevant fact lived halfway across two chunks, neither of which embedded cleanly enough to be retrieved. The embedding model didn't underperform; it was given a 4,000-token blob to embed when its training distribution was 256 tokens of focused content.

Chunking sits at the seam between two systems — the document store and the retriever — and it's the place where the most invisible quality losses happen. Below are the six chunking strategies that matter in production, when each one is the right move, and the mistakes that quietly tank retrieval before you even get to the interesting parts of the pipeline.

What Chunking Actually Does (and Why It's Harder Than It Looks)

Chunking is the step where a document is split into smaller pieces before being embedded into a vector database. Each chunk becomes one row in your index: an embedding vector representing the chunk's semantic content, the original text, and some metadata. At query time, the user's question gets embedded into the same vector space, and the top-K most similar chunks are pulled back and fed to the LLM as context.
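To make the "one row per chunk" picture concrete, here is a minimal sketch of an index row and a brute-force top-K lookup. The names `IndexRow`, `cosine`, and `top_k` are illustrative and not taken from any particular vector database, and the vectors are hand-written stand-ins for real embeddings.

```python
import math
from dataclasses import dataclass, field


@dataclass
class IndexRow:
    """One chunk as stored in the index: vector, original text, metadata."""
    vector: list[float]
    text: str
    metadata: dict = field(default_factory=dict)


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vector: list[float], rows: list[IndexRow], k: int = 3) -> list[IndexRow]:
    """Return the k rows most similar to the query (brute force, no ANN index)."""
    ranked = sorted(rows, key=lambda r: cosine(query_vector, r.vector), reverse=True)
    return ranked[:k]


# Hand-written 2-d vectors stand in for real embeddings (assumption).
rows = [
    IndexRow([1.0, 0.0], "refund policy", {"source": "policy.md"}),
    IndexRow([0.0, 1.0], "shipping times", {"source": "shipping.md"}),
    IndexRow([0.9, 0.1], "refund exceptions", {"source": "policy.md"}),
]
best = top_k([1.0, 0.0], rows, k=2)
```

A real system would replace the sorted scan with an approximate nearest-neighbour index, but the shape of the data stays the same.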

The deceptively simple framing — "just split the document into pieces" — hides three real constraints.

The whole craft of chunking is trading these three constraints off against each other. The strategies below are different ways to make that trade-off.

The 6 Chunking Strategies That Matter

Strategy 01

Fixed-size chunking (with overlap)

Split the document into chunks of a fixed token count (e.g., 500 tokens) with a fixed overlap between consecutive chunks (e.g., 50 tokens). This is the simplest possible strategy and a workable default for most RAG systems.

Why it works: Predictable chunk sizes mean predictable embedding behavior. Overlap defends against the worst case where the answer lands exactly at a chunk boundary. Fast to compute at indexing time, easy to debug at retrieval time, and good enough for most domains.

Best for: FAQs, reference docs, structured technical content, anything where the document doesn't have strong narrative flow
Weakness: Splits at arbitrary positions, so chunks may start mid-sentence or break up tight semantic units

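A minimal sketch of fixed-size chunking with overlap, assuming the document has already been tokenized into a list. The function name `fixed_size_chunks` is illustrative, and the whitespace split in the usage line is a stand-in for your embedding model's real tokenizer.

```python
def fixed_size_chunks(tokens, size=500, overlap=50):
    """Slide a window of `size` tokens forward by `size - overlap` each step."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    stride = size - overlap
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + size])
        # Stop once a window reaches the end, so no chunk is pure overlap.
        if start + size >= len(tokens):
            break
    return chunks


# Whitespace splitting is an assumption made to keep the sketch dependency-free.
words = ("lorem ipsum " * 600).split()
chunks = fixed_size_chunks(words, size=500, overlap=50)
```

The last 50 tokens of each chunk reappear as the first 50 of the next one, which is what protects an answer that lands on a boundary.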
Strategy 02

Recursive character chunking

Split the document by trying a hierarchy of separators in order: paragraph breaks first (\n\n), then sentence breaks (. ), then word breaks, then character breaks — stopping when each chunk is below the target size. This is the strategy LangChain calls RecursiveCharacterTextSplitter and it's a small but meaningful upgrade over fixed-size.

Why it works: By preferring paragraph and sentence boundaries when they exist, the resulting chunks are far more likely to be semantically coherent. Embeddings are higher quality. Retrieval ranking is more stable.

Best for: The new default for almost everything. Replace fixed-size chunking with this; you'll see a measurable retrieval-quality lift on most corpora
Weakness: Slightly slower at indexing; still doesn't understand document structure (headings, lists, tables)

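The separator hierarchy can be sketched in a few lines of plain Python. This is a simplified illustration of the idea, not LangChain's implementation: `recursive_split` is a hypothetical name, sizes are measured in characters rather than tokens, and overlap is left out to keep the recursion readable.

```python
def recursive_split(text, max_chars=200, separators=("\n\n", ". ", " ", "")):
    """Split on the coarsest separator that works; recurse with finer ones."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    if not separators or separators[0] == "":
        # Last resort: hard cut at character positions.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= max_chars:
            current = candidate          # keep packing pieces into this chunk
            continue
        if current:
            chunks.append(current)       # close the chunk that is already full
            current = ""
        if len(piece) <= max_chars:
            current = piece
        else:
            # This piece alone is too big: try the next, finer separator.
            chunks.extend(recursive_split(piece, max_chars, finer))
    if current:
        chunks.append(current)
    return [c for c in chunks if c.strip()]


text = (
    "Para one is short.\n\n"
    "Second paragraph has two sentences. Here is the other one.\n\n"
    "Third."
)
parts = recursive_split(text, max_chars=40)
```

One visible side effect of this simplification: a separator that falls exactly on a chunk boundary is dropped, so a sentence can lose its trailing period. A production splitter would usually keep the separator attached to one side.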
Strategy 03

Document-structure-aware chunking

For documents with explicit structure — Markdown with headings, HTML with sections, JSON with fields — split along those structural boundaries instead of treating the document as flat text. A Markdown chunker uses headings as anchors; an HTML chunker splits on <section> or <article>; a code chunker splits on function or class boundaries.

Why it works: Authors already organized the document into meaningful units. Use them. A heading is a strong human-curated signal that says "everything below me is about this thing." Chunks that respect headings get embeddings that line up cleanly with topic queries.

Best for: Technical docs (Markdown), web content (HTML), source code, structured policy docs, anything with clear section boundaries
Weakness: Useless on flat narrative documents (transcripts, books, customer support tickets) that don't have structure to respect

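For Markdown, a heading-anchored chunker can be sketched as below. The function name `split_markdown_by_heading` and the output shape (a heading trail plus the section text) are assumptions for illustration. The sketch only recognizes `#`-style headings and does not skip `#` lines inside fenced code blocks, which a real chunker would need to handle.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_markdown_by_heading(markdown):
    """Return one chunk per section, tagged with its trail of parent headings."""
    sections = []
    path = []   # current heading trail as (level, title) pairs
    body = []   # lines collected since the last heading

    def flush():
        text = "\n".join(body).strip()
        if text:
            sections.append({"headings": [title for _, title in path], "text": text})
        body.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()  # close the previous section under its own heading trail
            level = len(match.group(1))
            # Drop headings at the same or deeper level before descending.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2).strip()))
        else:
            body.append(line)
    flush()
    return sections


doc = (
    "# Guide\nIntro text.\n"
    "## Install\nRun the installer.\n"
    "## Usage\nCall the API.\n"
    "# FAQ\nNone yet."
)
sections = split_markdown_by_heading(doc)
```

Keeping the heading trail as metadata (or prepending it to the chunk text before embedding) is one way to carry the "everything below me is about this thing" signal into the vector.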
Strategy 04

Semantic chunking

Use an embedding model to detect topic shifts within the document, and split at those shifts. Concretely: embed each sentence, walk through them looking for big drops in similarity between consecutive sentences, split at those drops. The resulting chunks are coherent units of meaning rather than fixed-size blocks.

Why it works: When it works, it's the highest-quality chunking you can do for narrative documents — each chunk is exactly one topic, embedded cleanly, ranks well at retrieval. When it doesn't work, you've spent a lot of compute at indexing and gotten chunks that are wildly variable in size, hard to budget against context windows, and hard to debug.

Best for: Long narrative documents where topics shift but headings don't exist, such as transcripts, books, support call logs, blog archives
Weakness: Expensive at indexing time; produces unpredictable chunk sizes; harder to debug; the quality lift is real but often not worth the operational complexity

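The "embed each sentence, split at similarity drops" loop looks roughly like this. To stay runnable without a model, the sketch uses a bag-of-words counter as a toy embedding; `toy_embed`, `semantic_chunks`, and the fixed `threshold` are all assumptions. A real pipeline would plug in an actual embedding model and would probably pick the threshold from the distribution of similarities rather than hard-coding it.

```python
import math
import re
from collections import Counter


def toy_embed(sentence):
    """Stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def semantic_chunks(sentences, embed=toy_embed, threshold=0.2):
    """Start a new chunk when consecutive sentences fall below `threshold`."""
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    chunks = [[sentences[0]]]
    for prev, cur, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev, cur) < threshold:
            chunks.append([sentence])    # similarity dropped: topic shift
        else:
            chunks[-1].append(sentence)  # same topic: extend current chunk
    return [" ".join(chunk) for chunk in chunks]


sentences = [
    "Refunds are issued within five days.",
    "Refunds for premium customers are issued faster.",
    "The API rate limit is sixty requests.",
    "The rate limit resets every minute.",
]
topics = semantic_chunks(sentences)
```

Note that nothing here bounds chunk size, which is exactly the "wildly variable" behaviour described above; a practical version would add minimum and maximum size guards.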
Strategy 05

Hierarchical / parent-document chunking

Store two layers of chunks: small chunks for retrieval (e.g., 200 tokens) and large chunks or full documents for context delivery (e.g., the full section the small chunk came from). Retrieval matches against the small chunks because they embed precisely; the LLM gets the larger parent context so it can actually answer the question.

Why it works: Solves the central tension in chunking — the chunk size that's best for retrieval is usually too small to be useful as LLM context. Decoupling retrieval-time chunks from context-delivery chunks lets you optimize both independently.

Best for: Complex documents where the relevant fact is short but needs surrounding context to answer the question (policy docs, legal text, technical specs)
Weakness: Doubles index complexity. You need a way to map small-chunk hits back to their parent context. Harder to operate.

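The mapping from small-chunk hits back to parent context is the part that tends to get hand-waved, so here is a minimal sketch of it. `ChildChunk`, `build_two_layer_index`, and `parents_for_hits` are hypothetical names, child size is counted in whitespace-separated words as a simplification, and the retrieval step itself is assumed to happen elsewhere.

```python
from dataclasses import dataclass


@dataclass
class ChildChunk:
    """Small retrieval chunk that remembers which parent it came from."""
    text: str
    parent_id: str


def build_two_layer_index(sections: dict[str, str], child_words: int = 40):
    """Keep each section whole as a parent and cut it into small children."""
    parents = dict(sections)
    children = []
    for parent_id, text in sections.items():
        words = text.split()
        for start in range(0, len(words), child_words):
            piece = " ".join(words[start:start + child_words])
            children.append(ChildChunk(piece, parent_id))
    return parents, children


def parents_for_hits(hits, parents):
    """Map retrieved children to parent texts, de-duplicated, in rank order."""
    seen = []
    for hit in hits:
        if hit.parent_id not in seen:
            seen.append(hit.parent_id)
    return [parents[parent_id] for parent_id in seen]


sections = {
    "refunds": " ".join(f"refund{i}" for i in range(100)),
    "shipping": " ".join(f"ship{i}" for i in range(100)),
}
parents, children = build_two_layer_index(sections, child_words=40)
# Pretend the retriever returned these three children, best first.
context = parents_for_hits([children[0], children[1], children[3]], parents)
```

De-duplicating parents matters in practice: several small hits often come from the same section, and sending that section to the LLM more than once wastes context budget.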
Strategy 06

Contextual chunking (LLM-augmented)

Before embedding each chunk, use an LLM to generate a short summary of how the chunk fits into the overall document, and prepend that summary to the chunk's text. The embedding then captures both the local content and the document-level context.

Why it works: Solves the "this chunk references something earlier in the document" problem. If chunk N+5 says "this policy applies only to the customers described above," embedding chunk N+5 alone produces a vague vector; embedding chunk N+5 with the summary "context: section about premium customer refunds" produces a vector that retrieves correctly.

Best for: Long, internally-referential documents where chunks routinely depend on context defined elsewhere, such as legal contracts, technical specs, multi-section policy documents
Weakness: Expensive at indexing time (one LLM call per chunk). Mostly worth it for the highest-value, lowest-update corpora, not for high-churn knowledge bases
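Mechanically, contextual chunking is a small wrapper around whatever LLM call you use. The sketch below takes that call as a parameter; `contextualize` and `fake_summarize` are hypothetical, and `fake_summarize` simply returns the document's first line so the example runs without a model. The `context:` prefix format mirrors the example above and is otherwise arbitrary.

```python
from typing import Callable


def contextualize(
    chunks: list[str],
    document: str,
    summarize: Callable[[str, str], str],
) -> list[str]:
    """Prepend a short, document-aware context line to each chunk before embedding."""
    augmented = []
    for chunk in chunks:
        # One call per chunk: this is where the indexing cost comes from.
        context = summarize(document, chunk).strip()
        augmented.append(f"context: {context}\n\n{chunk}" if context else chunk)
    return augmented


def fake_summarize(document: str, chunk: str) -> str:
    """Stand-in for an LLM call (assumption): use the document's title line."""
    return document.splitlines()[0]


document = (
    "Premium customer refunds\n"
    "Premium customers receive refunds within five days.\n"
    "This policy applies only to the customers described above."
)
chunks = document.splitlines()[1:]
to_embed = contextualize(chunks, document, fake_summarize)
```

A reasonable design choice is to embed the augmented text but keep the original chunk text alongside it, so the generated summary influences retrieval without being shown to the LLM as if it were source material.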

How to Choose: A Decision Path

The honest decision tree for a new RAG system looks like this:

  1. Start with recursive character chunking, ~500–800 tokens, ~10–20% overlap. Build a small evaluation set of representative queries with known correct answers from your corpus. Measure retrieval recall@K on that set. This is your baseline.
  2. If your documents have structure (headings, sections, JSON), upgrade to document-structure-aware chunking. This is almost always a free win for Markdown, HTML, and code corpora.
  3. If your eval set shows that the answer often spans chunks, increase chunk size or overlap. Don't immediately reach for semantic chunking; just give the chunks more room.
  4. If your eval set shows that chunks retrieve correctly but the LLM gives bad answers because context is too thin, try hierarchical chunking. Keep small chunks for retrieval, deliver larger parents to the LLM.
  5. If your eval set shows that chunks routinely reference context defined elsewhere ("this applies to the customers mentioned above"), try contextual chunking — but only on the highest-value, lowest-update part of your corpus.
  6. Reach for semantic chunking last. It's the strategy with the biggest gap between "sounds elegant" and "actually pays off in production."
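Every step in that path leans on the baseline from step 1, so here is one way the measurement might look. This sketch scores a query as a hit when at least one known-relevant chunk appears in the top K, which is one common reading of recall@K; `recall_at_k` and the `retrieve(query, k)` callable are assumed interfaces, and the retriever here is a hard-coded lookup standing in for your real pipeline.

```python
def recall_at_k(eval_set, retrieve, k=5):
    """Fraction of queries with at least one relevant chunk id in the top k."""
    if not eval_set:
        raise ValueError("eval_set must not be empty")
    hits = 0
    for query, relevant_ids in eval_set:
        retrieved = set(retrieve(query, k)[:k])
        if retrieved & set(relevant_ids):
            hits += 1
    return hits / len(eval_set)


# Hard-coded rankings stand in for a real retriever (assumption).
fake_rankings = {
    "how long do refunds take?": ["c3", "c1", "c7"],
    "what is the rate limit?": ["c2", "c4", "c8"],
}


def fake_retrieve(query, k):
    return fake_rankings[query][:k]


eval_set = [
    ("how long do refunds take?", {"c1"}),
    ("what is the rate limit?", {"c9"}),
]
baseline = recall_at_k(eval_set, fake_retrieve, k=2)
```

Rerun the same function after every chunking change. If the number does not move, the change was not worth its operational cost.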

Five Mistakes That Quietly Tank RAG Quality


Sensible default chunk size: 500 tokens
Sensible default overlap: 10–20%
Eval sets every RAG system needs: 1

Frequently Asked Questions

What is chunking in RAG and why does it matter?
Chunking is the step where a document is split into smaller pieces before being embedded into a vector database. It matters because retrieval quality is bounded by chunk quality — if the relevant fact about a customer's refund policy is split across two chunks, neither will match a query well, and the LLM will either hallucinate or refuse to answer. Many production RAG failures that people attribute to the embedding model or the LLM are retrieval failures — and suboptimal chunking is a leading contributor to those retrieval failures.
What is the best chunk size for RAG?
There is no single best chunk size — the right answer depends on document type and query type. As a rule of thumb, structured documents (technical docs, FAQs) benefit from smaller chunks (roughly 200–500 tokens) because facts are dense and atomic. Narrative documents (blog posts, transcripts, policies) need larger chunks (roughly 500–1000 tokens) because context is needed to disambiguate references. The honest answer is: instrument retrieval quality on real queries from your domain and tune chunk size against that ground truth. Start with 500 tokens and 50-token overlap as a sensible default, then iterate.
Should I use fixed-size or semantic chunking?
Use a simple size-based splitter with overlap as your default: fixed-size chunking, or recursive character chunking, which respects paragraph and sentence boundaries where it can. Both are simple, predictable, fast to compute, and good enough for most domains. Semantic chunking (splitting at meaning boundaries detected by embeddings) is more expensive at indexing time and harder to debug, and its quality gain is meaningful in some domains (long technical narrative, legal documents) and negligible in others (FAQs, reference docs). Don't reach for semantic chunking until you have a measurable retrieval-quality problem that simpler chunking can't solve.
What is the role of chunk overlap?
Overlap is the technique of letting consecutive chunks share some content (e.g., the last 50 tokens of chunk N appear as the first 50 tokens of chunk N+1). It exists to prevent the worst-case scenario where the answer to a query lands exactly at a chunk boundary and is split across two chunks, neither of which embeds cleanly. The trade-off: more overlap means better recall but a larger index and more duplicate context fed to the LLM. Common defaults are 10–20% of chunk size; tune from there.
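The index-size side of that trade-off is easy to put numbers on. The sketch below assumes a sliding window that advances by chunk size minus overlap and stops once a window reaches the end of the document; `chunk_count` and `index_inflation` are illustrative helper names.

```python
def chunk_count(n_tokens, size, overlap):
    """Number of chunks a sliding window produces for one document."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    if n_tokens <= 0:
        return 0
    if n_tokens <= size:
        return 1
    stride = size - overlap
    # Integer ceiling of (n_tokens - overlap) / stride.
    return -(-(n_tokens - overlap) // stride)


def index_inflation(size, overlap):
    """Approximate growth factor of the index relative to zero overlap."""
    return size / (size - overlap)


no_overlap = chunk_count(100_000, 500, 0)
ten_percent = chunk_count(100_000, 500, 50)
twenty_percent = chunk_count(100_000, 500, 100)
```

For a 100,000-token document at 500 tokens per chunk, that works out to 200 chunks with no overlap, 223 at 10%, and 250 at 20%, so the common defaults cost roughly 11% to 25% more rows.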
How does chunking interact with the embedding model?
Heavily. Different embedding models have different optimal input lengths — some are trained on short text (queries, sentences), others on long passages. Embedding a 2000-token chunk with a model trained on 256-token inputs produces a low-quality embedding that averages out the document's meaning into noise. Match your chunk size to your embedding model's training distribution. If you switch embedding models, expect to re-evaluate your chunking strategy from scratch.
Do I still need chunking with long-context LLMs?
Sometimes yes, sometimes no. For small corpora that fit entirely in the context window of a long-context model, you can skip retrieval and chunking entirely — just stuff the documents and ask the question. For larger corpora, chunking is still required because (1) you can't fit the whole corpus in context, (2) cost grows linearly with input tokens so retrieving 5 relevant chunks is far cheaper than dumping 200k tokens every query, and (3) LLM accuracy degrades on long contexts, especially in the middle. The right move in 2026 is to use long-context LLMs to retrieve fewer but larger chunks, not to skip retrieval entirely.
