Chunking Strategies: Splitting Documents for Effective Retrieval

Your RAG system indexes a 40-page technical manual. A user asks “What are the system requirements for the Enterprise plan?” The retrieval returns a chunk that says “…minimum 16GB RAM and 4 CPU cores. For GPU acceleration, an NVIDIA card with…” - it starts mid-sentence and ends mid-paragraph. The chunk before it had “Enterprise plan system requirements:” as its last line. The chunk after it has “…at least 8GB VRAM is recommended.”

The answer exists in your knowledge base. But your chunking strategy split it across three chunks. The retrieval found the middle chunk (it mentions RAM and CPU), but without the header (“Enterprise plan”) and the conclusion (“8GB VRAM recommended”), the model gives an incomplete answer. Worse, it might hallucinate the plan name since the retrieved chunk does not explicitly say “Enterprise.”

Chunking is not just text splitting. It is information architecture. How you chunk determines what your retrieval system can find, what context the model receives, and ultimately whether your RAG system gives correct answers.

What chunking actually is

Chunking is the process of dividing documents into smaller segments that can be independently embedded and retrieved. Each chunk becomes a unit of retrieval - when a user asks a question, you search for and return whole chunks.

The fundamental tension: smaller chunks are more precise (match specific questions closely) but larger chunks provide more context (contain enough information to be useful without additional context).

graph TD
  subgraph spectrum["Chunk Size Spectrum"]
      S["Small (100-200 tokens)
+ Precise matching
+ More chunks fit in context
- May split meaning
- Need more chunks for full answer"]
      M["Medium (400-800 tokens)
+ Good balance
+ Usually self-contained
+ Sweet spot for most use cases
- May include irrelevant text"]
      L["Large (1000-2000 tokens)
+ Rich context
+ Fewer split answers
- Less precise matching
- Fewer chunks fit in budget
- May dilute relevance"]
  end

  style S fill:#E1F5EE,stroke:#0F6E56,color:#085041
  style M fill:#EEEDFE,stroke:#534AB7,color:#3C3489
  style L fill:#FAEEDA,stroke:#854F0B,color:#633806

Chunking strategies

Fixed-size chunking

The simplest approach: split text every N tokens with optional overlap.

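def fixed_size_chunk(text, chunk_size=500, overlap=50):
    # tokenize/detokenize are placeholders for your tokenizer's encode/decode.
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    tokens = tokenize(text)
    step = chunk_size - overlap  # distance between consecutive chunk starts
    chunks = []
    for i in range(0, len(tokens), step):
        chunks.append(detokenize(tokens[i:i + chunk_size]))
        if i + chunk_size >= len(tokens):
            break  # this window reached the end; skip a redundant tail chunk
    return chunks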

Pros: Simple, predictable chunk sizes, easy to budget context window. Cons: Splits mid-sentence, mid-paragraph, mid-idea. No awareness of content structure.

When to use: As a baseline or for unstructured content with no clear section boundaries.

Recursive/hierarchical splitting

Split at natural boundaries in order of preference: sections → paragraphs → sentences → words. Try the largest split first; if a section is too big, split at paragraphs within it.

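SEPARATORS = ["\n\n\n", "\n\n", "\n", ". ", " "]

def recursive_split(text, max_size=800, separators=SEPARATORS):
    # Sizes are in characters; separators are dropped from the output.
    if len(text) <= max_size:
        return [text] if text.strip() else []
    if not separators:
        # Nothing left to split on: hard-cut every max_size characters.
        return [text[i:i + max_size] for i in range(0, len(text), max_size)]
    chunks = []
    for piece in text.split(separators[0]):
        # Only pieces that are still too large get split at the next, finer separator.
        chunks.extend(recursive_split(piece, max_size, separators[1:]))
    return chunks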

Pros: Respects document structure, keeps paragraphs intact. Cons: Uneven chunk sizes, some chunks may be very small (a single short paragraph).

When to use: General-purpose documents with clear paragraph structure (blog posts, documentation, articles).
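
The "very small chunks" drawback is commonly softened by merging adjacent pieces back together until they approach the size limit. The following is a minimal sketch of that idea, not any library's implementation: `merge_small_pieces` is a hypothetical helper, and sizes are assumed to be measured in characters rather than tokens.

```python
def merge_small_pieces(pieces, max_size=800, joiner="\n\n"):
    """Greedily merge adjacent pieces so each chunk stays within max_size characters."""
    chunks = []
    current = ""
    for piece in pieces:
        if not piece.strip():
            continue  # skip empty fragments left over from splitting
        candidate = piece if not current else current + joiner + piece
        if len(candidate) <= max_size:
            current = candidate  # still fits: keep growing the current chunk
        else:
            if current:
                chunks.append(current)
            # Start a new chunk. A single piece larger than max_size is kept as-is.
            current = piece
    if current:
        chunks.append(current)
    return chunks


paragraphs = ["Intro.", "Setup steps.", "A" * 30, "Closing note."]
for chunk in merge_small_pieces(paragraphs, max_size=25):
    print(len(chunk), repr(chunk))
```

Merging only ever joins neighbors, so document order is preserved and a short paragraph ends up attached to the text around it instead of being indexed alone.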

Semantic chunking

Use embeddings to determine where to split. Compute sentence embeddings, and split where the semantic similarity between consecutive sentences drops below a threshold - indicating a topic change.

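def semantic_chunk(sentences, threshold=0.5):
    # embed_batch and cosine_similarity are placeholders for your embedding
    # model and similarity function.
    if not sentences:
        return []
    embeddings = embed_batch(sentences)
    chunks = []
    current_chunk = [sentences[0]]

    for i in range(1, len(sentences)):
        # Compare each sentence with the one immediately before it.
        similarity = cosine_similarity(embeddings[i-1], embeddings[i])
        if similarity < threshold:
            # Similarity dropped: treat this as a topic boundary.
            chunks.append(" ".join(current_chunk))
            current_chunk = [sentences[i]]
        else:
            current_chunk.append(sentences[i])

    chunks.append(" ".join(current_chunk))
    return chunks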

Pros: Chunks are semantically coherent, topic boundaries are respected. Cons: Expensive (requires embedding every sentence), unpredictable chunk sizes, threshold tuning needed.

When to use: When content has multiple topics within a single page and preserving topic coherence is critical.

Document structure-aware chunking

Use the document’s existing structure (headers, sections, lists) as chunk boundaries:

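def structure_aware_chunk(markdown_doc):
    chunks = []
    current_section = {"header": "", "content": ""}

    for line in markdown_doc.split("\n"):
        # Note: this also matches "#" comment lines inside fenced code blocks.
        if line.startswith("#"):
            if current_section["content"].strip():
                chunks.append(current_section)
            current_section = {"header": line, "content": ""}
        else:
            current_section["content"] += line + "\n"

    # Flush the final section, which the loop never appends on its own.
    if current_section["content"].strip():
        chunks.append(current_section)
    return chunks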

Pros: Chunks align with author’s intent, headers provide natural metadata. Cons: Sections vary wildly in size (some might be 3000 tokens), requires structured input.

When to use: Well-structured documentation, wikis, markdown files, HTML with clear heading hierarchy.

Agentic/proposition chunking

Use an LLM to decompose documents into atomic propositions - self-contained factual statements:

Input: "Python was created by Guido van Rossum and first released in 1991. It emphasizes code readability."

Propositions:
1. "Python was created by Guido van Rossum"
2. "Python was first released in 1991"
3. "Python emphasizes code readability"

Pros: Each chunk is a complete, self-contained fact. Extremely precise retrieval. Cons: Expensive (LLM call per document), loses surrounding context, generates many tiny chunks.

When to use: Fact-dense reference material where precise retrieval of individual facts matters more than surrounding context.

graph LR
  subgraph strategies["Strategy Selection Guide"]
      Q1["Document has clear structure?"]
      Q2["Topics change within pages?"]
      Q3["Need precise fact retrieval?"]
      Q4["Budget for processing?"]
      
      A1["Structure-aware chunking"]
      A2["Semantic chunking"]
      A3["Proposition chunking"]
      A4["Recursive splitting"]
  end

  Q1 -->|"Yes"| A1
  Q2 -->|"Yes"| A2
  Q3 -->|"Yes"| A3
  Q4 -->|"Low"| A4

  style A1 fill:#EEEDFE,stroke:#534AB7,color:#3C3489
  style A2 fill:#E1F5EE,stroke:#0F6E56,color:#085041
  style A3 fill:#FAEEDA,stroke:#854F0B,color:#633806
  style A4 fill:#F1EFE8,stroke:#888780,color:#444441

Overlap: the insurance policy

Adding overlap between chunks reduces the risk that information at a chunk boundary gets cut off from its context, because text near the boundary appears in both neighboring chunks. Typical overlap: 10-20% of chunk size.

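With 500-token chunks and a 50-token overlap, each chunk starts 450 tokens after the previous one:

Chunk 1: tokens 0-499
Chunk 2: tokens 450-949 (tokens 450-499 repeat the end of chunk 1)
Chunk 3: tokens 900-1399 (tokens 900-949 repeat the end of chunk 2) ...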

Overlap increases storage and retrieval noise (the same passage might be in multiple chunks) but catches boundary cases where the answer spans two chunks.
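
To make the storage cost concrete, here is a small sketch that computes the token span of each chunk and the resulting duplication factor. The function names `chunk_spans` and `storage_overhead` are made up for this illustration, and spans use an exclusive end index.

```python
def chunk_spans(n_tokens, chunk_size=500, overlap=50):
    """Return (start, end) token spans, end exclusive, for fixed-size chunks with overlap."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    spans = []
    start = 0
    while start < n_tokens:
        end = min(start + chunk_size, n_tokens)
        spans.append((start, end))
        if end == n_tokens:
            break  # the last chunk already reaches the end of the document
        start += step
    return spans


def storage_overhead(n_tokens, chunk_size=500, overlap=50):
    """Ratio of tokens stored across all chunks to tokens in the document."""
    if n_tokens == 0:
        return 0.0
    stored = sum(end - start for start, end in chunk_spans(n_tokens, chunk_size, overlap))
    return stored / n_tokens


print(chunk_spans(1400))  # [(0, 500), (450, 950), (900, 1400)]
print(round(storage_overhead(1400), 3))
```

Running the same calculation for a few overlap values on your own document lengths is a quick way to see how much index growth a given overlap setting buys you.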

The parent document pattern

A powerful technique that decouples retrieval granularity from context granularity:

Index small chunks (200-300 tokens) for precise semantic matching
Return parent chunks (1000-1500 tokens) for rich generation context

When retrieval finds a small chunk, look up its parent (the larger section it belongs to) and return that instead. This gives you the precision of small chunks with the context richness of large chunks.

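# save_parent, split_into_small, store, vector_search, and get_parent are
# placeholders for your own storage and search layer.

# During ingestion
parent_id = save_parent(section_text)  # store the ~1000-token section, get back its ID
child_chunks = split_into_small(section_text, size=200)

# Store mapping: each child carries a pointer to its parent
for child in child_chunks:
    store(child, metadata={"parent_id": parent_id})

# During retrieval
matched_child = vector_search(query)  # best-matching small chunk
parent = get_parent(matched_child.metadata["parent_id"])
return parent  # Return the rich parent context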
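
For a version you can run end to end, here is an in-memory sketch of the same pattern. It is an illustration under loud assumptions: `ParentChildIndex` is a hypothetical class, children are fixed windows of 8 words, and matching uses plain word overlap as a toy stand-in for vector search. The first sample section reuses the Enterprise plan text from the opening example; the billing section is made-up filler.

```python
import re


def words(text):
    """Lowercased word set; a toy stand-in for an embedding."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class ParentChildIndex:
    def __init__(self):
        self.parents = {}    # parent_id -> full section text
        self.children = []   # (child_text, parent_id) pairs

    def add_section(self, parent_id, section_text, child_size=8):
        """Store the section and index small word windows that point back to it."""
        self.parents[parent_id] = section_text
        tokens = section_text.split()
        for i in range(0, len(tokens), child_size):
            child = " ".join(tokens[i:i + child_size])
            self.children.append((child, parent_id))

    def retrieve(self, query):
        """Match against children, but return the parent section of the best match."""
        if not self.children:
            return None
        query_words = words(query)
        _, best_parent_id = max(
            self.children, key=lambda pair: len(query_words & words(pair[0]))
        )
        return self.parents[best_parent_id]


index = ParentChildIndex()
index.add_section(
    "enterprise-requirements",
    "Enterprise plan system requirements: minimum 16GB RAM and 4 CPU cores. "
    "For GPU acceleration, an NVIDIA card with at least 8GB VRAM is recommended.",
)
index.add_section(
    "billing",
    "Invoices are issued monthly. Payment is due within 30 days of the invoice date.",
)
print(index.retrieve("How much RAM and how many CPU cores?"))
```

The query only matches a fragment about RAM and CPU, yet the returned text includes both the plan name and the VRAM recommendation, which is exactly the failure from the opening example being repaired.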

Where chunking breaks

The “stranded reference” problem

A chunk says “As mentioned in the previous section, the timeout is…” but the previous section is in a different chunk. The model cannot resolve the reference. Solutions: include section headers as context in every chunk, use overlap, or use parent document retrieval.
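
The first of those fixes, carrying section headers into every chunk, can be sketched as follows. This assumes Markdown input with `#`-style headings; `chunks_with_heading_path` is a hypothetical helper, and it does not skip `#` lines inside fenced code blocks, which a production version would need to handle.

```python
def chunks_with_heading_path(markdown_doc):
    """Split Markdown at headings and prefix each chunk with its heading path."""
    path = []    # stack of (level, title) for the headings currently in scope
    body = []    # lines collected for the current section
    chunks = []

    def flush():
        text = "\n".join(body).strip()
        if text:
            trail = " > ".join(title for _, title in path)
            chunks.append(trail + "\n" + text if trail else text)
        body.clear()

    for line in markdown_doc.split("\n"):
        title = line.lstrip("#")
        level = len(line) - len(title)  # number of leading "#" characters
        if 1 <= level <= 6 and title.startswith(" "):
            flush()
            # Leave any sections at the same or a deeper level before entering this one.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, title.strip()))
        else:
            body.append(line)
    flush()
    return chunks


doc = (
    "# Guide\nIntro text.\n"
    "## Timeouts\nThe timeout is 30 seconds.\n"
    "## Retries\nAs mentioned in the previous section, retries respect the timeout."
)
for chunk in chunks_with_heading_path(doc):
    print(chunk, end="\n---\n")
```

The heading path does not resolve the reference by itself, but it tells both the retriever and the model where the chunk sits in the document, which makes a neighboring "Timeouts" chunk far easier to pull in.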

Table and list splitting

Tables split across chunk boundaries are useless. A chunk containing rows 5-10 of a table without the header row is uninterpretable. Solutions: keep tables as single chunks regardless of size, or include the table header with every table chunk.
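
A sketch of the second option, repeating the header in every table chunk. It assumes a Markdown pipe table whose first two lines are the header row and the separator row; `split_markdown_table` and the sample rows are invented for the illustration.

```python
def split_markdown_table(table_text, max_rows=20):
    """Split a Markdown table into chunks that each repeat the header rows."""
    lines = [line for line in table_text.strip().split("\n") if line.strip()]
    # Assumption: line 1 is the header row, line 2 is the |---|---| separator.
    header, rows = lines[:2], lines[2:]
    if not rows:
        return ["\n".join(header)]
    chunks = []
    for i in range(0, len(rows), max_rows):
        chunks.append("\n".join(header + rows[i:i + max_rows]))
    return chunks


table = """
| Row | Value |
|---|---|
| r1 | 1 |
| r2 | 2 |
| r3 | 3 |
| r4 | 4 |
| r5 | 5 |
"""
for part in split_markdown_table(table, max_rows=2):
    print(part, end="\n\n")
```

Every chunk is now a valid table on its own, so a chunk holding only the last rows can still be read against the column names.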

Code splitting

Splitting a code file at arbitrary token boundaries produces broken, unparsable chunks. Better: split at function/class boundaries using AST parsing. Each function becomes its own chunk with its imports and class context included.
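
For Python sources, the standard-library `ast` module is enough to sketch this. The example below is a simplified take, assuming Python 3.8+ (for `end_lineno`): it emits one chunk per top-level function or class, prefixes each with the file's top-level imports, and keeps classes whole rather than splitting them per method. `chunk_python_source` is a hypothetical helper name.

```python
import ast


def chunk_python_source(source):
    """Return one chunk per top-level function or class, prefixed with the file's imports."""
    tree = ast.parse(source)
    lines = source.splitlines()
    imports = []
    chunks = []
    for node in tree.body:
        # Start at the first decorator, if any, so decorators stay with their definition.
        start = min([node.lineno] + [d.lineno for d in getattr(node, "decorator_list", [])])
        segment = "\n".join(lines[start - 1:node.end_lineno])
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            imports.append(segment)
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({"name": node.name, "code": segment})
    header = "\n".join(imports)
    if header:
        for chunk in chunks:
            chunk["code"] = header + "\n\n" + chunk["code"]
    return chunks


sample = """import os
from pathlib import Path


def cwd():
    return os.getcwd()


class Finder:
    def here(self):
        return Path('.')
"""
for chunk in chunk_python_source(sample):
    print(chunk["name"])
    print(chunk["code"], end="\n---\n")
```

Because each chunk is cut at node boundaries, every chunk parses on its own, which is the property arbitrary token-based splitting loses.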

Varying document types

A one-size-fits-all chunking strategy fails when your knowledge base contains FAQs (short Q&A pairs), technical manuals (long sections), API docs (structured references), and meeting notes (unstructured). Use different chunking strategies per document type.

Real-world approaches

LangChain - provides RecursiveCharacterTextSplitter, MarkdownTextSplitter, and others as composable building blocks
LlamaIndex - offers SentenceSplitter, SemanticSplitterNodeParser, and HierarchicalNodeParser, which builds parent-child relationships between nodes
Pinecone - recommends 200-400 token chunks for their embedding models, with metadata for filtering
Notion - chunks at block level (each paragraph, heading, list, or table is a natural chunk)
Anthropic’s contextual retrieval - adds a short context summary to each chunk during indexing (generated by LLM) that describes what the chunk is about within the larger document

How to apply in practice

Start with recursive splitting at 500-800 tokens. This is the highest-ROI starting point for most document types. Only add complexity when your eval reveals specific chunking-related failures.

Always include metadata with chunks. At minimum: document title, section heading, chunk position (first/middle/last in section), and source URL. This metadata powers filtering and helps the model cite sources.

Test with your actual queries. Take 50 real user queries. For each, manually identify which document section contains the answer. Then check: does your chunking strategy keep that section intact? If it is split across chunks, your strategy needs adjustment.
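
Once you have hand-labelled the answer passage for each query, the "is it intact?" check is easy to automate. A minimal sketch, assuming answers are labelled as exact passages from the source text; `find_intact_chunk` and `split_rate` are hypothetical helper names, and the sample chunks reuse the Enterprise plan example from the introduction.

```python
def normalize(text):
    """Collapse whitespace and lowercase so line breaks do not hide a match."""
    return " ".join(text.lower().split())


def find_intact_chunk(chunks, answer_passage):
    """Return the index of the first chunk containing the whole passage, or None if it was split."""
    target = normalize(answer_passage)
    for i, chunk in enumerate(chunks):
        if target in normalize(chunk):
            return i
    return None


def split_rate(chunks, answer_passages):
    """Fraction of labelled answer passages that no single chunk contains in full."""
    if not answer_passages:
        return 0.0
    missing = sum(find_intact_chunk(chunks, p) is None for p in answer_passages)
    return missing / len(answer_passages)


chunks = [
    "Enterprise plan system requirements:",
    "minimum 16GB RAM and 4 CPU cores. For GPU acceleration, an NVIDIA card with",
    "at least 8GB VRAM is recommended.",
]
answers = [
    "minimum 16GB RAM and 4 CPU cores",
    "an NVIDIA card with at least 8GB VRAM",
]
print(split_rate(chunks, answers))
```

Comparing this rate across candidate chunking configurations gives you a retrieval-independent signal: it measures the damage done by splitting before any embedding model is involved.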

Measure retrieval precision at the chunk level. Of the top-5 retrieved chunks, how many actually contain relevant information? Low precision means your chunks are either too large (matching too broadly) or split poorly (matching on irrelevant fragments within a chunk).
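
Chunk-level precision is a few lines of code once you have relevance judgments. In this sketch the chunk IDs and the judged-relevant sets are assumed to come from your own labelling, and the function names are illustrative. When fewer than k chunks are returned, it divides by the number actually returned, which is one of several conventions.

```python
def precision_at_k(retrieved_ids, relevant_ids, k=5):
    """Share of the top-k retrieved chunks that were judged relevant."""
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    relevant = set(relevant_ids)
    return sum(1 for chunk_id in top_k if chunk_id in relevant) / len(top_k)


def mean_precision_at_k(results, k=5):
    """Average precision@k over (retrieved_ids, relevant_ids) pairs, one pair per query."""
    if not results:
        return 0.0
    return sum(precision_at_k(retrieved, relevant, k) for retrieved, relevant in results) / len(results)


# One query: five retrieved chunks, two of them judged relevant.
print(precision_at_k(["c1", "c2", "c3", "c4", "c5"], {"c2", "c5"}))  # 0.4
```

Track the mean over your whole query set per chunking configuration; a single query's score is too noisy to act on.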

Use different strategies for different content types. Maintain a chunking configuration per document type:

CHUNKING_CONFIG = {
    "faq": {"strategy": "qa_pairs", "size": 200},
    "technical_docs": {"strategy": "structure_aware", "size": 800},
    "api_reference": {"strategy": "per_endpoint", "size": 400},
    "meeting_notes": {"strategy": "recursive", "size": 600, "overlap": 100},
}

FAQ

Q: What is the optimal chunk size?

There is no universal optimum - it depends on your content, queries, and embedding model. That said, 400-800 tokens works well for the majority of use cases as a starting point. Smaller (200-300) for fact-dense FAQ-style content. Larger (1000-1500) for narrative content where context matters. The real answer: test 3-4 sizes on your eval set and pick the one with the highest retrieval precision.

Q: Should I use overlap? How much?

Use overlap (10-20% of chunk size) unless you are using a strategy that already handles boundaries (structure-aware, semantic). Overlap is insurance against boundary splits. The cost is slightly more storage and occasional duplicate retrieval. For a 500-token chunk, 50-100 tokens of overlap is typical. More than 25% overlap is rarely worth it: at that point at least a quarter of every chunk repeats its predecessor, which inflates the index and fills the top results with near-duplicates.

Q: How do I chunk PDFs, which often have broken text flow (columns, headers, footers)?

Pre-process PDFs before chunking. Use a layout-aware PDF parser (Adobe PDF Extract API, Unstructured.io, or PyMuPDF with layout analysis) that reconstructs reading order, removes headers/footers, and handles multi-column layouts. Only after you have clean, linear text should you apply chunking. For complex PDFs (scanned documents, forms), consider using a multimodal model to process page images directly rather than attempting text extraction.

Interview questions

Q: You are building a RAG system over your company’s technical documentation (500 documents, 200 pages average, mix of tutorials, API references, and architecture guides). Design your chunking strategy.

Multi-strategy approach based on document type: (1) API references - chunk per endpoint (each endpoint’s description, parameters, and examples as one chunk, ~300-500 tokens). (2) Tutorials - structure-aware chunking at section boundaries (h2/h3 headers), targeting 600-800 tokens per chunk with the section heading prepended. (3) Architecture guides - semantic chunking to detect topic transitions within long sections, targeting 800-1000 tokens. For all types: include document title and section path as metadata, add 50-token overlap between adjacent chunks, keep tables and code blocks intact (never split mid-table or mid-function). Index with parent document pattern: store both the precise chunk and a reference to the full section for retrieval context enrichment.

Q: Your RAG system retrieves relevant chunks but users complain answers are “incomplete” - they only get part of the answer. Diagnose and fix.

The answer likely spans multiple chunks. Diagnostic: for 20 “incomplete” queries, manually check if the full answer exists in one chunk or requires multiple. If split across chunks: (1) Increase chunk size at the relevant boundary type (if splitting mid-paragraph, use paragraph-level chunking). (2) Add more overlap to catch boundary information. (3) Implement parent document retrieval - find the small matching chunk but return its larger parent section. (4) Increase retrieval k to get adjacent chunks, then merge them before sending to the model. (5) Add a “context expansion” step: when a retrieved chunk starts or ends mid-sentence, automatically include the adjacent chunk. Measure after each fix to confirm improvement.
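
Fixes (4) and (5) both come down to pulling in neighbors and merging them in document order. A minimal sketch, assuming chunks are stored in document order, retrieval returns their indices, and chunks do not overlap (with overlap, the shared text would be repeated when joined). `expand_with_neighbors` is a hypothetical helper.

```python
def expand_with_neighbors(hit_indices, chunks, window=1):
    """Merge each retrieved chunk with its neighbors into contiguous passages, in document order."""
    wanted = set()
    for idx in hit_indices:
        for j in range(idx - window, idx + window + 1):
            if 0 <= j < len(chunks):
                wanted.add(j)

    passages = []
    run = []  # indices of the contiguous run being built
    for j in sorted(wanted):
        if run and j != run[-1] + 1:
            # Gap in the indices: close the current passage and start a new one.
            passages.append(" ".join(chunks[i] for i in run))
            run = []
        run.append(j)
    if run:
        passages.append(" ".join(chunks[i] for i in run))
    return passages


chunks = [
    "Enterprise plan system requirements:",
    "minimum 16GB RAM and 4 CPU cores. For GPU acceleration, an NVIDIA card with",
    "at least 8GB VRAM is recommended.",
]
print(expand_with_neighbors([1], chunks))
```

Overlapping windows from nearby hits collapse into a single passage, so the model never sees the same neighbor twice.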

Q: Compare fixed-size chunking vs semantic chunking. What are the tradeoffs in terms of quality, cost, and implementation complexity?

Fixed-size: O(n) processing time, deterministic output, even chunk sizes that simplify budgeting, but ignores content structure and splits meaning arbitrarily. Best for: large-scale ingestion where processing time matters and content is relatively homogeneous. Semantic: requires embedding every sentence (expensive at scale), variable chunk sizes complicate budget management, but produces semantically coherent chunks aligned with topic boundaries. Best for: high-value content where retrieval precision justifies processing cost. The cost difference is significant: semantic chunking of 10,000 documents means embedding millions of sentences during ingestion on top of the final chunk-level embeddings, while fixed-size chunking needs only the chunk-level embeddings. For most teams, recursive splitting is a practical middle ground: it respects paragraph and sentence boundaries without any extra embedding pass, so it captures much of the benefit of semantic chunking at a small fraction of the cost.