Building RAG

Document Chunking

Learn how to split documents into optimal chunks for retrieval accuracy.

Why Chunking Matters

LLMs have context window limits. A 10,000 word document can't fit in a single prompt. Chunking splits documents into smaller pieces that:

  • Fit in the context window
  • Can be retrieved by relevance
  • Maintain enough context to be useful

Chunking Strategies

Fixed-size chunking

Split by character count. Simple but may cut sentences mid-way.

Sentence/paragraph chunking

Split at natural boundaries. Better semantic coherence.

Recursive character splitting

Try to split at paragraphs, then sentences, then words. Most common approach (used in LangChain's default splitter).

Semantic chunking

Group sentences by semantic similarity. Most accurate but slower.

Overlap

Use overlap between chunks to prevent losing context at boundaries. Typical overlap: 10-20% of chunk size.

Example

python
import re
from typing import List

class TextChunker:
    def __init__(self, chunk_size: int = 500, overlap: int = 50):
        self.chunk_size = chunk_size
        self.overlap = overlap

    def chunk_by_fixed_size(self, text: str) -> List[str]:
        """Split text into fixed-size character chunks with overlap."""
        chunks = []
        start = 0
        while start < len(text):
            end = start + self.chunk_size
            chunks.append(text[start:end])
            start += self.chunk_size - self.overlap
        return chunks

    def chunk_by_paragraph(self, text: str) -> List[str]:
        """Split by double newline (paragraphs)."""
        paragraphs = [p.strip() for p in re.split(r'\n\n+', text) if p.strip()]
        return paragraphs

    def chunk_by_sentence(self, text: str) -> List[str]:
        """Split into sentence-based chunks."""
        sentences = re.split(r'(?<=[.!?])\s+', text)
        chunks = []
        current_chunk = []
        current_size = 0

        for sentence in sentences:
            if current_size + len(sentence) > self.chunk_size and current_chunk:
                chunks.append(' '.join(current_chunk))
                # Keep overlap
                overlap_sentences = current_chunk[-2:]
                current_chunk = overlap_sentences
                current_size = sum(len(s) for s in current_chunk)

            current_chunk.append(sentence)
            current_size += len(sentence)

        if current_chunk:
            chunks.append(' '.join(current_chunk))

        return chunks

    def add_metadata(self, chunks: List[str], source: str) -> List[dict]:
        """Add metadata to each chunk."""
        return [
            {
                "content": chunk,
                "metadata": {
                    "source": source,
                    "chunk_index": i,
                    "char_count": len(chunk),
                }
            }
            for i, chunk in enumerate(chunks)
        ]

# Test
chunker = TextChunker(chunk_size=200, overlap=30)

long_text = """
Python is a high-level, general-purpose programming language. Its design philosophy
emphasizes code readability using significant indentation. Python is dynamically typed
and garbage-collected. It supports multiple programming paradigms, including structured,
object-oriented and functional programming.

Python was created by Guido van Rossum and first released in 1991. Python consistently
ranks as one of the most popular programming languages.
"""

chunks = chunker.chunk_by_sentence(long_text)
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1} ({len(chunk)} chars): {chunk[:80]}...")
Try it yourself — PYTHON