Building RAG
Document Chunking
Learn how to split documents into optimal chunks for retrieval accuracy.
Why Chunking Matters
LLMs have context window limits. A 10,000 word document can't fit in a single prompt. Chunking splits documents into smaller pieces that:
- Fit in the context window
- Can be retrieved by relevance
- Maintain enough context to be useful
Chunking Strategies
Fixed-size chunking
Split by character count. Simple but may cut sentences mid-way.
Sentence/paragraph chunking
Split at natural boundaries. Better semantic coherence.
Recursive character splitting
Try to split at paragraphs, then sentences, then words. Most common approach (used in LangChain's default splitter).
Semantic chunking
Group sentences by semantic similarity. Most accurate but slower.
Overlap
Use overlap between chunks to prevent losing context at boundaries. Typical overlap: 10-20% of chunk size.
Example
python
import re
from typing import List
class TextChunker:
def __init__(self, chunk_size: int = 500, overlap: int = 50):
self.chunk_size = chunk_size
self.overlap = overlap
def chunk_by_fixed_size(self, text: str) -> List[str]:
"""Split text into fixed-size character chunks with overlap."""
chunks = []
start = 0
while start < len(text):
end = start + self.chunk_size
chunks.append(text[start:end])
start += self.chunk_size - self.overlap
return chunks
def chunk_by_paragraph(self, text: str) -> List[str]:
"""Split by double newline (paragraphs)."""
paragraphs = [p.strip() for p in re.split(r'\n\n+', text) if p.strip()]
return paragraphs
def chunk_by_sentence(self, text: str) -> List[str]:
"""Split into sentence-based chunks."""
sentences = re.split(r'(?<=[.!?])\s+', text)
chunks = []
current_chunk = []
current_size = 0
for sentence in sentences:
if current_size + len(sentence) > self.chunk_size and current_chunk:
chunks.append(' '.join(current_chunk))
# Keep overlap
overlap_sentences = current_chunk[-2:]
current_chunk = overlap_sentences
current_size = sum(len(s) for s in current_chunk)
current_chunk.append(sentence)
current_size += len(sentence)
if current_chunk:
chunks.append(' '.join(current_chunk))
return chunks
def add_metadata(self, chunks: List[str], source: str) -> List[dict]:
"""Add metadata to each chunk."""
return [
{
"content": chunk,
"metadata": {
"source": source,
"chunk_index": i,
"char_count": len(chunk),
}
}
for i, chunk in enumerate(chunks)
]
# Test
chunker = TextChunker(chunk_size=200, overlap=30)
long_text = """
Python is a high-level, general-purpose programming language. Its design philosophy
emphasizes code readability using significant indentation. Python is dynamically typed
and garbage-collected. It supports multiple programming paradigms, including structured,
object-oriented and functional programming.
Python was created by Guido van Rossum and first released in 1991. Python consistently
ranks as one of the most popular programming languages.
"""
chunks = chunker.chunk_by_sentence(long_text)
for i, chunk in enumerate(chunks):
print(f"Chunk {i+1} ({len(chunk)} chars): {chunk[:80]}...")Try it yourself — PYTHON