tokens & tokenization
The process of splitting text into subword units (tokens) that LLMs operate on. Different tokenizers produce different splits of the same text. Token count affects context-window limits and API cost.
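For instance, the same sentence usually yields different token sequences (and counts) under different vocabularies. A minimal comparison sketch, assuming the Hugging Face transformers library is installed and the gpt2 and bert-base-uncased tokenizers can be downloaded:
# Different tokenizers split the same text differently:
from transformers import AutoTokenizer
text = "Tokenization is model-specific."
for name in ["gpt2", "bert-base-uncased"]:
    tok = AutoTokenizer.from_pretrained(name)
    ids = tok.encode(text)  # BERT adds [CLS]/[SEP] special tokens by default
    print(name, len(ids), tok.convert_ids_to_tokens(ids))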
Syntax
tokens = tokenizer.encode(text)
text = tokenizer.decode(tokens)
Example
# Hugging Face GPT-2 tokenizer:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokens = tokenizer.encode("Hello, world!")
print(tokens) # [15496, 11, 995, 0]
print(len(tokens)) # 4
# Decode back:
print(tokenizer.decode(tokens)) # "Hello, world!"
# Rule of thumb: 1 token ≈ 4 chars (English)
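That ratio is only a rough guide for English prose; code and non-English text often use more tokens per character, so exact counts should come from the model's own tokenizer. A minimal sketch comparing the estimate to an actual count; estimate_tokens is a hypothetical helper, not part of any library:
# Rough character-based estimate vs. exact count from the tokenizer:
from transformers import AutoTokenizer

def estimate_tokens(text):          # hypothetical helper, illustration only
    return len(text) / 4            # rule of thumb: ~4 chars per token (English)

sample = "Tokens determine context usage and API cost."
tok = AutoTokenizer.from_pretrained("gpt2")
print(estimate_tokens(sample))      # approximate
print(len(tok.encode(sample)))      # exact, for GPT-2's tokenizer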