tokens & tokenization
The process of splitting text into subword units (tokens) that LLMs operate on. Different tokenizers produce different splits of the same text. Token count affects context-window limits and API cost.
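For instance, the same sentence usually yields different token sequences (and counts) under different vocabularies. A minimal comparison sketch, assuming the Hugging Face transformers library is installed and the gpt2 and bert-base-uncased tokenizers can be downloaded:
# Different tokenizers split the same text differently:
from transformers import AutoTokenizer
text = "Tokenization is model-specific."
for name in ["gpt2", "bert-base-uncased"]:
    tok = AutoTokenizer.from_pretrained(name)
    ids = tok.encode(text)  # BERT adds [CLS]/[SEP] special tokens by default
    print(name, len(ids), tok.convert_ids_to_tokens(ids))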
Syntax
tokens = tokenizer.encode(text)
text = tokenizer.decode(tokens)
Example
# Hugging Face GPT-2 tokenizer:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokens = tokenizer.encode("Hello, world!")
print(tokens) # [15496, 11, 995, 0]
print(len(tokens)) # 4
# Decode back:
print(tokenizer.decode(tokens)) # "Hello, world!"
# Rule of thumb: 1 token ≈ 4 chars (English)
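That ratio is only a rough guide for English prose; code and non-English text often use more tokens per character, so exact counts should come from the model's own tokenizer. A minimal sketch comparing the estimate to an actual count; estimate_tokens is a hypothetical helper, not part of any library:
# Rough character-based estimate vs. exact count from the tokenizer:
from transformers import AutoTokenizer

def estimate_tokens(text):          # hypothetical helper, illustration only
    return len(text) / 4            # rule of thumb: ~4 chars per token (English)

sample = "Tokens determine context usage and API cost."
tok = AutoTokenizer.from_pretrained("gpt2")
print(estimate_tokens(sample))      # approximate
print(len(tok.encode(sample)))      # exact, for GPT-2's tokenizer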