Modern AI
Transformers and LLMs
Understand the Transformer architecture that powers GPT, Claude, and all modern LLMs.
The Transformer Architecture
The Transformer was introduced in 2017 in the paper "Attention Is All You Need" by Vaswani et al. at Google. It revolutionized NLP and became the foundation of virtually all modern LLMs.
Key Innovation: Self-Attention
Self-attention allows each word to "attend" to every other word in the sequence, capturing long-range dependencies that RNNs struggled with.
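A minimal sketch of single-head scaled dot-product self-attention in NumPy can make this concrete. The weight matrices below are random stand-ins, not trained parameters, and real Transformers use many heads and layers:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # Project each token's vector into a query, key, and value
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    # Every position scores every other position (long-range by construction)
    scores = Q @ K.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)  # each row is a distribution over positions
    # Output per position = attention-weighted mix of all value vectors
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
X = rng.normal(size=(seq_len, d_model))          # toy token embeddings
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
out, weights = self_attention(X, Wq, Wk, Wv)
print(out.shape)             # (4, 8): one contextualized vector per token
print(weights.sum(axis=-1))  # each row sums to ~1.0
```

Note that unlike an RNN, nothing in this computation is sequential: position 0 can attend to position 3 just as easily as to position 1.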
Large Language Models (LLMs)
LLMs are Transformers trained on massive amounts of text. They learn to predict the next token, which surprisingly leads to emergent capabilities. Major model families include:
- GPT (OpenAI): GPT-3, GPT-4, o1
- Claude (Anthropic): Claude 3 Opus, Sonnet, Haiku
- Gemini (Google): Gemini Pro, Ultra, Nano
- Llama (Meta): Open-source models
How LLMs Generate Text
- Tokenize input into subword units
- Each token gets an embedding (dense vector)
- Multiple attention layers process tokens in context
- Output layer predicts probability distribution over vocabulary
- Sample from distribution to get next token
- Repeat until done
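The generation loop above can be sketched with a toy vocabulary. Here `fake_logits` is a stand-in for a real model's forward pass (which would run the attention layers from steps 2–4); everything else mirrors the sample-and-repeat loop:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["<eos>", "the", "cat", "sat", "on", "mat"]  # toy vocabulary

def fake_logits(tokens):
    # Stand-in for a real model: returns one score per vocabulary entry.
    return rng.normal(size=len(vocab))

def generate(prompt_tokens, max_new_tokens=5, temperature=1.0):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        logits = fake_logits(tokens) / temperature   # scores over the vocabulary
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()                         # softmax -> probability distribution
        next_id = rng.choice(len(vocab), p=probs)    # sample the next token
        tokens.append(vocab[next_id])
        if vocab[next_id] == "<eos>":                # stop when the model says "done"
            break
    return tokens

print(generate(["the", "cat"]))
```

Lowering `temperature` sharpens the distribution toward the highest-scoring token (greedy-like output); raising it flattens the distribution and makes sampling more diverse.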
Example
```python
# Using the transformers library (Hugging Face)
# pip install transformers torch
from transformers import pipeline

# Quick usage with a pipeline
classifier = pipeline("sentiment-analysis")
result = classifier("I love building AI applications!")
print(result)  # [{'label': 'POSITIVE', 'score': 0.9998...}]

# Text generation
generator = pipeline("text-generation", model="gpt2")
output = generator(
    "The future of artificial intelligence",
    max_new_tokens=50,
    num_return_sequences=1,
)
print(output[0]["generated_text"])

# Understanding tokenization
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
text = "Hello, how are you doing today?"

# Tokenize
tokens = tokenizer.encode(text)
token_strings = tokenizer.convert_ids_to_tokens(tokens)
print("Tokens:", tokens)
print("Token strings:", token_strings)
print(f"Character count: {len(text)}")
print(f"Token count: {len(tokens)}")
# Note: tokens != words. "doing" might be one token, "today" another

# Embedding similarity (semantic similarity)
from transformers import AutoModel
import torch
import torch.nn.functional as F

def mean_pooling(model_output, attention_mask):
    # Average token embeddings into one sentence vector, ignoring padding
    token_embeddings = model_output[0]
    input_mask_expanded = attention_mask.unsqueeze(-1).float().expand(token_embeddings.size())
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(
        input_mask_expanded.sum(1), min=1e-9
    )
```
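The `mean_pooling` helper averages token embeddings into a single sentence vector while masking out padding. A quick self-contained check on toy tensors (illustrative values, not real model output) shows that padding positions are excluded from the mean:

```python
import torch

def mean_pooling(model_output, attention_mask):
    # Same pooling as above: masked average over the token dimension
    token_embeddings = model_output[0]
    mask = attention_mask.unsqueeze(-1).float().expand(token_embeddings.size())
    return torch.sum(token_embeddings * mask, 1) / torch.clamp(mask.sum(1), min=1e-9)

# Batch of 1 sequence, 3 tokens, embedding dim 2; the last token is padding.
emb = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = torch.tensor([[1, 1, 0]])
pooled = mean_pooling((emb,), mask)
print(pooled)  # tensor([[2., 3.]]) — the padding token's [100, 100] is ignored
```

In practice you would feed real model output (e.g. from `AutoModel`) through this pooling, then compare sentence vectors with cosine similarity to measure semantic closeness.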