Modern AI

Transformers and LLMs

Understand the Transformer architecture that powers GPT, Claude, and virtually all modern LLMs.

The Transformer Architecture

The Transformer was introduced in 2017 in the paper "Attention Is All You Need" by Vaswani et al. at Google. It revolutionized NLP and became the foundation of virtually every modern LLM.

Key Innovation: Self-Attention

Self-attention allows each word to "attend" to every other word in the sequence, capturing long-range dependencies that RNNs struggled with.
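The core computation is scaled dot-product attention: each token is projected into query, key, and value vectors, and attention weights come from a softmax over query-key dot products. Here is a minimal NumPy sketch with made-up dimensions and random weight matrices, purely for illustration:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence of token vectors X."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # how strongly each token attends to every other
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax: rows sum to 1
    return weights @ V  # each output is a weighted mix of all value vectors

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))  # 4 tokens, 8-dimensional embeddings (toy sizes)
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (4, 8): every token's output draws on all 4 positions
```

Because every token attends to every other in a single matrix multiply, information can flow between distant positions in one step, instead of propagating step by step as in an RNN.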

Large Language Models (LLMs)

LLMs are Transformers trained on massive amounts of text. They learn to predict the next token, and this simple objective, at sufficient scale, leads to emergent capabilities. Major model families include:

  • GPT (OpenAI): GPT-3, GPT-4, o1
  • Claude (Anthropic): Claude 3 Opus, Sonnet, Haiku
  • Gemini (Google): Gemini Pro, Ultra, Nano
  • Llama (Meta): Open-source models

How LLMs Generate Text

  1. Tokenize input into subword units
  2. Each token gets an embedding (dense vector)
  3. Multiple attention layers process tokens in context
  4. Output layer predicts probability distribution over vocabulary
  5. Sample from distribution to get next token
  6. Repeat until done
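The loop in steps 4-6 can be sketched without a real model. The toy "model" below just returns deterministic pseudo-random logits for each context (a stand-in for a Transformer's output layer, not a real network); the vocabulary and temperature value are likewise invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["<eos>", "the", "future", "of", "AI", "is", "bright"]
V = len(vocab)

def toy_model(token_ids):
    """Stand-in for a Transformer: maps a token sequence to next-token logits."""
    ctx_rng = np.random.default_rng(sum(token_ids))  # deterministic per context
    return ctx_rng.normal(size=V)

def sample_next(logits, temperature=1.0):
    """Step 5: turn logits into a probability distribution, then sample from it."""
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return rng.choice(V, p=probs)

tokens = [1, 2]  # start from the prompt "the future"
for _ in range(10):  # step 6: repeat until done
    next_id = sample_next(toy_model(tokens), temperature=0.8)
    tokens.append(next_id)
    if next_id == 0:  # stop at the end-of-sequence token
        break
print(" ".join(vocab[t] for t in tokens))
```

Lower temperatures sharpen the distribution toward the highest-probability token (more deterministic output); higher temperatures flatten it (more varied output).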

Example

python
# Using transformers library (Hugging Face)
# pip install transformers torch

from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM

# Quick usage with pipeline
classifier = pipeline("sentiment-analysis")
result = classifier("I love building AI applications!")
print(result)  # [{'label': 'POSITIVE', 'score': 0.9998...}]

# Text generation
generator = pipeline("text-generation", model="gpt2")
output = generator(
    "The future of artificial intelligence",
    max_new_tokens=50,
    num_return_sequences=1
)
print(output[0]['generated_text'])

# Understanding tokenization
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
text = "Hello, how are you doing today?"

# Tokenize
tokens = tokenizer.encode(text)
token_strings = tokenizer.convert_ids_to_tokens(tokens)

print("Tokens:", tokens)
print("Token strings:", token_strings)
print(f"Character count: {len(text)}")
print(f"Token count: {len(tokens)}")
# Note: tokens != words. "doing" might be one token, "today" another

# Embedding similarity (semantic similarity)
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

def mean_pooling(model_output, attention_mask):
    # Average token embeddings, ignoring padding positions
    token_embeddings = model_output[0]
    input_mask_expanded = attention_mask.unsqueeze(-1).float().expand(token_embeddings.size())
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# One common choice of sentence-embedding model; any similar model works
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

sentences = ["The cat sat on the mat.", "A feline rested on the rug."]
encoded = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    model_output = model(**encoded)

# Pool, normalize, and compare with cosine similarity
embeddings = F.normalize(mean_pooling(model_output, encoded["attention_mask"]), p=2, dim=1)
similarity = (embeddings[0] @ embeddings[1]).item()
print(f"Cosine similarity: {similarity:.3f}")  # higher for semantically similar sentences