# Vector Databases Introduction
Learn what vector databases are and why they are essential for modern AI applications.
## What are Vector Databases?
A vector database is a type of database optimized for storing and querying high-dimensional vectors (embeddings). Unlike traditional databases that search by exact match or range, vector databases find the most semantically similar vectors using approximate nearest neighbor (ANN) algorithms.
## Why Vector Databases?
With the rise of AI and LLMs, embedding-based search has become fundamental:
- Text embeddings (sentence meaning)
- Image embeddings (visual similarity)
- Audio embeddings (sound similarity)
- Code embeddings (code functionality)
Traditional databases are not optimized for this. A PostgreSQL table holding 1 million 1536-dimensional vectors would require a full sequential scan for every similarity query, comparing against all one million rows, unless a specialized vector index (such as pgvector's HNSW or IVFFlat) is added.
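To make the scaling problem concrete, here is a minimal sketch (with randomly generated stand-in embeddings) of brute-force nearest-neighbor search: without an index, every query has to compute one similarity score per stored vector, O(n·d) work per query. ANN indexes like HNSW exist precisely to avoid this full scan.

```python
import numpy as np

rng = np.random.default_rng(0)

# 100,000 stored embeddings, 128 dimensions, unit-normalized
vectors = rng.normal(size=(100_000, 128))
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

query = rng.normal(size=128)
query /= np.linalg.norm(query)

# Brute-force k-NN: one dot product per stored vector (O(n * d) per query)
scores = vectors @ query
top5 = np.argsort(scores)[-5:][::-1]  # indices of the 5 most similar vectors
print(top5, scores[top5])
```

With unit vectors, the dot product equals cosine similarity, so sorting by it gives the cosine nearest neighbors directly.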
## Key Concepts
- Embedding: A dense vector representation of data
- Similarity search: Find vectors closest to a query vector
- ANN (Approximate Nearest Neighbor): Fast approximate search algorithms (HNSW, IVF)
- Distance metrics: Cosine similarity, Euclidean distance, dot product
- Metadata filtering: Filter by additional attributes alongside similarity
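The three distance metrics above are closely related: for unit-length vectors, cosine similarity and dot product are identical, and squared Euclidean distance is a simple function of cosine similarity, so all three produce the same ranking. A short sketch of this relationship:

```python
import numpy as np

rng = np.random.default_rng(1)
a, b = rng.normal(size=(2, 8))
a /= np.linalg.norm(a)  # normalize to unit length
b /= np.linalg.norm(b)

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
dot = np.dot(a, b)
euclidean = np.linalg.norm(a - b)

# For unit vectors: cosine == dot, and ||a - b||^2 == 2 - 2 * cosine
print(cosine, dot, euclidean)
assert np.isclose(cosine, dot)
assert np.isclose(euclidean ** 2, 2 - 2 * cosine)
```

This is why many vector databases normalize embeddings on insert and then use whichever metric their index computes fastest.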
## Popular Vector Databases
| Database | Type | Best For |
|---|---|---|
| Pinecone | Managed | Ease of use, scale |
| Chroma | Open source | Development |
| Weaviate | Open source | Multi-modal |
| Qdrant | Open source | Performance |
| pgvector | PostgreSQL extension | Existing Postgres users |
| Milvus | Open source | Enterprise scale |
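The databases above differ in API details, but the core operations are the same everywhere: add vectors with metadata, then query by similarity with optional filters. The toy in-memory store below illustrates those operations; it is not any real database's API, and the class and method names are invented for this sketch.

```python
import numpy as np

class ToyVectorStore:
    """Illustrative in-memory store: brute-force cosine search plus metadata filtering."""

    def __init__(self, dim):
        self.dim = dim
        self.vectors = []   # unit-normalized embeddings
        self.metadata = []  # parallel list of metadata dicts

    def add(self, vector, meta):
        v = np.asarray(vector, dtype=float)
        self.vectors.append(v / np.linalg.norm(v))
        self.metadata.append(meta)

    def query(self, vector, k=3, where=None):
        q = np.asarray(vector, dtype=float)
        q = q / np.linalg.norm(q)
        results = []
        for v, meta in zip(self.vectors, self.metadata):
            # Metadata filtering alongside similarity search
            if where and any(meta.get(key) != val for key, val in where.items()):
                continue
            results.append((float(np.dot(q, v)), meta))
        results.sort(key=lambda r: r[0], reverse=True)
        return results[:k]

store = ToyVectorStore(dim=4)
store.add([0.8, 0.2, 0.1, 0.6], {"text": "dog"})
store.add([0.7, 0.3, 0.1, 0.5], {"text": "puppy"})
store.add([0.1, 0.9, 0.8, 0.2], {"text": "airplane"})
print(store.query([0.8, 0.2, 0.1, 0.6], k=2))  # dog and puppy rank highest
```

A real database replaces the brute-force loop with an ANN index (HNSW, IVF) so queries stay fast at millions of vectors, but the add/query/filter interface looks much the same.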
## Example
```python
# Understanding embeddings and similarity
import numpy as np

# Embeddings are just vectors of numbers.
# Similar concepts have similar vectors.
text_a = np.array([0.8, 0.2, 0.1, 0.6])  # "dog"
text_b = np.array([0.7, 0.3, 0.1, 0.5])  # "puppy" (similar to dog)
text_c = np.array([0.1, 0.9, 0.8, 0.2])  # "airplane" (very different)

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(f"dog vs puppy: {cosine_similarity(text_a, text_b):.3f}")     # high
print(f"dog vs airplane: {cosine_similarity(text_a, text_c):.3f}")  # low

# In real applications, embeddings have 768-1536 dimensions
# and are produced by models like text-embedding-3-small.

# Other common distance metrics
def euclidean_distance(a, b):
    return np.linalg.norm(a - b)

def dot_product(a, b):
    return np.dot(a, b)

# Cosine similarity is most common for text
# (normalized, so vector magnitude doesn't matter).
```