Getting Started

Vector Databases Introduction

Learn what vector databases are and why they are essential for modern AI applications.

What are Vector Databases?

A vector database is a type of database optimized for storing and querying high-dimensional vectors (embeddings). Unlike traditional databases that search by exact match or range, vector databases find the most semantically similar vectors using approximate nearest neighbor (ANN) algorithms.

Why Vector Databases?

With the rise of AI and large language models (LLMs), embedding-based search has become fundamental across data types:

  • Text embeddings (sentence meaning)
  • Image embeddings (visual similarity)
  • Audio embeddings (sound similarity)
  • Code embeddings (code functionality)

Traditional databases are not optimized for this workload: without a vector index, a similarity query must brute-force compare the query against every stored vector. A PostgreSQL table with 1 million 1536-dimensional vectors would be extremely slow to search without a purpose-built index.
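To make the cost concrete, here is a minimal sketch of a brute-force similarity scan (illustrative sizes, smaller than the 1-million-row case above): every query touches every stored vector, so the work grows linearly with the table.

```python
import time
import numpy as np

rng = np.random.default_rng(0)
n, d = 100_000, 256  # illustrative scale; production tables are often far larger
vectors = rng.standard_normal((n, d)).astype(np.float32)
query = rng.standard_normal(d).astype(np.float32)

# Brute-force search: O(n * d) work per query, no index to prune candidates
start = time.perf_counter()
norms = np.linalg.norm(vectors, axis=1) * np.linalg.norm(query)
scores = vectors @ query / norms  # cosine similarity against every row
best = int(np.argmax(scores))
elapsed = time.perf_counter() - start

print(f"scanned {n} vectors in {elapsed * 1000:.1f} ms; best match: row {best}")
```

ANN indexes such as HNSW or IVF avoid this full scan by visiting only a small fraction of the vectors per query, trading a little recall for large speedups.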

Key Concepts

  • Embedding: A dense vector representation of data
  • Similarity search: Find vectors closest to a query vector
  • ANN (Approximate Nearest Neighbor): Fast approximate search algorithms (HNSW, IVF)
  • Distance metrics: Cosine similarity, Euclidean distance, dot product
  • Metadata filtering: Filter by additional attributes alongside similarity
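As a toy illustration of the last concept, metadata filtering (the documents and categories here are made up), a vector database can apply the filter first so that only matching vectors are ranked by similarity:

```python
import numpy as np

# Toy corpus: 5 document embeddings, each with a metadata field (hypothetical data)
embeddings = np.array([
    [0.9, 0.1, 0.0],   # doc 0, category "pets"
    [0.8, 0.2, 0.1],   # doc 1, category "pets"
    [0.1, 0.9, 0.3],   # doc 2, category "travel"
    [0.2, 0.8, 0.4],   # doc 3, category "travel"
    [0.7, 0.3, 0.2],   # doc 4, category "pets"
])
categories = np.array(["pets", "pets", "travel", "travel", "pets"])
query = np.array([0.85, 0.15, 0.05])

def cosine_similarity(matrix, q):
    return matrix @ q / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(q))

# Pre-filter by metadata, then rank only the surviving vectors by similarity
mask = categories == "pets"
candidate_ids = np.flatnonzero(mask)
scores = cosine_similarity(embeddings[mask], query)
top = candidate_ids[np.argsort(scores)[::-1]]

print("ranked 'pets' docs:", top.tolist())  # → [0, 1, 4]; travel docs never considered
```

Real engines make the same trade-off at index level: pre-filtering guarantees every result matches, while post-filtering can return fewer than the requested top-k.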

Popular Vector Databases

Database    Type                  Best For
Pinecone    Managed               Ease of use, scale
Chroma      Open source           Development
Weaviate    Open source           Multi-modal
Qdrant      Open source           Performance
pgvector    PostgreSQL extension  Existing Postgres users
Milvus      Open source           Enterprise scale

Example (Python)
# Understanding embeddings and similarity
import numpy as np

# Embeddings are just vectors of numbers
# Similar concepts have similar vectors
text_a = np.array([0.8, 0.2, 0.1, 0.6])  # "dog"
text_b = np.array([0.7, 0.3, 0.1, 0.5])  # "puppy" (similar to dog)
text_c = np.array([0.1, 0.9, 0.8, 0.2])  # "airplane" (very different)

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(f"dog vs puppy: {cosine_similarity(text_a, text_b):.3f}")   # high
print(f"dog vs airplane: {cosine_similarity(text_a, text_c):.3f}") # low

# In real applications, embeddings are 768-1536 dimensions
# and produced by models like text-embedding-3-small

# Distance metrics
def euclidean_distance(a, b):
    return np.linalg.norm(a - b)

def dot_product(a, b):
    return np.dot(a, b)

# Cosine similarity is most common for text
# (normalized, so vector magnitude doesn't matter)
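One practical consequence of that last comment: if embeddings are L2-normalized once at indexing time, the cheaper dot product gives exactly the same scores as cosine similarity at query time. A minimal check:

```python
import numpy as np

rng = np.random.default_rng(1)
a, b = rng.standard_normal(4), rng.standard_normal(4)

def cosine_similarity(x, y):
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

# Normalize once at indexing time...
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)

# ...then a plain dot product at query time equals cosine similarity
assert np.isclose(np.dot(a_unit, b_unit), cosine_similarity(a, b))
print("dot product on unit vectors == cosine similarity")
```

This is why many vector databases let you pick the dot-product metric for normalized embeddings: it skips the norm computation on every comparison.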