Building RAG
Vector Stores
Store and query embeddings efficiently with vector databases like Chroma and Pinecone.
What are Vector Stores?
A vector store (vector database) is optimized for storing and searching high-dimensional vectors (embeddings) using approximate nearest neighbor (ANN) algorithms.
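To see what a vector store optimizes, it helps to look at the exact version of the problem first: score the query vector against every stored vector and keep the top matches. The pure-Python sketch below does this brute-force O(n) scan with cosine similarity; ANN indexes (e.g. HNSW) exist to approximate this ranking without touching every vector. Names like `brute_force_search` and the toy 3-dimensional "embeddings" are illustrative, not from any library.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def brute_force_search(query, vectors, k=2):
    """Exact nearest-neighbor search: score every stored vector.

    Vector stores avoid this full scan by using ANN indexes such as HNSW.
    """
    scored = [(cosine_similarity(query, vec), vid) for vid, vec in vectors.items()]
    scored.sort(reverse=True)
    return scored[:k]

# Toy 3-dimensional "embeddings" keyed by document id
vectors = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.9, 0.1, 0.0],
    "doc_c": [0.0, 1.0, 0.0],
}
print(brute_force_search([1.0, 0.0, 0.0], vectors, k=2))
```

Real embeddings have hundreds or thousands of dimensions and collections have millions of vectors, which is why the exact scan becomes too slow and ANN trade-offs (speed vs. recall) come into play.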
Options
Open Source:
- ChromaDB: Simple, embedded, great for development
- Weaviate: Scalable, with built-in ML
- Qdrant: Rust-based, fast
- Milvus: Enterprise-grade, distributed
Managed Services:
- Pinecone: Serverless, easy to use
- Supabase pgvector: PostgreSQL extension
- Weaviate Cloud: Hosted Weaviate
Key Operations
- Upsert: Add or update vectors with metadata
- Query: Find most similar vectors to a query vector
- Filter: Filter results by metadata
- Delete: Remove vectors
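The four operations above can be sketched as a toy in-memory store. This is a teaching sketch only (the class name and method signatures are invented here, not any real database's API): real vector stores add ANN indexing, persistence, and distribution on top of this same interface.

```python
import math

class InMemoryVectorStore:
    """Toy store illustrating upsert, query, filter, and delete."""

    def __init__(self):
        self._records = {}  # id -> (vector, metadata)

    def upsert(self, id, vector, metadata=None):
        # Add a new vector, or overwrite an existing one with the same id.
        self._records[id] = (vector, metadata or {})

    def delete(self, id):
        # Remove a vector; missing ids are ignored.
        self._records.pop(id, None)

    def query(self, vector, k=3, where=None):
        # Optionally filter by metadata, then rank by cosine similarity.
        def sim(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            return dot / (math.sqrt(sum(x * x for x in a)) *
                          math.sqrt(sum(x * x for x in b)))

        hits = []
        for rid, (vec, meta) in self._records.items():
            if where and any(meta.get(key) != val for key, val in where.items()):
                continue  # metadata filter excludes this record
            hits.append((sim(vector, vec), rid, meta))
        hits.sort(key=lambda h: h[0], reverse=True)
        return hits[:k]

store = InMemoryVectorStore()
store.upsert("a", [1.0, 0.0], {"source": "docs"})
store.upsert("b", [0.0, 1.0], {"source": "blog"})
results = store.query([1.0, 0.1], k=2, where={"source": "docs"})
print(results)  # only "a" passes the metadata filter
```

Note that filtering happens before ranking, so `k` counts matches after the filter; production stores make the same choice (pre-filtering) or interleave filtering with the ANN traversal.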
Choosing a Vector Store
For development: ChromaDB or in-memory
For production: Pinecone, Supabase pgvector, or Weaviate Cloud
Example
python
# Using ChromaDB (simple, embedded)
# pip install chromadb
import chromadb
from chromadb.utils import embedding_functions
# Initialize ChromaDB
client = chromadb.Client()
# For persistence: chromadb.PersistentClient(path="./chroma_db")
# Use OpenAI embeddings (or any embedding function)
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key="your-api-key",
    model_name="text-embedding-3-small"
)
# Create a collection
collection = client.create_collection(
    name="documentation",
    embedding_function=openai_ef,
    metadata={"hnsw:space": "cosine"}  # distance metric
)
# Add documents
documents = [
    "RAG stands for Retrieval Augmented Generation",
    "Vector databases store high-dimensional embeddings",
    "ChromaDB is an open-source embedding database",
    "Semantic search finds documents by meaning",
]
collection.add(
    documents=documents,
    ids=[f"doc_{i}" for i in range(len(documents))],
    metadatas=[{"source": "docs", "section": f"section_{i}"} for i in range(len(documents))]
)
# Query (find most similar documents)
results = collection.query(
    query_texts=["What is vector search?"],
    n_results=3,
    where={"source": "docs"},  # optional metadata filter
    include=["documents", "distances", "metadatas"]
)
for doc, distance, metadata in zip(
    results['documents'][0],
    results['distances'][0],
    results['metadatas'][0]
):
    print(f"Distance: {distance:.3f} | {doc[:60]}...")
# Update metadata
collection.update(
    ids=["doc_0"],
    metadatas=[{"source": "docs", "version": "2.0"}]
)
# Get collection stats
print(f"Total documents: {collection.count()}")