Production Best Practices
Optimize your vector database for performance, accuracy, and production reliability.
Choosing the Right Embedding Model
The embedding model has the biggest impact on retrieval quality:
- OpenAI text-embedding-3-small: Good quality, cheap (~$0.02/1M tokens)
- OpenAI text-embedding-3-large: Higher quality, ~6x the price (~$0.13/1M tokens)
- sentence-transformers/all-MiniLM-L6-v2: Free, runs locally, good for most use cases
- BGE models: Strong performance, free, good multilingual support
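One practical consequence of the choice: each model produces vectors of a fixed dimension, and your index must be created with a matching dimension, so switching models means re-embedding and re-indexing everything. A minimal lookup (the dimensions below are the models' published output sizes; `index_dimension` is an illustrative helper, not a library API):

```python
# Output dimension per embedding model. The vector index must be created
# with a matching dimension, so changing models requires a full re-index.
EMBEDDING_DIMS = {
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
    "sentence-transformers/all-MiniLM-L6-v2": 384,
    "BAAI/bge-base-en-v1.5": 768,
}

def index_dimension(model: str) -> int:
    """Look up the index dimension required for a given embedding model."""
    return EMBEDDING_DIMS[model]
```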
Optimizing Performance
- Indexing: Use HNSW for best query performance
- Batch operations: Upsert in batches (100-1000 vectors at once)
- Caching: Cache embeddings for repeated content
- Reranking: Use a cross-encoder to rerank top-k results for accuracy
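The caching point can be sketched as a content-keyed lookup placed in front of the embedding call. This is a minimal in-memory sketch; `fake_embed` is a stand-in for a real embedding API, and in production you would back the cache with something persistent like Redis or a database table:

```python
import hashlib

class EmbeddingCache:
    """Cache embeddings by content hash so repeated text is embedded only once."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn  # callable: str -> list[float]
        self.cache: dict[str, list[float]] = {}
        self.misses = 0  # number of actual embedding calls made

    def get(self, text: str) -> list[float]:
        key = hashlib.md5(text.encode()).hexdigest()
        if key not in self.cache:
            self.misses += 1
            self.cache[key] = self.embed_fn(text)  # only on a cache miss
        return self.cache[key]

# Stand-in embedding function for illustration (a real one would call an API)
fake_embed = lambda text: [float(len(text))]
cache = EmbeddingCache(fake_embed)
cache.get("hello")
cache.get("hello")  # second call is served from cache, no embedding call
```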
Metadata Design
Design metadata carefully — you can filter by it at query time:
- Include source, date, type, author, section
- Normalize values (lowercase, consistent formats)
- Don't include large text blobs in metadata
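The normalization advice can be sketched as a small helper that lowercases strings, converts dates to a consistent ISO format, and drops large blobs before upserting. The `full_text` key here is a hypothetical example of a blob to exclude; store a pointer to the text instead:

```python
from datetime import date

def normalize_metadata(raw: dict) -> dict:
    """Normalize metadata values for consistent query-time filtering."""
    meta = {}
    for key, value in raw.items():
        if isinstance(value, str):
            value = value.strip().lower()  # consistent casing and no stray whitespace
        elif isinstance(value, date):
            value = value.isoformat()      # consistent "YYYY-MM-DD" format
        meta[key] = value
    # Keep metadata small: drop large text blobs (hypothetical key for illustration)
    meta.pop("full_text", None)
    return meta

meta = normalize_metadata({
    "source": "  Docs/FAQ.md ",
    "author": "Jane Doe",
    "date": date(2024, 5, 1),
    "full_text": "...a large document body that should not live in metadata...",
})
```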
Handling Updates
Vectors are identified by ID:
- Use content hash as ID to detect duplicates
- Upsert (insert or update) by ID
- Delete old vectors when documents are removed
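The upsert/delete lifecycle can be sketched against an in-memory stand-in for a vector store (`FakeStore` below mimics the upsert-by-ID and delete-by-ID operations most stores expose; it is not a real client):

```python
class FakeStore:
    """In-memory stand-in for a vector store's upsert/delete API."""

    def __init__(self):
        self.vectors: dict[str, str] = {}  # id -> document text

    def upsert(self, doc_id: str, text: str) -> None:
        self.vectors[doc_id] = text        # insert or overwrite by ID

    def delete(self, doc_id: str) -> None:
        self.vectors.pop(doc_id, None)

def sync(store: FakeStore, documents: dict[str, str]) -> None:
    """Upsert current documents and delete vectors whose documents were removed."""
    current_ids = set(documents)
    for doc_id in set(store.vectors) - current_ids:
        store.delete(doc_id)               # document no longer exists upstream
    for doc_id, text in documents.items():
        store.upsert(doc_id, text)         # insert new, overwrite changed
```

Running `sync` on every ingest pass keeps the store consistent with the source corpus without tracking individual change events.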
Example
```python
import hashlib
import json

# Best practice: use a content hash as the vector ID, so re-ingesting the
# same document produces the same ID (natural deduplication on upsert).
def content_hash(content: str, metadata: dict | None = None) -> str:
    """Generate a stable ID based on content (and optional metadata)."""
    data = content
    if metadata:
        data += json.dumps(metadata, sort_keys=True)
    return hashlib.md5(data.encode()).hexdigest()

# Batch embedding with retry on rate limits
from openai import OpenAI, RateLimitError
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

openai_client = OpenAI()

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=4, max=10),
    retry=retry_if_exception_type(RateLimitError),
)
def embed_batch(texts: list[str], batch_size: int = 100) -> list[list[float]]:
    """Embed texts in batches with retry logic."""
    all_embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        response = openai_client.embeddings.create(
            model="text-embedding-3-small",
            input=batch,
        )
        all_embeddings.extend([item.embedding for item in response.data])
    return all_embeddings

# Reranking with an LLM (a lightweight alternative to a dedicated cross-encoder)
def rerank(query: str, documents: list[str], top_k: int = 3) -> list[dict]:
    """Use an LLM to rerank documents for better precision."""
    from anthropic import Anthropic

    client = Anthropic()
    prompt = f"""Rate the relevance of each document to the query.

Query: {query}

Documents:
{chr(10).join(f"{i+1}. {doc}" for i, doc in enumerate(documents))}

Return a JSON array of document numbers ordered by relevance (most relevant first).
Example: [3, 1, 2]
Return ONLY the JSON array."""
    response = client.messages.create(
        model="claude-3-5-haiku-20241022",
        max_tokens=100,
        messages=[{"role": "user", "content": prompt}],
    )
    try:
        ranking = json.loads(response.content[0].text)
        return [{"rank": r, "document": documents[r - 1]} for r in ranking[:top_k]]
    except Exception:
        # Fall back to the original order if the model's output isn't valid JSON
        return [{"rank": i + 1, "document": doc} for i, doc in enumerate(documents[:top_k])]
```