Production

Best Practices

Optimize your vector database for performance, accuracy, and production reliability.

Choosing the Right Embedding Model

The embedding model has the biggest impact on retrieval quality:

  • OpenAI text-embedding-3-small: Good quality, cheap (~$0.02/1M tokens)
  • OpenAI text-embedding-3-large: Higher quality, more expensive (~$0.13/1M tokens)
  • sentence-transformers/all-MiniLM-L6-v2: Free, runs locally, good for most use cases
  • BGE models: Strong performance, free, good multilingual support

Optimizing Performance

  • Indexing: Use HNSW for best query performance
  • Batch operations: Upsert in batches (100-1000 vectors at once)
  • Caching: Cache embeddings for repeated content
  • Reranking: Use a cross-encoder to rerank top-k results for accuracy

Metadata Design

Design metadata carefully — you can filter by it at query time:

  • Include source, date, type, author, section
  • Normalize values (lowercase, consistent formats)
  • Don't include large text blobs in metadata
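The normalization bullet matters because metadata filters do exact matching: "Blog" and "blog" are different values. A minimal sketch of a normalization pass (the field names are illustrative):

```python
from datetime import datetime

def normalize_metadata(meta: dict) -> dict:
    """Normalize metadata values so query-time filters match reliably."""
    out = {}
    for key, value in meta.items():
        if isinstance(value, str):
            value = value.strip().lower()       # consistent casing
        elif isinstance(value, datetime):
            value = value.date().isoformat()    # consistent date format
        out[key] = value
    return out
```

Run every document's metadata through the same function at ingest time, and apply it to filter values at query time too, so both sides agree.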

Handling Updates

Vectors are identified by ID:

  • Use content hash as ID to detect duplicates
  • Upsert (insert or update) by ID
  • Delete old vectors when documents are removed
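The three bullets above can be sketched together. A plain dictionary stands in for the vector store here — real stores expose similar upsert and delete-by-ID operations — and the vector values are placeholders:

```python
import hashlib

# In-memory stand-in for a vector store, mapping ID -> record.
index: dict[str, dict] = {}

def doc_id(content: str) -> str:
    """Stable ID from content, so re-ingesting unchanged text hits the same ID."""
    return hashlib.md5(content.encode()).hexdigest()

def upsert(content: str, vector: list[float], metadata: dict) -> str:
    _id = doc_id(content)
    index[_id] = {"vector": vector, "metadata": metadata}  # insert or overwrite
    return _id

def remove_document(content: str) -> bool:
    """Delete the vector when its source document is removed."""
    return index.pop(doc_id(content), None) is not None
```

Because the ID is derived from content, upserting the same document twice overwrites one record instead of creating a duplicate, and deletion only needs the original content (or a stored ID) to find the right vector.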

Example

python
import hashlib
import json

# Best practice: Use content hash as ID
def content_hash(content: str, metadata: dict | None = None) -> str:
    """Generate a stable ID based on content."""
    data = content
    if metadata:
        data += json.dumps(metadata, sort_keys=True)
    return hashlib.md5(data.encode()).hexdigest()

# Batch embedding with retry
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential
from openai import OpenAI, RateLimitError

openai_client = OpenAI()

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=4, max=10),
    retry=retry_if_exception_type(RateLimitError),
)
def embed_batch(texts: list[str], batch_size: int = 100) -> list[list[float]]:
    """Embed texts in batches with retry logic."""
    all_embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        response = openai_client.embeddings.create(
            model="text-embedding-3-small",
            input=batch
        )
        all_embeddings.extend([item.embedding for item in response.data])
    return all_embeddings

# Reranking with an LLM judge (a cross-encoder model works similarly)
def rerank(query: str, documents: list[str], top_k: int = 3) -> list[dict]:
    """Use an LLM to rerank retrieved documents for better precision."""
    from anthropic import Anthropic
    client = Anthropic()

    prompt = f"""Rate the relevance of each document to the query.
Query: {query}

Documents:
{chr(10).join(f"{i+1}. {doc}" for i, doc in enumerate(documents))}

Return a JSON array of document numbers ordered by relevance (most relevant first).
Example: [3, 1, 2]
Return ONLY the JSON array."""

    response = client.messages.create(
        model="claude-3-5-haiku-20241022",
        max_tokens=100,
        messages=[{"role": "user", "content": prompt}]
    )

    try:
        ranking = json.loads(response.content[0].text)
        return [{"rank": r, "document": documents[r-1]} for r in ranking[:top_k]]
    except Exception:
        return [{"rank": i+1, "document": doc} for i, doc in enumerate(documents[:top_k])]