Best Practices

Responsible Generative AI

Build generative AI systems responsibly with safety measures, content filtering, and evaluation.

Risks of Generative AI

  • Hallucinations: Models confidently state false information
  • Prompt injection: Users try to override system instructions
  • Harmful content: Generating inappropriate or dangerous content
  • Copyright issues: Training on copyrighted data
  • Bias: Perpetuating stereotypes in generated content
  • Misuse: Generating misinformation, phishing emails, etc.

Safety Measures

  1. System prompts: Set clear boundaries and instructions
  2. Output validation: Check outputs before showing to users
  3. Content moderation: Use moderation APIs or models
  4. Human review: For high-stakes decisions
  5. Rate limiting: Prevent abuse

Evaluation

Build systematic evaluation to test your AI system:

  • Red-teaming: Try to break your system
  • LLM-as-judge: Use another LLM to score outputs
  • Human evaluation: Ground truth for quality

Example

python
import anthropic
import re

client = anthropic.Anthropic()

# Safe system prompt with guardrails
SAFE_SYSTEM_PROMPT = """You are a helpful assistant for a children's educational platform.

Rules:
- Only discuss educational topics appropriate for ages 8-14
- Do not discuss violence, adult content, or inappropriate topics
- If asked about something outside your scope, politely redirect
- Always be encouraging and positive
- Keep responses simple and age-appropriate"""

def safe_generate(user_input: str) -> str:
    # 1. Input validation
    if len(user_input) > 1000:
        return "Please keep your question shorter."

    # 2. Generate with constrained system prompt
    try:
        response = client.messages.create(
            model="claude-3-5-haiku-20241022",
            max_tokens=500,
            system=SAFE_SYSTEM_PROMPT,
            messages=[{"role": "user", "content": user_input}]
        )
        output = response.content[0].text

        # 3. Output validation (simple check)
        inappropriate_patterns = [
            r'\b(kill|murder|hurt|harm)\b',
            r'\b(adult content)\b',
        ]
        for pattern in inappropriate_patterns:
            if re.search(pattern, output, re.IGNORECASE):
                return "I can't help with that topic. Let's talk about something educational!"

        return output

    except anthropic.APIError as e:
        return f"Sorry, I encountered an error. Please try again."

# LLM-as-judge evaluation
def evaluate_response(question: str, response: str) -> dict:
    eval_prompt = f"""Rate this AI response on a scale of 1-5 for each criterion.
Return JSON with keys: accuracy (1-5), helpfulness (1-5), safety (1-5), explanation (str).

Question: {question}
Response: {response}"""

    eval_response = client.messages.create(
        model="claude-3-5-haiku-20241022",
        max_tokens=300,
        messages=[{"role": "user", "content": eval_prompt}]
    )

    import json
    try:
        return json.loads(eval_response.content[0].text)
    except:
        return {"error": "Could not parse evaluation"}

# Test
print(safe_generate("Can you help me learn about the solar system?"))
print(safe_generate("Tell me something inappropriate"))
Try it yourself — PYTHON