AI-Assisted Testing & Tools

Testing AI Features: LLM Outputs and Prompts

Test non-deterministic AI features using mocking, contract testing, behavioral assertions, and evaluation strategies.

The AI Testing Challenge

AI features present testing challenges that don't exist for deterministic code:

LLM outputs are non-deterministic — the same prompt may produce different responses each time
Quality is subjective — "good" and "bad" responses are hard to define programmatically
Calling real AI APIs in tests is slow and expensive
Streaming responses arrive in chunks that must be parsed and reassembled, so they need their own test setup

Four strategies address these challenges: mock, contract, behavioral, and evaluation testing.

Strategy 1: Mock the AI

For unit and integration tests, replace the LLM call with a deterministic mock:

typescript

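import { expect, it, vi } from 'vitest';
import { render, screen } from '@testing-library/react';
import userEvent from '@testing-library/user-event';
import '@testing-library/jest-dom/vitest'; // registers toBeInTheDocument and related matchers
import { ChatInterface } from '@/components/ChatInterface'; // assumed path; adjust to your project

// vi.mock calls are hoisted, so the component imports the mocked module
vi.mock('@/lib/ai', () => ({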
  generateResponse: vi.fn().mockResolvedValue({
    content: 'This is a test response from the AI.',
    usage: { inputTokens: 50, outputTokens: 20 },
  }),
}));

it('displays the AI response in the chat', async () => {
  render(<ChatInterface />);
  await userEvent.type(screen.getByRole('textbox'), 'Hello AI');
  await userEvent.click(screen.getByRole('button', { name: 'Send' }));

  const response = await screen.findByText('This is a test response from the AI.');
  expect(response).toBeInTheDocument();
});

This tests your application code without calling the real AI API. Fast, free, and deterministic.

Strategy 2: Contract Testing

Verify your code correctly handles the expected response format, regardless of content:

typescript

it('handles the OpenAI response format correctly', async () => {
  const mockResponse = {
    id: 'chatcmpl-123',
    choices: [
      {
        message: { role: 'assistant', content: 'Hello!' },
        finish_reason: 'stop',
      },
    ],
    usage: { prompt_tokens: 10, completion_tokens: 5, total_tokens: 15 },
  };

  // Stub the global fetch so no network call is made
  vi.stubGlobal(
    'fetch',
    vi.fn().mockResolvedValue(new Response(JSON.stringify(mockResponse), { status: 200 }))
  );

  const result = await callChatAPI('Hello');
  expect(result.content).toBe('Hello!');
  expect(result.tokensUsed).toBe(15);
});
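
The contract test above treats `callChatAPI` as a black box. One way such a function might normalize the provider payload is sketched below. `parseChatCompletion` and the `ChatResult` shape are hypothetical names chosen to match the fields asserted in the test, not the actual implementation. The sketch throws on any missing field, so a change in the provider's format surfaces as a clear error rather than as `undefined` in the UI.

```typescript
// Hypothetical normalized shape, mirroring the fields asserted in the contract test.
interface ChatResult {
  content: string;
  tokensUsed: number;
}

// Narrow an unknown JSON payload into ChatResult, failing loudly on contract drift.
function parseChatCompletion(payload: unknown): ChatResult {
  const body = payload as {
    choices?: Array<{ message?: { content?: unknown } }>;
    usage?: { total_tokens?: unknown };
  } | null;

  const content = body?.choices?.[0]?.message?.content;
  if (typeof content !== 'string') {
    throw new Error('Contract violation: choices[0].message.content is not a string');
  }

  const tokensUsed = body?.usage?.total_tokens;
  if (typeof tokensUsed !== 'number') {
    throw new Error('Contract violation: usage.total_tokens is not a number');
  }

  return { content, tokensUsed };
}
```

A second contract test can then pass a payload with an empty `choices` array and assert that the call rejects.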

Strategy 3: Behavioral Testing

Test that your system behaves correctly given any plausible AI response:

typescript

// Assumes '@/lib/ai' is mocked with vi.mock, as in Strategy 1
import * as ai from '@/lib/ai';

it('handles very long AI responses', async () => {
  vi.mocked(ai.generateResponse).mockResolvedValue({
    content: 'A'.repeat(10000), // Extremely long response
  });

  render(<ChatInterface />);
  await userEvent.type(screen.getByRole('textbox'), 'Tell me everything');
  await userEvent.click(screen.getByRole('button', { name: 'Send' }));

  // Application should handle long responses gracefully
  const responseArea = await screen.findByRole('article');
  expect(responseArea).toBeVisible();
  expect(responseArea).not.toHaveStyle('overflow: hidden');
});

it('shows error message when AI API fails', async () => {
  vi.mocked(ai.generateResponse).mockRejectedValue(new Error('API unavailable'));

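  render(<ChatInterface />);
  // Type a message first so the send action actually triggers a request
  await userEvent.type(screen.getByRole('textbox'), 'Hello AI');
  await userEvent.click(screen.getByRole('button', { name: 'Send' }));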

  const error = await screen.findByRole('alert');
  expect(error).toHaveTextContent('Unable to get response');
});
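
The tests above cover long and failing responses. Empty output is another plausible case. A display-side guard along the following lines gives behavioral tests something concrete to assert against. `toDisplayContent`, the fallback message, and the 4000-character default are all illustrative assumptions, not part of the application shown here.

```typescript
// Hypothetical display-side guard; names and the 4000-character default are illustrative.
interface DisplayContent {
  text: string;
  truncated: boolean;
}

const EMPTY_FALLBACK = 'The assistant returned no content. Please try again.';

function toDisplayContent(raw: string | null | undefined, maxLength = 4000): DisplayContent {
  const trimmed = (raw ?? '').trim();
  if (trimmed.length === 0) {
    // Empty or whitespace-only output gets a visible fallback instead of a blank bubble.
    return { text: EMPTY_FALLBACK, truncated: false };
  }
  if (trimmed.length > maxLength) {
    // Cap very long output so the UI can offer a "show more" control.
    return { text: trimmed.slice(0, maxLength), truncated: true };
  }
  return { text: trimmed, truncated: false };
}
```

With a guard like this in place, a test can mock `generateResponse` to resolve with `{ content: '' }` and assert that the fallback text is rendered.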

Strategy 4: Evaluation Testing (Prompt Regression)

For testing actual AI quality, maintain a test set of prompts with expected output characteristics:

typescript

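// This suite calls the real AI API, so it is skipped unless RUN_AI_EVAL is set.
// RUN_AI_EVAL is an assumed variable name; have the test:eval script set it.
describe.skipIf(!process.env.RUN_AI_EVAL)('AI Quality Evaluation (run with: npm run test:eval)', () => {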
  const evalCases = [
    {
      input: 'Explain React hooks in one sentence',
      checks: [
        (response: string) => response.length > 20 && response.length < 200,
        (response: string) => response.toLowerCase().includes('function'),
        (response: string) => !/<script/i.test(response), // No <script> tags (a crude check, not full XSS protection)
      ],
    },
    {
      input: 'What is 2 + 2?',
      checks: [
        (response: string) => response.includes('4'),
      ],
    },
  ];

  for (const { input, checks } of evalCases) {
    it(`handles: "${input}"`, async () => {
      const response = await callRealAI(input);
      for (const check of checks) {
        expect(check(response)).toBe(true);
      }
    });
  }
});
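
Because the same prompt can produce different responses, a single sample per case can make an evaluation test flaky. One option is to sample each prompt several times and assert on the fraction that passes. The helper below is a minimal sketch of that idea. `passRate` is a hypothetical name, and the sample strings stand in for real API output.

```typescript
type Check = (response: string) => boolean;

// Fraction of sampled responses that satisfy every check (0 for an empty sample set).
function passRate(responses: string[], checks: Check[]): number {
  if (responses.length === 0) return 0;
  const passed = responses.filter((r) => checks.every((check) => check(r))).length;
  return passed / responses.length;
}

// Example: three sampled answers to "What is 2 + 2?", one of which omits the digit.
const samples = ['2 + 2 = 4', 'The answer is 4.', 'It equals four.'];
const rate = passRate(samples, [(r) => r.includes('4')]);
```

Here `rate` is 2/3, because the third sample spells out the number. An evaluation test could then assert `expect(rate).toBeGreaterThanOrEqual(threshold)`, with the threshold chosen per case. That failing sample also shows how literal string checks can reject a correct answer.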

Testing Streaming Responses

Feed your streaming client a hand-built ReadableStream so the test covers chunk parsing and reassembly without a network call:

typescript

it('assembles streaming response correctly', async () => {
  const chunks = ['Hello', ', ', 'how', ' can', ' I', ' help?'];
  const encoder = new TextEncoder();

  const stream = new ReadableStream({
    start(controller) {
      for (const chunk of chunks) {
        const data = `data: ${JSON.stringify({ delta: chunk })}\n\n`;
        controller.enqueue(encoder.encode(data));
      }
      controller.enqueue(encoder.encode('data: [DONE]\n\n'));
      controller.close();
    },
  });

  // stubGlobal lets vi.unstubAllGlobals() restore the real fetch afterwards
  vi.stubGlobal('fetch', vi.fn().mockResolvedValue(new Response(stream)));

  const result = await streamingChatCall('Hello');
  expect(result).toBe('Hello, how can I help?');
});
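
The test above enqueues one complete event per chunk, but a real network can split an event anywhere, including mid-JSON. The sketch below shows one way a parser might buffer partial events. It assumes the same wire format as the test (`data: {"delta": ...}` events separated by a blank line and ended by `data: [DONE]`). `assembleSSE` is a hypothetical helper, not the body of `streamingChatCall`.

```typescript
// Assumes the wire format used in the test above: `data: {"delta": "..."}` events
// separated by a blank line, terminated by `data: [DONE]`.
function assembleSSE(chunks: string[]): string {
  let buffer = '';
  let result = '';

  for (const chunk of chunks) {
    buffer += chunk;

    // Only complete events (terminated by a blank line) are parsed;
    // a trailing partial event stays in the buffer for the next chunk.
    let boundary = buffer.indexOf('\n\n');
    while (boundary !== -1) {
      const event = buffer.slice(0, boundary);
      buffer = buffer.slice(boundary + 2);
      boundary = buffer.indexOf('\n\n');

      if (!event.startsWith('data: ')) continue;
      const payload = event.slice('data: '.length);
      if (payload === '[DONE]') return result;

      const parsed = JSON.parse(payload) as { delta?: string };
      result += parsed.delta ?? '';
    }
  }
  return result;
}
```

A useful extra test case feeds chunks that break in the middle of an event and checks that the assembled text is unchanged. When decoding bytes rather than strings, `TextDecoder.decode(bytes, { stream: true })` similarly avoids corrupting multi-byte characters that straddle a chunk boundary.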

Key Takeaways

Mock the AI for unit and integration tests — fast, free, deterministic
Contract tests verify your code handles the response format correctly regardless of content
Behavioral tests verify your application handles any plausible AI response gracefully (long, empty, error)
Evaluation tests check actual AI quality — run these separately from your fast test suite
Test your guardrails explicitly: verify harmful content is blocked, PII is not exposed, system prompt is not leaked

Example

typescript

// Testing AI guardrails
describe('ContentModerationGuardrail', () => {
  it('blocks responses containing harmful content patterns', async () => {
    vi.mocked(ai.generateResponse).mockResolvedValue({
      content: 'Here are instructions for making explosives: ...',
    });

    const response = await getChatResponse('how to make something');
    expect(response.blocked).toBe(true);
    expect(response.content).toBe('I cannot help with that request.');
  });

  it('redacts PII from AI responses', async () => {
    vi.mocked(ai.generateResponse).mockResolvedValue({
      content: 'The user John Smith (SSN: 123-45-6789) has...',
    });

    const response = await getChatResponse('tell me about the user');
    expect(response.content).not.toContain('123-45-6789');
    expect(response.content).toContain('[REDACTED]');
  });
});
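
The guardrail tests above assert on behavior without showing what the guardrail does. The sketch below is one possible starting point for two of the checks named in the takeaways: PII redaction and system prompt leakage. `redactPII`, `leaksSystemPrompt`, the SSN-only pattern, and the 30-character overlap threshold are all illustrative assumptions. A production guardrail would need far broader coverage.

```typescript
// Matches US Social Security numbers written as 123-45-6789 (illustrative, not exhaustive).
const SSN_PATTERN = /\b\d{3}-\d{2}-\d{4}\b/g;

function redactPII(text: string): string {
  return text.replace(SSN_PATTERN, '[REDACTED]');
}

// Flags a response that repeats a long-enough verbatim slice of the system prompt.
function leaksSystemPrompt(response: string, systemPrompt: string, minOverlap = 30): boolean {
  const haystack = response.toLowerCase();
  const needle = systemPrompt.toLowerCase();
  if (needle.length === 0) return false;
  if (needle.length <= minOverlap) return haystack.includes(needle);

  // Slide a fixed-size window over the prompt and look for any verbatim match.
  for (let i = 0; i + minOverlap <= needle.length; i++) {
    if (haystack.includes(needle.slice(i, i + minOverlap))) return true;
  }
  return false;
}
```

A leak test in the same style as the ones above would mock `generateResponse` to echo part of the system prompt and assert that the response is blocked. Verbatim matching will not catch paraphrased leaks, so treat it as a floor rather than a complete defense.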
