Testing AI Features: LLM Outputs and Prompts

Test non-deterministic AI features using mocking, contract testing, behavioral assertions, and evaluation strategies.

The AI Testing Challenge

AI features present testing challenges that don't exist for deterministic code:

  • LLM outputs are non-deterministic — the same prompt may produce different responses each time
  • Quality is subjective — "good" and "bad" responses are hard to define programmatically
  • Calling real AI APIs in tests is slow and expensive
  • Streaming responses arrive in chunks, so tests must also cover how partial output is parsed and assembled

Four strategies address these challenges: mock, contract, behavioral, and evaluation testing.

Strategy 1: Mock the AI

For unit and integration tests, replace the LLM call with a deterministic mock:

typescript
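import { expect, it, vi } from 'vitest';
import { render, screen } from '@testing-library/react';
import userEvent from '@testing-library/user-event';
import '@testing-library/jest-dom/vitest'; // adds toBeInTheDocument and related matchers
// Assumed path; adjust to wherever ChatInterface lives in your project.
import { ChatInterface } from '@/components/ChatInterface';

// vi.mock calls are hoisted above the imports, so ChatInterface receives the mocked module.
vi.mock('@/lib/ai', () => ({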
  generateResponse: vi.fn().mockResolvedValue({
    content: 'This is a test response from the AI.',
    usage: { inputTokens: 50, outputTokens: 20 },
  }),
}));

it('displays the AI response in the chat', async () => {
  render(<ChatInterface />);
  await userEvent.type(screen.getByRole('textbox'), 'Hello AI');
  await userEvent.click(screen.getByRole('button', { name: 'Send' }));

  const response = await screen.findByText('This is a test response from the AI.');
  expect(response).toBeInTheDocument();
});

This tests your application code without calling the real AI API. Fast, free, and deterministic.
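
When the code under test receives its AI client as a parameter, a hand-written fake can play the same role as a module mock. The following is a minimal sketch, assuming a hypothetical `AiClient` interface whose `generateResponse` returns the same shape as the mock above; `AiClient`, `AiResponse`, and `createFakeAi` are illustrative names, not part of any library:

```typescript
// Hypothetical types mirroring the mocked response shape used above.
interface AiResponse {
  content: string;
  usage: { inputTokens: number; outputTokens: number };
}

interface AiClient {
  generateResponse(prompt: string): Promise<AiResponse>;
}

// Returns a fake client that replays scripted replies in order and
// records every prompt it receives, so tests can assert on both.
function createFakeAi(replies: string[]): AiClient & { prompts: string[] } {
  const prompts: string[] = [];
  let next = 0;
  return {
    prompts,
    async generateResponse(prompt: string): Promise<AiResponse> {
      prompts.push(prompt);
      const content = replies[next];
      if (content === undefined) {
        throw new Error(`FakeAi: no scripted reply for call #${next + 1}`);
      }
      next += 1;
      return { content, usage: { inputTokens: 0, outputTokens: 0 } };
    },
  };
}
```

A fake like this fails loudly when the application makes more calls than the test scripted, which can surface accidental duplicate requests.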

Strategy 2: Contract Testing

Verify your code correctly handles the expected response format, regardless of content:

typescript
it('handles the OpenAI response format correctly', async () => {
  const mockResponse = {
    id: 'chatcmpl-123',
    choices: [
      {
        message: { role: 'assistant', content: 'Hello!' },
        finish_reason: 'stop',
      },
    ],
    usage: { prompt_tokens: 10, completion_tokens: 5, total_tokens: 15 },
  };

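  // Stub the global fetch so no network request is made.
  // Restore it afterwards with vi.unstubAllGlobals(), for example in afterEach.
  vi.stubGlobal(
    'fetch',
    vi.fn().mockResolvedValue(new Response(JSON.stringify(mockResponse), { status: 200 }))
  );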

  const result = await callChatAPI('Hello');
  expect(result.content).toBe('Hello!');
  expect(result.tokensUsed).toBe(15);
});
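
The contract can also be enforced at runtime, so a malformed payload fails with a clear error instead of an `undefined` deep inside the UI. Below is a minimal sketch of the kind of parsing a function like `callChatAPI` might perform; `parseChatCompletion`, `ChatResult`, and `isRecord` are hypothetical names, and the field names are taken from the mock payload above:

```typescript
// Normalized shape the rest of the app consumes (matches the test above).
interface ChatResult {
  content: string;
  tokensUsed: number;
}

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === 'object' && value !== null;
}

// Validates an OpenAI-style chat completion payload and extracts the fields
// the app needs. Throws a descriptive error on any contract violation.
function parseChatCompletion(raw: unknown): ChatResult {
  if (!isRecord(raw)) throw new Error('Response is not an object');

  const choices = raw.choices;
  if (!Array.isArray(choices) || choices.length === 0) {
    throw new Error('Response has no choices');
  }

  const first: unknown = choices[0];
  const message = isRecord(first) ? first.message : undefined;
  const content = isRecord(message) ? message.content : undefined;
  if (typeof content !== 'string') {
    throw new Error('First choice has no string message.content');
  }

  const usage = raw.usage;
  const totalTokens = isRecord(usage) ? usage.total_tokens : undefined;
  if (typeof totalTokens !== 'number') {
    throw new Error('Response has no numeric usage.total_tokens');
  }

  return { content, tokensUsed: totalTokens };
}
```

Contract tests can then feed both well-formed and deliberately broken payloads to the parser and assert on the result or the thrown error.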

Strategy 3: Behavioral Testing

Test that your system behaves correctly given any plausible AI response:

typescript
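// Assumes `import * as ai from '@/lib/ai'` and the vi.mock('@/lib/ai', ...) setup from Strategy 1.
it('handles very long AI responses', async () => {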
  vi.mocked(ai.generateResponse).mockResolvedValue({
    content: 'A'.repeat(10000), // Extremely long response
  });

  render(<ChatInterface />);
  await userEvent.type(screen.getByRole('textbox'), 'Tell me everything');
  await userEvent.click(screen.getByRole('button', { name: 'Send' }));

  // Application should handle long responses gracefully
  const responseArea = await screen.findByRole('article');
  expect(responseArea).toBeVisible();
  expect(responseArea).not.toHaveStyle('overflow: hidden');
});

it('shows error message when AI API fails', async () => {
  vi.mocked(ai.generateResponse).mockRejectedValue(new Error('API unavailable'));

  render(<ChatInterface />);
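  // Type a message first: a Send button may do nothing when the input is empty
  await userEvent.type(screen.getByRole('textbox'), 'Hello AI');
  await userEvent.click(screen.getByRole('button', { name: 'Send' }));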

  const error = await screen.findByRole('alert');
  expect(error).toHaveTextContent('Unable to get response');
});
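
Edge cases are easier to keep exhaustive when they live in a table rather than in separate hand-written tests. The sketch below assumes a hypothetical `toDisplayText` helper that normalizes model output before rendering; the 4000-character limit and the fallback message are illustrative values, not recommendations:

```typescript
// Hypothetical display limit and fallback copy; choose values that fit your UI.
const MAX_DISPLAY_CHARS = 4000;
const EMPTY_FALLBACK = 'The assistant returned an empty response.';

// Normalizes raw model output before it reaches the UI:
// blank output becomes a fallback message, oversized output is truncated.
function toDisplayText(content: string): string {
  const trimmed = content.trim();
  if (trimmed.length === 0) return EMPTY_FALLBACK;
  if (trimmed.length <= MAX_DISPLAY_CHARS) return trimmed;
  return trimmed.slice(0, MAX_DISPLAY_CHARS) + '…';
}

// Table of plausible but awkward responses and the property each must satisfy.
const edgeCases: Array<{ name: string; content: string; ok: (out: string) => boolean }> = [
  { name: 'empty', content: '', ok: (out) => out === EMPTY_FALLBACK },
  { name: 'whitespace only', content: ' \n\t ', ok: (out) => out === EMPTY_FALLBACK },
  { name: 'very long', content: 'A'.repeat(10000), ok: (out) => out.length === MAX_DISPLAY_CHARS + 1 },
  { name: 'normal', content: '  Hello!  ', ok: (out) => out === 'Hello!' },
];

// Returns the names of the cases whose property does not hold.
function failedEdgeCases(): string[] {
  return edgeCases
    .filter((testCase) => !testCase.ok(toDisplayText(testCase.content)))
    .map((testCase) => testCase.name);
}
```

In a Vitest suite the same table could drive `it.each(edgeCases)('handles $name response', ...)`, so adding a new edge case is a one-line change.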

Strategy 4: Evaluation Testing (Prompt Regression)

For testing actual AI quality, maintain a test set of prompts with expected output characteristics:

typescript
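// This suite calls the real AI API, so it only runs when RUN_AI_EVALS=true
// (an assumed variable name that the test:eval script would set).
describe.runIf(process.env.RUN_AI_EVALS === 'true')('AI Quality Evaluation (run with: npm run test:eval)', () => {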
  const evalCases = [
    {
      input: 'Explain React hooks in one sentence',
      checks: [
        (response: string) => response.length > 20 && response.length < 200,
        (response: string) => response.toLowerCase().includes('function'),
        (response: string) => !/<script/i.test(response), // No script tags (a narrow check, not full XSS protection)
      ],
    },
    {
      input: 'What is 2 + 2?',
      checks: [
        (response: string) => response.includes('4'),
      ],
    },
  ];

  for (const { input, checks } of evalCases) {
    it(`handles: "${input}"`, async () => {
      const response = await callRealAI(input);
      for (const check of checks) {
        expect(check(response)).toBe(true);
      }
    });
  }
});
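
Because the model is non-deterministic, a single passing or failing run says little. One option is to sample each prompt several times and assert on the pass rate, with named checks so that a failure report says which expectation was missed. A minimal sketch under those assumptions; `NamedCheck`, `failedChecks`, and `passRate` are hypothetical helpers:

```typescript
// A named check makes failures self-explanatory in eval reports.
interface NamedCheck {
  name: string;
  passes: (response: string) => boolean;
}

// Returns the names of all checks that a single response fails.
function failedChecks(response: string, checks: NamedCheck[]): string[] {
  return checks.filter((check) => !check.passes(response)).map((check) => check.name);
}

// Fraction of sampled responses (0 to 1) that pass every check.
// Sampling the same prompt several times accounts for non-determinism.
function passRate(samples: string[], checks: NamedCheck[]): number {
  if (samples.length === 0) return 0;
  const passing = samples.filter((sample) => failedChecks(sample, checks).length === 0);
  return passing.length / samples.length;
}

// The checks from the first eval case above, now with names.
const hooksChecks: NamedCheck[] = [
  { name: 'length between 20 and 200', passes: (r) => r.length > 20 && r.length < 200 },
  { name: 'mentions "function"', passes: (r) => r.toLowerCase().includes('function') },
  { name: 'no <script> tag', passes: (r) => !/<script/i.test(r) },
];
```

An eval test could collect, say, five responses from `callRealAI` for the same input and assert that `passRate(samples, hooksChecks)` stays above a threshold you choose, such as 0.8.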

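Testing Streaming Responses

Simulate the stream with a ReadableStream that emits server-sent events, then verify that your client assembles the chunks into the full message: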

typescript
it('assembles streaming response correctly', async () => {
  const chunks = ['Hello', ', ', 'how', ' can', ' I', ' help?'];
  const encoder = new TextEncoder();

  const stream = new ReadableStream({
    start(controller) {
      for (const chunk of chunks) {
        const data = `data: ${JSON.stringify({ delta: chunk })}\n\n`;
        controller.enqueue(encoder.encode(data));
      }
      controller.enqueue(encoder.encode('data: [DONE]\n\n'));
      controller.close();
    },
  });

  // Stub the global fetch; restore it afterwards with vi.unstubAllGlobals()
  vi.stubGlobal('fetch', vi.fn().mockResolvedValue(new Response(stream)));

  const result = await streamingChatCall('Hello');
  expect(result).toBe('Hello, how can I help?');
});
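
The test above delivers each event in its own chunk, but a real network may split an event anywhere, including between the two newlines that end it. A parser that buffers until it sees a complete event handles both cases. This is a minimal sketch, assuming the `data: {"delta": ...}` and `data: [DONE]` wire format used in the test; `createSseAssembler` is a hypothetical helper, not a library function:

```typescript
// Incrementally assembles text from an SSE-style stream whose events look
// like `data: {"delta":"..."}` and end with `data: [DONE]`.
// Tolerates events that are split across chunk boundaries.
function createSseAssembler() {
  let buffer = ''; // text received but not yet terminated by a blank line
  let text = ''; // deltas assembled so far
  let done = false;

  return {
    push(chunk: string): void {
      buffer += chunk;
      let boundary = buffer.indexOf('\n\n');
      while (boundary !== -1) {
        const event = buffer.slice(0, boundary);
        buffer = buffer.slice(boundary + 2);
        boundary = buffer.indexOf('\n\n');

        // Ignore anything after [DONE] and lines that are not data events.
        if (done || !event.startsWith('data: ')) continue;
        const payload = event.slice('data: '.length);
        if (payload === '[DONE]') {
          done = true;
        } else {
          const parsed = JSON.parse(payload) as { delta?: unknown };
          if (typeof parsed.delta === 'string') text += parsed.delta;
        }
      }
    },
    get text(): string {
      return text;
    },
    get done(): boolean {
      return done;
    },
  };
}
```

When reading from a real response body, decode each `Uint8Array` with `TextDecoder.decode(value, { stream: true })` before calling `push`, so multi-byte characters split across chunks are not corrupted.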

Key Takeaways

  • Mock the AI for unit and integration tests — fast, free, deterministic
  • Contract tests verify your code handles the response format correctly regardless of content
  • Behavioral tests verify your application handles any plausible AI response gracefully (long, empty, error)
  • Evaluation tests check actual AI quality — run these separately from your fast test suite
  • Test your guardrails explicitly: verify harmful content is blocked, PII is not exposed, system prompt is not leaked

Example

typescript
// Testing AI guardrails
describe('ContentModerationGuardrail', () => {
  it('blocks responses containing harmful content patterns', async () => {
    vi.mocked(ai.generateResponse).mockResolvedValue({
      content: 'Here are instructions for making explosives: ...',
    });

    const response = await getChatResponse('how to make something');
    expect(response.blocked).toBe(true);
    expect(response.content).toBe('I cannot help with that request.');
  });

  it('redacts PII from AI responses', async () => {
    vi.mocked(ai.generateResponse).mockResolvedValue({
      content: 'The user John Smith (SSN: 123-45-6789) has...',
    });

    const response = await getChatResponse('tell me about the user');
    expect(response.content).not.toContain('123-45-6789');
    expect(response.content).toContain('[REDACTED]');
  });
});
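
The tests above treat the guardrail as a black box. For illustration, here is a minimal sketch of two pure helpers such a guardrail might contain, including a check for the system prompt leakage mentioned in the takeaways. `redactPii` and `leaksSystemPrompt` are hypothetical names, and the SSN pattern and six-word threshold are assumptions; production PII detection needs far broader coverage:

```typescript
// Matches US Social Security numbers in the common 3-2-4 digit format.
// Illustrative only: real PII detection covers many more patterns.
const SSN_PATTERN = /\b\d{3}-\d{2}-\d{4}\b/g;

function redactPii(text: string): string {
  return text.replace(SSN_PATTERN, '[REDACTED]');
}

// Flags a response that repeats any run of `minWords` consecutive words from
// the system prompt verbatim (case-insensitive, whitespace-normalized).
// Prompts shorter than `minWords` words are never flagged.
function leaksSystemPrompt(response: string, systemPrompt: string, minWords = 6): boolean {
  const toWords = (s: string): string[] => s.toLowerCase().split(/\s+/).filter(Boolean);
  const promptWords = toWords(systemPrompt);
  const haystack = ` ${toWords(response).join(' ')} `;

  for (let i = 0; i + minWords <= promptWords.length; i++) {
    const phrase = promptWords.slice(i, i + minWords).join(' ');
    if (haystack.includes(` ${phrase} `)) return true;
  }
  return false;
}
```

Keeping these checks as pure functions lets you unit test them directly with many inputs, while the mocked tests above confirm they are actually wired into `getChatResponse`.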