AI-Assisted Testing & Tools
Testing AI Features: LLM Outputs and Prompts
Test non-deterministic AI features using mocking, contract testing, behavioral assertions, and evaluation strategies.
The AI Testing Challenge
AI features present testing challenges that don't exist for deterministic code:
- LLM outputs are non-deterministic — the same prompt may produce different responses each time
- Quality is subjective — "good" and "bad" responses are hard to define programmatically
- Calling real AI APIs in tests is slow and expensive
- Streaming responses require special handling
Four strategies address these challenges: mocking, contract testing, behavioral testing, and evaluation testing.
Strategy 1: Mock the AI
For unit and integration tests, replace the LLM call with a deterministic mock:
```typescript
import { render, screen } from '@testing-library/react';
import userEvent from '@testing-library/user-event';
import { vi } from 'vitest';
// Component under test (import path illustrative)
import { ChatInterface } from '@/components/ChatInterface';

vi.mock('@/lib/ai', () => ({
  generateResponse: vi.fn().mockResolvedValue({
    content: 'This is a test response from the AI.',
    usage: { inputTokens: 50, outputTokens: 20 },
  }),
}));

it('displays the AI response in the chat', async () => {
  render(<ChatInterface />);

  await userEvent.type(screen.getByRole('textbox'), 'Hello AI');
  await userEvent.click(screen.getByRole('button', { name: 'Send' }));

  const response = await screen.findByText('This is a test response from the AI.');
  expect(response).toBeInTheDocument();
});
```

This tests your application code without calling the real AI API. Fast, free, and deterministic.
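A fixed mock returns the same canned answer everywhere. To exercise retry or fallback paths, script a sequence of responses instead; with Vitest you would chain `mockResolvedValueOnce` per call. The sketch below shows the same idea with a hand-rolled scripted fake so it stands alone; `generateWithRetry`, `scriptedClient`, and the client shape are assumptions for illustration, not part of the API above:

```typescript
type AIResponse = { content: string };
type AIClient = { generateResponse: (prompt: string) => Promise<AIResponse> };

// Hypothetical retry policy: ask again once if the model returns empty content.
async function generateWithRetry(client: AIClient, prompt: string): Promise<string> {
  const first = await client.generateResponse(prompt);
  if (first.content.trim() !== '') return first.content;
  const second = await client.generateResponse(prompt);
  return second.content;
}

// A scripted fake: returns queued responses in order, like chained
// vi.fn().mockResolvedValueOnce(...) calls would.
function scriptedClient(responses: AIResponse[]): AIClient & { calls: number } {
  const client = {
    calls: 0,
    async generateResponse(_prompt: string): Promise<AIResponse> {
      return responses[client.calls++];
    },
  };
  return client;
}

async function demo(): Promise<void> {
  const client = scriptedClient([{ content: '' }, { content: 'Here is the answer.' }]);
  const result = await generateWithRetry(client, 'Hello');
  console.log(result);       // "Here is the answer."
  console.log(client.calls); // 2
}

demo();
```

Scripting the sequence lets one fast, deterministic test cover a multi-call code path that would be awkward to reproduce against a real model.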
Strategy 2: Contract Testing
Verify your code correctly handles the expected response format, regardless of content:
```typescript
it('handles the OpenAI response format correctly', async () => {
  const mockResponse = {
    id: 'chatcmpl-123',
    choices: [
      {
        message: { role: 'assistant', content: 'Hello!' },
        finish_reason: 'stop',
      },
    ],
    usage: { prompt_tokens: 10, completion_tokens: 5, total_tokens: 15 },
  };

  vi.mocked(fetch).mockResolvedValue(
    new Response(JSON.stringify(mockResponse), { status: 200 })
  );

  const result = await callChatAPI('Hello');

  expect(result.content).toBe('Hello!');
  expect(result.tokensUsed).toBe(15);
});
```

Strategy 3: Behavioral Testing
Test that your system behaves correctly given any plausible AI response:
```typescript
it('handles very long AI responses', async () => {
  vi.mocked(ai.generateResponse).mockResolvedValue({
    content: 'A'.repeat(10000), // Extremely long response
  });

  render(<ChatInterface />);
  await userEvent.type(screen.getByRole('textbox'), 'Tell me everything');
  await userEvent.click(screen.getByRole('button', { name: 'Send' }));

  // Application should handle long responses gracefully
  const responseArea = await screen.findByRole('article');
  expect(responseArea).toBeVisible();
  expect(responseArea).not.toHaveStyle('overflow: hidden');
});

it('shows error message when AI API fails', async () => {
  vi.mocked(ai.generateResponse).mockRejectedValue(new Error('API unavailable'));

  render(<ChatInterface />);
  await userEvent.type(screen.getByRole('textbox'), 'Hello AI');
  await userEvent.click(screen.getByRole('button', { name: 'Send' }));

  const error = await screen.findByRole('alert');
  expect(error).toHaveTextContent('Unable to get response');
});
```

Strategy 4: Evaluation Testing (Prompt Regression)
For testing actual AI quality, maintain a test set of prompts with expected output characteristics:
```typescript
// This test calls the real AI API — run separately from unit tests
describe.skip('AI Quality Evaluation (run with: npm run test:eval)', () => {
  const evalCases = [
    {
      input: 'Explain React hooks in one sentence',
      checks: [
        (response: string) => response.length > 20 && response.length < 200,
        (response: string) => response.toLowerCase().includes('function'),
        (response: string) => !/<script/i.test(response), // No XSS
      ],
    },
    {
      input: 'What is 2 + 2?',
      checks: [
        (response: string) => response.includes('4'),
      ],
    },
  ];

  for (const { input, checks } of evalCases) {
    it(`handles: "${input}"`, async () => {
      const response = await callRealAI(input);
      for (const check of checks) {
        expect(check(response)).toBe(true);
      }
    });
  }
});
```

Testing Streaming Responses
Streaming responses arrive as a sequence of server-sent-event chunks; simulate them in tests by constructing a ReadableStream:
```typescript
it('assembles streaming response correctly', async () => {
  const chunks = ['Hello', ', ', 'how', ' can', ' I', ' help?'];
  const encoder = new TextEncoder();

  const stream = new ReadableStream({
    start(controller) {
      for (const chunk of chunks) {
        const data = `data: ${JSON.stringify({ delta: chunk })}\n\n`;
        controller.enqueue(encoder.encode(data));
      }
      controller.enqueue(encoder.encode('data: [DONE]\n\n'));
      controller.close();
    },
  });

  global.fetch = vi.fn().mockResolvedValue(new Response(stream));

  const result = await streamingChatCall('Hello');
  expect(result).toBe('Hello, how can I help?');
});
```

Key Takeaways
- Mock the AI for unit and integration tests — fast, free, deterministic
- Contract tests verify your code handles the response format correctly regardless of content
- Behavioral tests verify your application handles any plausible AI response gracefully (long, empty, error)
- Evaluation tests check actual AI quality — run these separately from your fast test suite
- Test your guardrails explicitly: verify harmful content is blocked, PII is not exposed, system prompt is not leaked
Example
```typescript
// Testing AI guardrails
describe('ContentModerationGuardrail', () => {
  it('blocks responses containing harmful content patterns', async () => {
    vi.mocked(ai.generateResponse).mockResolvedValue({
      content: 'Here are instructions for making explosives: ...',
    });

    const response = await getChatResponse('how to make something');

    expect(response.blocked).toBe(true);
    expect(response.content).toBe('I cannot help with that request.');
  });

  it('redacts PII from AI responses', async () => {
    vi.mocked(ai.generateResponse).mockResolvedValue({
      content: 'The user John Smith (SSN: 123-45-6789) has...',
    });

    const response = await getChatResponse('tell me about the user');

    expect(response.content).not.toContain('123-45-6789');
    expect(response.content).toContain('[REDACTED]');
  });
});
```
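The redaction behavior exercised above could be backed by a helper along these lines. This is a minimal sketch under assumptions: the `redactPII` name and the SSN and email patterns are illustrative, and a real system would need much broader detection (names, phone numbers, addresses):

```typescript
// Hypothetical PII redaction pass applied to AI output before it reaches
// the user. Patterns here are illustrative only, not a complete detector.
const PII_PATTERNS: RegExp[] = [
  /\b\d{3}-\d{2}-\d{4}\b/g,       // US SSN, e.g. 123-45-6789
  /\b[\w.+-]+@[\w-]+\.[\w.]+\b/g, // email addresses
];

function redactPII(content: string): string {
  let redacted = content;
  for (const pattern of PII_PATTERNS) {
    redacted = redacted.replace(pattern, '[REDACTED]');
  }
  return redacted;
}

console.log(redactPII('The user John Smith (SSN: 123-45-6789) has...'));
// "The user John Smith (SSN: [REDACTED]) has..."
```

Note that the name "John Smith" survives this pass: regex patterns only catch structured identifiers, which is exactly why the guardrail tests above assert on concrete strings like the SSN rather than on "all PII removed".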