Tutorials · 9 min read · February 20, 2026

Testing AI Features: A Practical Guide to Testing LLM Integrations

Testing code that calls a language model requires a different approach than testing deterministic functions. Here is a practical framework for testing AI features reliably.

DevForge Team

AI Development Educators

Language models are not deterministic functions. The same prompt can produce different outputs across calls. Outputs can vary in format, length, and content even when the intent is consistent. Testing a feature that wraps an LLM call requires rethinking what you're actually testing — and separating the concerns that can be tested traditionally from those that require a different approach.

This guide provides a practical framework for testing AI-powered features, from unit tests of your integration code to evaluation frameworks for model output quality.

## What You Can Test Traditionally

Not everything about an AI feature involves the model. Most of the integration code around LLM calls is deterministic and testable in standard ways:

Input construction: The function that builds your prompt from user input and context. Given user input X and context Y, does it produce prompt Z? This is a pure function — test it with standard assertions.

Output parsing: The function that extracts structured data from model output. Given a model response string, does your parser extract the correct fields? Test this with fixtures of typical model responses.

Error handling: What happens when the API returns an error? When the response is malformed? When a timeout occurs? Test these paths by mocking the API client to return errors.

Retry logic: Does your retry mechanism fire the correct number of times with the correct backoff? Test with a mock that fails N times then succeeds.

Rate limiting and cost controls: Do your token limit checks work correctly? Does your rate limiter reject requests that exceed the threshold?

All of this is testable without ever calling a real language model.
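Take the first item as an example: prompt construction is a pure function, so ordinary assertions cover it completely. A minimal sketch (the `buildSupportPrompt` name and shape are hypothetical, not from any specific library):

```typescript
// Hypothetical prompt builder: a pure function with no model call involved.
export function buildSupportPrompt(userMessage: string, categories: string[]): string {
  const list = categories.map((c) => `- ${c}`).join('\n');
  return [
    'Classify the following support message into exactly one category.',
    'Categories:',
    list,
    `Message: ${userMessage}`,
    'Respond with JSON: {"category": "<one of the categories above>"}',
  ].join('\n');
}
```

Because the function is pure, the same assertions work in any test runner: feed it a fixed input, assert exact substrings or the full output string.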

## Mocking the LLM Client

In unit tests, mock the LLM API client to return predetermined responses. This makes tests fast, deterministic, and free:

```typescript
import { describe, it, expect, vi } from 'vitest';
import OpenAI from 'openai';
import { classifyIntent } from './intent-classifier';

vi.mock('openai');

describe('classifyIntent', () => {
  it('returns the category from the model response', async () => {
    const mockCreate = vi.fn().mockResolvedValue({
      choices: [{
        message: {
          content: JSON.stringify({ category: 'billing', confidence: 0.95 })
        }
      }]
    });

    vi.mocked(OpenAI).mockImplementation(() => ({
      chat: { completions: { create: mockCreate } }
    } as any));

    const result = await classifyIntent('I need help with my invoice');

    expect(result.category).toBe('billing');
    expect(result.confidence).toBeGreaterThan(0.9);
    expect(mockCreate).toHaveBeenCalledOnce();
  });

  it('throws a structured error when the model returns invalid JSON', async () => {
    const mockCreate = vi.fn().mockResolvedValue({
      choices: [{ message: { content: 'I cannot categorize this.' } }]
    });

    vi.mocked(OpenAI).mockImplementation(() => ({
      chat: { completions: { create: mockCreate } }
    } as any));

    await expect(classifyIntent('test input')).rejects.toThrow('Invalid model response format');
  });
});
```

## Testing Output Parsing Robustly

LLMs produce outputs that vary in format. Your parser should handle the variations that occur in practice — not just the ideal output.

Build a fixture library of real model responses collected during development. Include:

  • The ideal formatted response
  • Responses with extra prose before and after the JSON
  • Responses where the model uses slightly different field names
  • Responses where the model adds explanation alongside the data
  • Responses where the model refuses to answer

Test your parser against all of these fixtures:

```typescript
import { describe, it, expect } from 'vitest';
import { parseClassificationResponse } from './parser';

const fixtures = {
  idealJson: '{"category": "billing", "confidence": 0.95}',
  withProse: 'Based on the user\'s message, I\'d categorize this as: {"category": "billing", "confidence": 0.95}',
  wrappedInMarkdown: '```json\n{"category": "billing", "confidence": 0.95}\n```',
  refusal: 'I cannot categorize this message as it contains personal information.'
};

describe('parseClassificationResponse', () => {
  it('parses ideal JSON', () => {
    expect(parseClassificationResponse(fixtures.idealJson).category).toBe('billing');
  });

  it('extracts JSON when surrounded by prose', () => {
    expect(parseClassificationResponse(fixtures.withProse).category).toBe('billing');
  });

  it('extracts JSON from markdown code blocks', () => {
    expect(parseClassificationResponse(fixtures.wrappedInMarkdown).category).toBe('billing');
  });

  it('returns null for refusal responses', () => {
    expect(parseClassificationResponse(fixtures.refusal)).toBeNull();
  });
});
```

## Integration Tests with Real Model Calls

Some tests must run against the real API to verify that your prompts actually work. These tests are slow, expensive, and non-deterministic — run them less frequently than unit tests.

```typescript
import { describe, it, expect } from 'vitest';
import { classifyIntent } from './intent-classifier';

describe.skipIf(!process.env.OPENAI_API_KEY)('classifyIntent integration', () => {
  it('classifies billing questions correctly', async () => {
    const result = await classifyIntent('When will my subscription renew?');
    expect(result.category).toBe('billing');
  }, 10000);

  it('classifies technical questions correctly', async () => {
    const result = await classifyIntent('How do I export my data as CSV?');
    expect(result.category).toBe('technical');
  }, 10000);
});
```

The `skipIf` condition ensures these tests only run when an API key is available — they're skipped in PRs but run in dedicated integration test runs.
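One way to wire up that split, assuming integration tests live in files named `*.integration.test.ts`, is a pair of npm scripts (a sketch, not a prescribed layout):

```json
{
  "scripts": {
    "test": "vitest run --exclude '**/*.integration.test.ts'",
    "test:integration": "vitest run integration"
  }
}
```

CI then runs `test` on every PR and `test:integration` on a schedule or before releases, where the API key is injected as a secret.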

## Snapshot Testing for Prompts

Your prompts are code. Unintentional changes to prompts can silently degrade model behavior. Use snapshot tests to detect when prompts change:

```typescript
import { describe, it, expect } from 'vitest';
import { buildClassificationPrompt } from './prompt-builder';

describe('buildClassificationPrompt', () => {
  it('matches the expected prompt structure', () => {
    const prompt = buildClassificationPrompt({
      userMessage: 'test message',
      availableCategories: ['billing', 'technical', 'general']
    });

    expect(prompt).toMatchSnapshot();
  });
});
```

When you intentionally change the prompt, update the snapshot. When you accidentally change it (through refactoring, dependency updates, or string concatenation bugs), the snapshot test catches it.

## Evaluation Frameworks for Output Quality

For applications where output quality matters — summarization, content generation, question answering — traditional assertions ("does the output equal X?") don't work because there's no single correct answer.

**LLM-as-judge:** Use a language model to evaluate the output of another language model. This sounds circular, but it works well in practice:

```typescript
import OpenAI from 'openai';

const openai = new OpenAI();

async function evaluateSummaryQuality(
  originalText: string,
  summary: string
): Promise<{ score: number; reasoning: string }> {
  const response = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [{
      role: 'user',
      content: `Rate this summary on a scale of 1-5 for accuracy and completeness.

Original: ${originalText}

Summary: ${summary}

Respond with JSON: {"score": number, "reasoning": string}`
    }]
  });

  return JSON.parse(response.choices[0].message.content ?? '{}');
}
```

**Deterministic property checks:** Some properties of good AI outputs are checkable without a judge. For a summarization feature:
- Does the summary contain fewer words than the original?
- Are all key entities from the original present in the summary?
- Does the summary avoid introducing facts not in the original?
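The first two properties are plain functions and test cleanly. A sketch for the summarization case (helper names are illustrative, not from any library):

```typescript
// Deterministic property checks for a summarizer's output.

// True when the summary has strictly fewer words than the original.
export function isShorter(original: string, summary: string): boolean {
  const words = (s: string) => s.trim().split(/\s+/).filter(Boolean).length;
  return words(summary) < words(original);
}

// Returns the entities the summary failed to preserve (empty array = all present).
export function missingEntities(summary: string, entities: string[]): string[] {
  const lower = summary.toLowerCase();
  return entities.filter((e) => !lower.includes(e.toLowerCase()));
}
```

The third property (no invented facts) is hard to verify deterministically and usually falls back to an LLM-as-judge check.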

**A/B testing in production:** For output quality improvements, the most reliable signal is user behavior. Run old and new prompts in parallel for a sample of users and measure which produces better outcomes (clicks, completions, ratings).
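Stable assignment matters here: a given user should see the same prompt variant on every request. A common approach is hashing the user ID (a sketch; the function name and salt are made up for illustration):

```typescript
import { createHash } from 'node:crypto';

// Deterministically assign a user to one of several prompt variants.
// Changing the salt re-randomizes assignment for a new experiment.
export function assignVariant(
  userId: string,
  variants: string[],
  salt = 'prompt-exp-1'
): string {
  const digest = createHash('sha256').update(`${salt}:${userId}`).digest();
  return variants[digest.readUInt32BE(0) % variants.length];
}
```

Log the assigned variant alongside the outcome metric so the comparison can be made offline.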

## Regression Testing When Upgrading Models

When you upgrade from gpt-4o to a new model version, or from one Anthropic model to another, behavior can change in ways that pass individual tests but degrade the overall experience.

Build a regression suite: a representative sample of real inputs with labeled expected outputs or quality scores. Run this suite against both the old and new model. Compare the distribution of outputs, not just individual test cases. Look for cases where the new model produces significantly different outputs — these require manual review to determine if the change is an improvement or a regression.
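The comparison step can start as simply as flagging the cases whose outputs diverge between versions. A sketch with hypothetical types:

```typescript
// One row of a regression suite: the same input run through both models.
export interface RegressionCase {
  input: string;
  oldOutput: string;
  newOutput: string;
}

// Cases where the new model disagrees with the old one; these need human review.
export function divergingCases(cases: RegressionCase[]): RegressionCase[] {
  return cases.filter((c) => c.oldOutput.trim() !== c.newOutput.trim());
}

// Fraction of cases where both models agree (1.0 = identical behavior).
export function agreementRate(cases: RegressionCase[]): number {
  if (cases.length === 0) return 1;
  return 1 - divergingCases(cases).length / cases.length;
}
```

Exact string comparison is a crude first pass; for free-form outputs you would swap in a semantic-similarity or judge-based comparison instead.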

## The Testing Pyramid for AI Features

Apply the testing pyramid principle:

- **Many unit tests:** Input construction, output parsing, error handling, retry logic — deterministic code around the LLM
- **Moderate integration tests:** Real API calls verifying that key prompts produce correct outputs for representative inputs
- **Few E2E tests:** Full user workflows involving AI features, verifying the end-to-end experience
- **Ongoing evaluation:** Automated quality scoring against a curated test set, run before every model upgrade

The goal is maximum confidence at minimum cost. Unit tests give you confidence in your integration code cheaply. Integration tests give you confidence in your prompts at moderate cost. Evaluation frameworks give you confidence in output quality at higher cost but catch the subtle regressions that other tests miss.

AI features are not untestable — they require a more sophisticated testing strategy than deterministic code, but the investment pays off in the same way: fewer production bugs, safer refactoring, and the ability to ship improvements with confidence.
Tags: AI Testing, LLM, Testing, Mocking, Vitest, OpenAI, Anthropic