Test Writing With AI
QA
AI tests cover the happy path. They miss edge cases, accessibility, and 'what would a malicious user do?' You own strategy.
Test Automation
AI-generated tests are maintainable if you own the patterns. Let AI write the first draft; you own the framework and structure.
Frontend
AI writes component tests. It doesn't know your design system's breakpoints or that 'Submit' is disabled until the form is valid.
Test Writing With AI
TL;DR
- AI is great at generating unit tests for well-structured, single-responsibility functions.
- AI struggles with integration tests, E2E flows, and "what should we actually test?"
- Use AI for coverage of the obvious. You decide what's worth testing and what edge cases matter.
TestGenEval (68,647 tests, 11 Python repos) shows GPT-5.2 and other leading LLMs at 35–40% average code coverage. They struggle with execution reasoning and complex paths. TestForge (March 2025), a feedback-driven agent, hits an 84.3% pass rate and 44.4% line coverage at $0.63/file. Diffblue Cover (autonomous Java test generation) beats Claude Code by 3–10x on coverage (54% vs 17% on Apache Tika). The takeaway: pure LLM test generation is draft quality. Add execution feedback or symbolic methods and it gets real. A user study of 161 devs found LLM-generated tests easier to understand than traditional auto-generation. AI can write tests. The right ones? Still your call.
What AI Does Well
Unit Tests for Pure Functions
Prompt: "Write Jest tests for this function that calculates order total."
What you get: Tests for valid inputs, edge cases like zero and negative (if you ask), maybe a few boundary values. Often correct. Fast to generate.
Why it works: Pure functions have clear inputs and outputs. AI can enumerate cases. Low context, high structure.
Test Skeletons for Components
Prompt: "Write React Testing Library tests for this LoginForm component."
What you get: Renders. Maybe checks for a button. Possibly fills a form and clicks submit.
What's missing: Accessibility assertions. Keyboard navigation. "What if the API returns 401?" Error states. Loading states. Your actual UX flows.
Mocking and Fixtures
Prompt: "Create mock data for a user with 3 orders."
What you get: A plausible JSON structure. Maybe TypeScript types.
What's missing: Data that triggers edge cases. Invalid data. Data that matches your actual API contract.
What AI Does Poorly
Integration Tests
Why: Integration tests depend on real (or realistic) services, DB state, and orchestration. AI doesn't know your infrastructure. It'll give you a sketch. You have to wire it up, handle flakiness, and decide what "success" means.
E2E and User Flows
Why: E2E tests are "click here, then here, then this should happen." AI can generate Playwright or Cypress code — but it doesn't know your app's flow, your selectors, or what "done" looks like. You end up rewriting half of it.
Test Strategy
Why: "What should we test?" is a product and risk question. AI can't answer "we need to test payment flows because we've had 3 prod incidents." You decide scope, priority, and what's in vs. out of scope.
The Coverage Ceiling (2025 Benchmarks)
TestGenEval and Diffblue vs LLM comparisons show: standalone LLMs (GPT-5.2, Claude Sonnet 4.6) plateau around 17–35% line coverage on real codebases. They miss complex branching and integration scenarios. TestForge and Diffblue improve by feeding execution results back — run tests, see failures, regenerate. ASTER (LLM + static analysis) produces more natural, readable tests devs prefer. The pattern: LLM for first draft + feedback loop = better. LLM alone = draft quality, not ship-ready.
Flaky and Over-Specified Tests
Why: AI tends to assert on implementation details (class names, internal state) or timing. Tests that break when you refactor. You spend more time fixing tests than writing them. Human review catches this.
How to Use AI for Testing
- Generate first draft. "Write unit tests for this function." Get coverage of the obvious.
- Review and prune. Delete tests that don't matter. Fix assertions that are too brittle.
- Add what AI missed. Edge cases. Error paths. "What if the user does X?" That's your domain knowledge.
- Own the framework. AI doesn't know your team's patterns. Consistent structure, naming, and setup — that's you.
Quick Check
AI generates Jest tests for a pure function. What's typically missing that you need to add?
Without AI: you manually write every test — happy path, edge cases, mocks, fixtures. Hours per component. Or you skip tests because it's tedious.
Do This Next
- Generate tests for one function or component with Cursor, Copilot, or ChatGPT. Review the output. How many tests would you keep? What would you add? Compare to the 35% coverage ceiling.
- Document one "AI test that's wrong or useless." Build your intuition for when AI tests are worth the review time.
- Try a feedback loop. If using an agentic tool (e.g., Cursor's agent mode), have it run tests and regenerate on failure. That's where coverage jumps.