
AI Reliability and Reproducibility

5 min read

Backend

Same input, different output breaks your tests and your users' trust. Pin it down with temperature, seeds, caching, and structured outputs.

Platform

Non-determinism is a production risk. Treat it like flaky infrastructure — measure it, contain it, build fallbacks.

ML Eng

You already think in terms of reproducibility. Apply it: seed, cache, pin model versions, and validate outputs against schemas.


TL;DR

  • LLMs are non-deterministic by default. Same prompt, different output. In production, that's a bug.
  • The fix: temperature=0, seed parameter, structured outputs, caching, model version pinning. Stack them.
  • The real culprit isn't just GPU randomness — it's batch size variability during inference. Thinking Machines Lab showed that 100% determinism is achievable (Feb 2026).
  • You won't get bit-for-bit identical outputs from every provider yet. But "functionally equivalent" — same schema, same classification, same decision — is achievable today. That's the real goal.

add(2, 3) returns 5. Every time. On every machine. That's a pure function. Now try: classify("I love this product") through an LLM. First run: positive. Second run: very positive. Third run: positive sentiment detected. Same input. Three different outputs. Your downstream code just broke three different ways.

Here's how to close the gap.

Why LLMs Aren't Pure Functions

Four culprits — and one that matters more than you think:

  • Temperature and sampling. Temperature > 0 = the model rolls dice on which token comes next. Different roll, different output.
  • Silent model updates. Provider ships a new version. Your prompts produce different outputs overnight. No code change on your side. You find out when a customer complains.
  • Batch size variability — the real culprit. This one surprised everyone. In Feb 2026, Mira Murati's Thinking Machines Lab published research showing the primary cause of non-determinism isn't floating-point math or GPU concurrency — it's how requests get batched. Your prompt gets grouped with other requests into a "carpool." Busy system = big batch. Quiet system = small batch. Different batch size = different order of math operations = different output. Even at temperature=0.
  • Butterfly-effect inputs. One extra space. Slightly different word order. The model takes a completely different path. Like chaos theory, but for text.
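The batch-size mechanism bottoms out in a basic property of floating-point arithmetic: addition is not associative, so regrouping the same reduction gives slightly different results. A minimal illustration:

```typescript
// Floating-point addition is not associative: regrouping the same three
// numbers changes the result in the last bits. Batching reorders exactly
// these kinds of reductions, and tiny differences compound across layers
// until a different token wins the argmax.
const a = 0.1;
const b = 0.2;
const c = 0.3;

const leftToRight = (a + b) + c; // 0.6000000000000001
const rightToLeft = a + (b + c); // 0.6

const identical = leftToRight === rightToLeft; // false
```

Scale that last-bit difference across billions of operations per forward pass and the "same" computation can land on a different token.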

Fun fact from MIT (Feb 2026): LLM ranking platforms are so fragile that removing just 2 votes out of 57,000 — that's 0.0035% — can flip which model tops the leaderboard. If benchmarks themselves aren't deterministic, imagine your production pipeline.

Call an LLM with the same prompt 5 times. Get 5 different JSON shapes. The downstream parser crashes on 3 of them. An engineer spends hours debugging "flaky AI." Retries make it worse: each retry is a new conversation with a different outcome.


The Reproducibility Toolkit

Think of these as layers. Stack more for higher stakes.

1. Temperature = 0

The single biggest lever. The model picks the highest-probability token every time instead of sampling. Not perfectly deterministic (batch variance), but dramatically more consistent.

Use for: Extraction, classification, structured data — anything where creativity is a liability. Skip for: Creative writing, brainstorming, diverse suggestions — temperature 0.7–1.0 is the point there.
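The difference is greedy argmax versus weighted sampling. A toy sketch over a made-up next-token distribution (the probabilities are illustrative, not real model output):

```typescript
// Toy next-token distribution, standing in for real model logits.
type TokenProb = { token: string; p: number };

const probs: TokenProb[] = [
  { token: "positive", p: 0.6 },
  { token: "very positive", p: 0.3 },
  { token: "neutral", p: 0.1 },
];

// temperature = 0: always take the highest-probability token.
// Same input, same answer, every call.
function greedy(dist: TokenProb[]): string {
  return dist.reduce((best, t) => (t.p > best.p ? t : best)).token;
}

// temperature > 0: sample proportionally to probability.
// Same input, different answer depending on the roll.
function sample(dist: TokenProb[], rand: () => number = Math.random): string {
  let r = rand();
  for (const t of dist) {
    if (r < t.p) return t.token;
    r -= t.p;
  }
  return dist[dist.length - 1].token;
}
```

`greedy(probs)` returns "positive" on every call; `sample(probs)` returns "very positive" roughly 30% of the time, which is exactly the drift the opening example showed.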

2. Seed Parameter

OpenAI and others support a seed parameter. Same seed + same input = same output (within a model version). Think Math.random(seed) — reproducible randomness. Caveat: model update = seed breaks. Always pair with version pinning.
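The mechanism is the same one behind any seeded PRNG. A sketch using mulberry32, a small deterministic generator standing in for the provider's sampler:

```typescript
// mulberry32: a tiny seeded PRNG. Same seed, same stream of "random"
// numbers, which is exactly the reproducibility contract a seed
// parameter gives you (within one model version).
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

const runA = mulberry32(42);
const runB = mulberry32(42);
// runA() === runB(), call after call: reproducible randomness.
```

The caveat carries over too: change the generator (i.e., the model version) and the same seed produces a different stream.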

3. Structured Outputs (JSON Schema)

Force the model to return a specific shape. OpenAI's response_format: { type: "json_schema" }, Anthropic's tool-use schemas, Google's controlled generation — they all constrain the output space. sentiment locked to ["positive", "negative", "neutral"] can't drift to "kinda positive." Schema = guardrails. Fewer degrees of freedom = more reproducible.
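Client-side, the same constraint is worth mirroring as a type plus a runtime guard. A sketch (the real enforcement happens server-side via the schema; this is the belt to that suspender):

```typescript
// The enum from the schema, mirrored as a TypeScript type + runtime check.
const SENTIMENTS = ["positive", "negative", "neutral"] as const;
type Sentiment = (typeof SENTIMENTS)[number];

function parseSentiment(raw: string): { sentiment: Sentiment; confidence: number } {
  const obj = JSON.parse(raw);
  if (!SENTIMENTS.includes(obj.sentiment)) {
    // "kinda positive" fails loudly here instead of crashing a parser downstream
    throw new Error(`sentiment outside enum: ${obj.sentiment}`);
  }
  if (typeof obj.confidence !== "number") {
    throw new Error("confidence must be a number");
  }
  return obj;
}
```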

4. Caching

The purest "pure function" move: seen this exact input before? Return the cached output. No LLM call. Identical every time. Zero cost. Hash the input (prompt + context + parameters) → cache key. Set a TTL. Best for high-volume, low-variance queries — FAQ bots, repeated classifications.
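A sketch of the key derivation and lookup, using an in-memory Map where production would use Redis or similar (`cachedCall` and the input fields are illustrative names):

```typescript
import { createHash } from "node:crypto";

// The key covers everything that affects the output: prompt, model, parameters.
function cacheKey(input: { prompt: string; model: string; temperature: number }): string {
  const canonical = JSON.stringify(input, Object.keys(input).sort()); // stable key order
  return createHash("sha256").update(canonical).digest("hex");
}

const cacheStore = new Map<string, unknown>(); // stand-in for Redis with a TTL

async function cachedCall<T>(key: string, call: () => Promise<T>): Promise<T> {
  if (cacheStore.has(key)) return cacheStore.get(key) as T; // pure-function path: no LLM call
  const result = await call();
  cacheStore.set(key, result);
  return result;
}
```

Note the sorted-keys serialization: without it, two objects with the same fields in different order would hash to different keys and silently halve your hit rate.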

5. Model Version Pinning

Never ship an undated alias like gpt-5.2 to production. Use gpt-5.2-2026-02-10 or another specific dated version. Pin the exact version. Same logic as pinning npm packages: you'd never deploy with "openai": "^4.0.0". Don't do it with your model either. Upgrade deliberately, with regression testing.

6. Output Validation + Post-Processing

Even with all the above, outputs can drift. Add a validation layer: parse the JSON (does it match the schema?), range-check values (confidence between 0 and 1?), normalize (strip whitespace, lowercase labels, round numbers). The more logic you move to deterministic post-processing, the more reproducible your pipeline.
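A sketch of such a layer for the article's sentiment shape; every line here is plain code, so it behaves identically on every run no matter what the model did:

```typescript
// Deterministic post-processing: normalize, range-check, validate.
function postProcess(result: { sentiment: string; confidence: number }) {
  const sentiment = result.sentiment.trim().toLowerCase();      // normalize label
  if (!["positive", "negative", "neutral"].includes(sentiment)) {
    throw new Error(`schema violation: ${sentiment}`);          // validate
  }
  const clamped = Math.min(1, Math.max(0, result.confidence));  // range-check to [0, 1]
  const confidence = Math.round(clamped * 100) / 100;           // round for stable comparisons
  return { sentiment, confidence };
}
```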

7. Eval Suites as Reproducibility Guards

Build a golden test set: 20–50 (input, expected_output) pairs. Run before every deploy. If outputs drift beyond tolerance, deploy fails. Use "functionally equivalent" matching — not string equality. Does the classification match? Is the extracted value correct? Rubric-based, not character-by-character. Tools in 2026: DeepEval for CI/CD test suites, W&B Weave for trace-rich evaluations, Langfuse for open-source tracing, LangSmith for LangChain ecosystems.
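A sketch of that gate, with the rubric reduced to "does the classification match" (function names and the tolerance threshold are illustrative; DeepEval and friends give you richer matchers):

```typescript
type Golden = { input: string; expected: { sentiment: string } };

// "Functionally equivalent": compare the decision, not the raw string.
function functionallyEqual(actual: { sentiment: string }, expected: { sentiment: string }): boolean {
  return actual.sentiment === expected.sentiment;
}

// Returns true if the pass rate clears the tolerance; wire this into CI
// so a failing golden set blocks the deploy.
function goldenSetPasses(
  goldens: Golden[],
  classify: (input: string) => { sentiment: string },
  tolerance = 0.95
): boolean {
  const passed = goldens.filter((g) => functionallyEqual(classify(g.input), g.expected)).length;
  return passed / goldens.length >= tolerance;
}
```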

The Distributed Systems Insight

Here's the "aha" that experienced engineers recognize immediately: LLM reliability is a distributed systems problem wearing an AI costume. The same failure modes you spent the last decade solving — partial failures, cascading latency, stale state, coordination bugs — are back.

Key insight from production teams in 2026: retry the system, not the model. If retrieval failed, retry retrieval. If a tool call timed out, retry the tool call. Don't re-run the entire prompt — each LLM invocation is genuinely a new execution with potentially different semantics. Isolate the probabilistic component and keep everything around it deterministic.
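As a sketch, the retry wrapper lives around the deterministic subsystem, and the model is called exactly once (`fetchRetrievalContext` and `callModelOnce` below are hypothetical names, not a real library API):

```typescript
// Generic retry for deterministic subsystems (retrieval, tool calls).
async function withRetries<T>(fn: () => Promise<T>, attempts = 3): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err; // safe to retry: same input, same expected result
    }
  }
  throw lastError;
}

// Usage shape: retry retrieval in isolation, then ONE model call.
//   const context = await withRetries(() => fetchRetrievalContext(query));
//   const answer = await callModelOnce(prompt, context);
// Never: withRetries(() => callModel(prompt)); each run is a new execution.
```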

Many so-called "hallucinations" are actually coordination failures: retrieval context didn't arrive in time, embeddings were stale, a tool response timed out and the model improvised, or conflicting instructions appeared because three teams edited system prompts independently. The model filled gaps because the system delivered an incomplete world.

When You Need What

Use Case                           | Need      | Stack
Chat assistant                     | Low       | Temperature 0.7+. Variety is a feature
Classification / labeling          | High      | Temperature=0, structured output, seed, cache
Data extraction (invoices, forms)  | Very high | Temperature=0, JSON schema, validation, cache
Automated decisions (approve/deny) | Critical  | All of the above + human-in-the-loop + audit trail

Rule: the higher the stakes, the more layers you stack. A chatbot needs one layer. A loan approval system needs all seven.

async function classifySentiment(text: string) {
  const cacheKey = hash({ text, model: "gpt-5.2-2026-02-10" });
  const cached = await cache.get(cacheKey);
  if (cached) return cached;                     // Layer 4: Cache hit = pure function

  const response = await openai.chat.completions.create({
    model: "gpt-5.2-2026-02-10",                 // Layer 5: Pinned version
    temperature: 0,                              // Layer 1: Deterministic
    seed: 42,                                    // Layer 2: Reproducible sampling
    response_format: {                           // Layer 3: Structured output
      type: "json_schema",
      json_schema: {
        name: "sentiment",
        schema: {
          type: "object",
          properties: {
            sentiment: { enum: ["positive", "negative", "neutral"] },
            confidence: { type: "number" }
          },
          required: ["sentiment", "confidence"]
        }
      }
    },
    messages: [
      { role: "system", content: "Classify sentiment. JSON only." },
      { role: "user", content: text }
    ]
  });

  const result = JSON.parse(response.choices[0].message.content!);
  result.confidence = Math.round(result.confidence * 100) / 100;  // Layer 6: Normalize

  if (!["positive", "negative", "neutral"].includes(result.sentiment)) {
    throw new Error("Schema violation — retry or fallback");      // Layer 6: Validate
  }

  await cache.set(cacheKey, result, { ttl: 3600 });
  return result;
}

Quick Check

Your AI invoice extractor returns different JSON shapes on re-runs. Your teammate says 'it's just GPU randomness, nothing we can do.' What's the real story?

Do This Next

  1. Audit one AI call in your codebase. What's the temperature? Is the model version pinned (gpt-5.2 vs gpt-5.2-2026-02-10)? Is the output validated? Fix the lowest-hanging fruit — often it's just setting temperature: 0.
  2. Add structured output to one extraction or classification endpoint. Define a JSON schema. Parse and validate. Measure: how often does the output match the schema on first try?
  3. Build a 10-question golden set for your most critical AI feature. Run it twice with the same inputs. Compare outputs. If they diverge, stack more layers: seed, cache, tighter schema. Use DeepEval or W&B Weave to automate this as a CI gate.