
Prompt Management in Production

5 min read

Platform

Prompts are config. Treat them like feature flags — versioned, rollbackable.

TPM

Prompt changes = product changes. Include in release notes and QA.

DevOps

Same CI/CD for prompts. Lint, test, deploy. No ad-hoc edits in production.

TL;DR

  • Prompts are code. Version them in Git, test them, deploy via your normal pipeline.
  • One typo can tank output quality. One "improvement" can double latency. Track changes.
  • Monitor: latency, token usage, error rate, and — if possible — output quality (sampling, heuristics).

"Just update the prompt" is how production breaks. Here's how to do it right.

Why Prompts Need Discipline

  • Fragility. "Summarize" vs "Summarise" can change behavior. Extra line breaks matter.
  • Drift. Someone edits the staging prompt. Forgets to sync prod. Two systems, different behavior.
  • Cost. Longer prompts = more input tokens = higher cost. Unchecked prompt growth bleeds money.
  • No rollback. If a prompt change degrades quality, how do you revert? Hope you have a backup?

Prompts as Code

Store prompts in your repo. Not in the LLM provider's dashboard. Not in a random doc.

prompts/
  summarize_ticket.yaml      # or .json, .md — pick one
  suggest_labels.yaml
  faq_answer.yaml

Each file: the prompt template + metadata (model, temperature, max_tokens).

Use a template engine (Jinja, Mustache) to inject variables: {{context}}, {{question}}.
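
A minimal loader along these lines, assuming PyYAML and Jinja2 and the file layout shown at the end of this page; load_prompt and render_prompt are illustrative names, not a specific library's API.

# prompt_loader.py (illustrative sketch, not a specific library)
from pathlib import Path

import yaml                                     # PyYAML
from jinja2 import Environment, StrictUndefined

_env = Environment(undefined=StrictUndefined)   # missing variables raise instead of rendering empty

def load_prompt(name: str, prompts_dir: str = "prompts") -> dict:
    """Read one prompt file (template + metadata) from the repo."""
    return yaml.safe_load(Path(prompts_dir, f"{name}.yaml").read_text())

def render_prompt(spec: dict, **variables) -> dict:
    """Render the user template and return messages plus model settings."""
    user = _env.from_string(spec["template"]).render(**variables)
    return {
        "model": spec["model"],
        "messages": [
            {"role": "system", "content": spec["system"]},
            {"role": "user", "content": user},
        ],
        "temperature": spec["temperature"],
        "max_tokens": spec["max_tokens"],
    }

render_prompt(load_prompt("summarize_ticket"), ticket_text=ticket) returns a dict you can pass straight to a chat-completions call.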

Versioning

  • Git. Every prompt change is a commit. You get history, diff, blame.
  • Tag releases. "v1.2 of summarize_ticket" = a specific commit.
  • Env-specific. Staging can point to main; prod to a tagged release. No surprises.
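
A small traceability sketch, assuming the deploy pipeline bakes the prompt release into an environment variable (PROMPTS_VERSION is an invented name): staging can carry main, prod carries the tag, and every call log records which one produced the output.

import os

# Assumed to be set at deploy time, e.g. PROMPTS_VERSION=v1.2 in prod, PROMPTS_VERSION=main in staging.
PROMPTS_VERSION = os.environ.get("PROMPTS_VERSION", "unknown")

def prompt_call_tags(prompt_name: str) -> dict:
    """Attach to every call's logs/metrics so any output traces back to a prompt commit."""
    return {"prompt": prompt_name, "prompts_version": PROMPTS_VERSION}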

Testing Prompts

  1. Unit test the template. Does it render? No undefined variables? No broken syntax? See the sketch after this list.
  2. Regression suite. Golden set: 10–50 (input, expected_output) pairs. Run before deploy. If outputs drift, flag it.
  3. A/B or shadow. New prompt runs in shadow; compare to current. Roll out only if metrics improve.
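
A sketch of point 1 with pytest, reusing the load_prompt/render_prompt helpers from above; StrictUndefined turns a missing variable into a failure instead of a silently empty string.

# tests/test_prompt_templates.py (pytest)
import pytest
from jinja2 import UndefinedError

from prompt_loader import load_prompt, render_prompt

def test_summarize_ticket_renders():
    spec = load_prompt("summarize_ticket")
    result = render_prompt(spec, ticket_text="Login fails with a 500 error since Tuesday.")
    assert "Login fails" in result["messages"][1]["content"]

def test_missing_variable_fails_loudly():
    with pytest.raises(UndefinedError):
        render_prompt(load_prompt("summarize_ticket"))  # ticket_text deliberately omitted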

Golden sets are imperfect — LLMs are non-deterministic. Use temperature=0 for tests, or relax "expected" to "contains key phrases" / "satisfies rubric."
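
A minimal golden-set check in that relaxed style, assuming a goldens/summarize_ticket.yaml file of {input, must_contain} entries (a layout invented for this sketch) and an OpenAI-style chat client.

# tests/test_golden_summarize.py (temperature forced to 0 for the run)
from pathlib import Path

import yaml
from openai import OpenAI

from prompt_loader import load_prompt, render_prompt

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def test_golden_set():
    spec = load_prompt("summarize_ticket")
    cases = yaml.safe_load(Path("goldens/summarize_ticket.yaml").read_text())
    for case in cases:
        request = render_prompt(spec, ticket_text=case["input"])
        request["temperature"] = 0  # best-effort determinism for tests
        text = client.chat.completions.create(**request).choices[0].message.content
        missing = [p for p in case["must_contain"] if p.lower() not in text.lower()]
        assert not missing, f"missing key phrases {missing} in: {text!r}"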

Monitoring

Metric                 | Why
Latency (p50, p99)     | Prompt length and model choice affect it. Spikes = problems.
Token usage (in/out)   | Cost control. Catch runaway prompts.
Error rate             | API failures, timeouts.
Output length          | Sudden change = prompt or model change.
Quality (sampling)     | Manual review of 1% of outputs. Expensive but catches subtle regressions.
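
A sketch that captures the first three rows per call, again assuming an OpenAI-style client; tracked_completion and the log format are illustrative, so wire this into whatever metrics stack you already run.

import logging
import time

from openai import OpenAI

log = logging.getLogger("llm.metrics")
client = OpenAI()

def tracked_completion(request: dict, prompt_name: str):
    """Wrap one LLM call and emit latency, token usage, and errors."""
    start = time.monotonic()
    try:
        response = client.chat.completions.create(**request)
    except Exception:
        log.exception("llm_call_failed prompt=%s", prompt_name)
        raise
    latency_ms = (time.monotonic() - start) * 1000
    usage = response.usage
    log.info("llm_call prompt=%s latency_ms=%.0f tokens_in=%d tokens_out=%d",
             prompt_name, latency_ms, usage.prompt_tokens, usage.completion_tokens)
    return response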

Rollback

If a prompt goes bad:

  • Revert the commit. Redeploy.
  • Or: feature flag to old prompt. Instant switch without full deploy.
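
One way to wire the feature-flag path, assuming you keep both prompt versions as files in the repo and read a flag from the environment (PROMPT_SUMMARIZE_VERSION and the _v1/_v2 filenames are invented for this sketch); a flag service works the same way.

import os

from prompt_loader import load_prompt

def active_summarize_prompt() -> dict:
    """Flip back to the previous prompt via a flag, without a redeploy."""
    # Assumes prompts/summarize_ticket_v1.yaml and prompts/summarize_ticket_v2.yaml both live in the repo.
    version = os.environ.get("PROMPT_SUMMARIZE_VERSION", "v2")
    return load_prompt(f"summarize_ticket_{version}")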

A complete prompt file, for reference:

# prompts/summarize_ticket.yaml
system: |
  You are a support ticket summarizer. Be concise. Output JSON only.
template: |
  Summarize this ticket in 2-3 sentences. Include: customer issue, priority signal.
  Ticket: {{ ticket_text }}
variables: [ticket_text]
model: gpt-4o-mini
temperature: 0
max_tokens: 200

Quick Check

A prompt change ships. Latency doubles. What should you have had in place?

Do This Next

  1. Audit. Where are your prompts today? Editor? Dashboard? Scattered in code?
  2. Extract one into a file. Add variables. Render it from code. Does it work?
  3. Add one golden test. Single (input, expected) pair. Run it in CI. Make it green.