
Prompt Management in Production

5 min read

Platform

Prompts are config. Treat them like feature flags — versioned, rollbackable.

TPM

Prompt changes = product changes. Include in release notes and QA.

DevOps

Same CI/CD for prompts. Lint, test, deploy. No ad-hoc edits in production.

TL;DR

  • Prompts are code. Version them in Git, test them, deploy via your normal pipeline.
  • One typo can tank output quality. One "improvement" can double latency. Track changes.
  • Monitor: latency, token usage, error rate, and — if possible — output quality (sampling, heuristics).

"Just update the prompt" is how production breaks. Here's how to do it right.

Why Prompts Need Discipline

  • Fragility. "Summarize" vs "Summarise" can change behavior. Extra line breaks matter.
  • Drift. Someone edits the staging prompt. Forgets to sync prod. Two systems, different behavior.
  • Cost. Longer prompts = more input tokens = higher cost. Unchecked prompt growth bleeds money.
  • No rollback. If a prompt change degrades quality, how do you revert? Hope you have a backup?

Prompts as Code

Store prompts in your repo. Not in the LLM provider's dashboard. Not in a random doc.

prompts/
  summarize_ticket.yaml      # or .json, .md — pick one
  suggest_labels.yaml
  faq_answer.yaml

Each file: the prompt template + metadata (model, temperature, max_tokens).

Use a template engine (Jinja, Mustache) to inject variables: {{context}}, {{question}}.
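
A minimal loader along these lines, assuming PyYAML and Jinja2 and the file layout shown at the end of this page; load_prompt and render_prompt are illustrative names, not a specific library's API.

# prompt_loader.py (illustrative sketch, not a specific library)
from pathlib import Path

import yaml                                     # PyYAML
from jinja2 import Environment, StrictUndefined

_env = Environment(undefined=StrictUndefined)   # missing variables raise instead of rendering empty

def load_prompt(name: str, prompts_dir: str = "prompts") -> dict:
    """Read one prompt file (template + metadata) from the repo."""
    return yaml.safe_load(Path(prompts_dir, f"{name}.yaml").read_text())

def render_prompt(spec: dict, **variables) -> dict:
    """Render the user template and return messages plus model settings."""
    user = _env.from_string(spec["template"]).render(**variables)
    return {
        "model": spec["model"],
        "messages": [
            {"role": "system", "content": spec["system"]},
            {"role": "user", "content": user},
        ],
        "temperature": spec["temperature"],
        "max_tokens": spec["max_tokens"],
    }

render_prompt(load_prompt("summarize_ticket"), ticket_text=ticket) returns a dict you can pass straight to a chat-completions call.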

Versioning

  • Git. Every prompt change is a commit. You get history, diff, blame.
  • Tag releases. "v1.2 of summarize_ticket" = a specific commit.
  • Env-specific. Staging can point to main; prod to a tagged release. No surprises.
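
A small traceability sketch, assuming the deploy pipeline bakes the prompt release into an environment variable (PROMPTS_VERSION is an invented name): staging can carry main, prod carries the tag, and every call log records which one produced the output.

import os

# Assumed to be set at deploy time, e.g. PROMPTS_VERSION=v1.2 in prod, PROMPTS_VERSION=main in staging.
PROMPTS_VERSION = os.environ.get("PROMPTS_VERSION", "unknown")

def prompt_call_tags(prompt_name: str) -> dict:
    """Attach to every call's logs/metrics so any output traces back to a prompt commit."""
    return {"prompt": prompt_name, "prompts_version": PROMPTS_VERSION}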

Testing Prompts

  1. Unit test the template. Does it render? No undefined variables? No broken syntax? See the sketch after this list.
  2. Regression suite. Golden set: 10–50 (input, expected_output) pairs. Run before deploy. If outputs drift, flag it.
  3. A/B or shadow. New prompt runs in shadow; compare to current. Roll out only if metrics improve.
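
A sketch of point 1 with pytest, reusing the load_prompt/render_prompt helpers from above; StrictUndefined turns a missing variable into a failure instead of a silently empty string.

# tests/test_prompt_templates.py (pytest)
import pytest
from jinja2 import UndefinedError

from prompt_loader import load_prompt, render_prompt

def test_summarize_ticket_renders():
    spec = load_prompt("summarize_ticket")
    result = render_prompt(spec, ticket_text="Login fails with a 500 error since Tuesday.")
    assert "Login fails" in result["messages"][1]["content"]

def test_missing_variable_fails_loudly():
    with pytest.raises(UndefinedError):
        render_prompt(load_prompt("summarize_ticket"))  # ticket_text deliberately omitted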

Golden sets are imperfect — LLMs are non-deterministic. Use temperature=0 for tests, or relax "expected" to "contains key phrases" / "satisfies rubric."
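
A minimal golden-set check in that relaxed style, assuming a goldens/summarize_ticket.yaml file of {input, must_contain} entries (a layout invented for this sketch) and an OpenAI-style chat client.

# tests/test_golden_summarize.py (temperature forced to 0 for the run)
from pathlib import Path

import yaml
from openai import OpenAI

from prompt_loader import load_prompt, render_prompt

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def test_golden_set():
    spec = load_prompt("summarize_ticket")
    cases = yaml.safe_load(Path("goldens/summarize_ticket.yaml").read_text())
    for case in cases:
        request = render_prompt(spec, ticket_text=case["input"])
        request["temperature"] = 0  # best-effort determinism for tests
        text = client.chat.completions.create(**request).choices[0].message.content
        missing = [p for p in case["must_contain"] if p.lower() not in text.lower()]
        assert not missing, f"missing key phrases {missing} in: {text!r}"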

Monitoring

Metric                 | Why
Latency (p50, p99)     | Prompt length and model choice affect it. Spikes = problems.
Token usage (in/out)   | Cost control. Catch runaway prompts.
Error rate             | API failures, timeouts.
Output length          | Sudden change = prompt or model change.
Quality (sampling)     | Manual review of 1% of outputs. Expensive but catches subtle regressions.
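
A sketch that captures the first three rows per call, again assuming an OpenAI-style client; tracked_completion and the log format are illustrative, so wire this into whatever metrics stack you already run.

import logging
import time

from openai import OpenAI

log = logging.getLogger("llm.metrics")
client = OpenAI()

def tracked_completion(request: dict, prompt_name: str):
    """Wrap one LLM call and emit latency, token usage, and errors."""
    start = time.monotonic()
    try:
        response = client.chat.completions.create(**request)
    except Exception:
        log.exception("llm_call_failed prompt=%s", prompt_name)
        raise
    latency_ms = (time.monotonic() - start) * 1000
    usage = response.usage
    log.info("llm_call prompt=%s latency_ms=%.0f tokens_in=%d tokens_out=%d",
             prompt_name, latency_ms, usage.prompt_tokens, usage.completion_tokens)
    return response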

Rollback

If a prompt goes bad:

  • Revert the commit. Redeploy.
  • Or: feature flag to old prompt. Instant switch without full deploy.
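
One way to wire the feature-flag path, assuming you keep both prompt versions as files in the repo and read a flag from the environment (PROMPT_SUMMARIZE_VERSION and the _v1/_v2 filenames are invented for this sketch); a flag service works the same way.

import os

from prompt_loader import load_prompt

def active_summarize_prompt() -> dict:
    """Flip back to the previous prompt via a flag, without a redeploy."""
    # Assumes prompts/summarize_ticket_v1.yaml and prompts/summarize_ticket_v2.yaml both live in the repo.
    version = os.environ.get("PROMPT_SUMMARIZE_VERSION", "v2")
    return load_prompt(f"summarize_ticket_{version}")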

A complete prompt file, for reference:

# prompts/summarize_ticket.yaml
system: |
  You are a support ticket summarizer. Be concise. Output JSON only.
template: |
  Summarize this ticket in 2-3 sentences. Include: customer issue, priority signal.
  Ticket: {{ ticket_text }}
variables: [ticket_text]
model: gpt-4o-mini
temperature: 0
max_tokens: 200

Quick Check

A prompt change ships. Latency doubles. What should you have had in place?

Do This Next

  1. Audit. Where are your prompts today? Editor? Dashboard? Scattered in code?
  2. Extract one into a file. Add variables. Render it from code. Does it work?
  3. Add one golden test. Single (input, expected) pair. Run it in CI. Make it green.