
Fine-Tuning vs. Training in the Foundation Model Era


TL;DR

  • Most problems: prompt or fine-tune. Training from scratch is for edge cases.
  • Fine-tuning is faster and cheaper. You need enough quality data and a clear task.
  • Training from scratch: when the foundation doesn't fit (domain, modality, constraints) and you have the resources.

The old playbook was "collect data, train a model." The new one is "try the foundation model first, then fine-tune if needed, train only if necessary." Most teams never get past step 2.

The Decision Tree

Prompt first:

  • Can you get 80% there with a good prompt? Try it. No training. No infra. Iterate fast.
  • Use for: classification, extraction, summarization, Q&A. Often enough.
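
A minimal sketch of the prompt-only path, assuming the OpenAI Python client and a made-up ticket-labeling task (the model name and labels are illustrative; any hosted or local chat model works the same way):

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    LABELS = ["billing", "bug", "feature_request"]  # hypothetical label set

    def classify(ticket: str) -> str:
        # No training, no infra: the task definition lives entirely in the prompt.
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # illustrative; swap for whatever you use
            messages=[
                {"role": "system", "content":
                    "Classify the support ticket. "
                    f"Reply with exactly one of: {', '.join(LABELS)}."},
                {"role": "user", "content": ticket},
            ],
        )
        return response.choices[0].message.content.strip()

    print(classify("I was charged twice this month."))  # -> billing (ideally)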

Fine-tune when:

  • Prompting isn't good enough. You have 100s–1000s of labeled examples. The task is consistent.
  • Fine-tuning adapts the foundation to your domain. Fewer examples than training from scratch. Faster iteration.

Train from scratch when:

  • No foundation model fits (exotic modality, strict latency budgets, on-device deployment, regulatory constraints).
  • You have large-scale data and the budget for compute and expertise.
  • Rare. Most teams shouldn't start here.
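
The whole tree fits in a few lines. A toy sketch; the thresholds (80% prompt accuracy, 100 examples) are this post's illustrative defaults, not canonical numbers:

    def choose_approach(prompt_accuracy: float,
                        labeled_examples: int,
                        foundation_fits: bool,
                        large_data_and_budget: bool) -> str:
        # Illustrative thresholds; set them to your own quality bar.
        if foundation_fits and prompt_accuracy >= 0.80:
            return "prompt"              # good enough with zero training
        if foundation_fits and labeled_examples >= 100:
            return "fine-tune"           # adapt the foundation to your task
        if not foundation_fits and large_data_and_budget:
            return "train from scratch"  # the rare case
        return "improve the prompt or collect more data"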

Fine-Tuning: What You Need

Data:

  • Quality > quantity. 500 good examples beat 50K noisy ones.
  • Format: input-output pairs. Consistent task definition.
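
A sketch of what "input-output pairs" usually looks like on disk: one JSON object per line, same task definition throughout (field names vary by provider; these are illustrative):

    import json

    # Hypothetical examples: one consistent task, one consistent label vocabulary.
    examples = [
        {"input": "I was charged twice this month.", "output": "billing"},
        {"input": "The export button crashes the app.", "output": "bug"},
        {"input": "Please add dark mode.", "output": "feature_request"},
    ]

    with open("train.jsonl", "w") as f:
        for ex in examples:
            f.write(json.dumps(ex) + "\n")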

Compute:

  • LoRA and QLoRA cut cost by training small adapter weights instead of the full model (see the sketch after this list). Cheaper than training from scratch. Still not free.
  • Cloud or on-prem. You provision.
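
A minimal LoRA setup with Hugging Face's peft; the base checkpoint and hyperparameters are illustrative, and the training loop itself (standard Trainer or your own) is omitted:

    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    # Illustrative checkpoint; use any open-weight model you can host.
    base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")

    # LoRA trains small low-rank adapter matrices instead of the full model,
    # which is what keeps memory and cost down.
    config = LoraConfig(
        r=8,                                  # adapter rank
        lora_alpha=16,                        # adapter scaling factor
        target_modules=["q_proj", "v_proj"],  # attention projections to adapt
        lora_dropout=0.05,
        task_type="CAUSAL_LM",
    )

    model = get_peft_model(base, config)
    model.print_trainable_parameters()  # typically well under 1% of the base weights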

Evaluation:

  • How do you know it worked? Holdout set, production metrics. Define before you tune.
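
A sketch of "define before you tune": freeze a holdout file and score every variant with the same function (classify is the prompt-only function above; classify_finetuned is a hypothetical stand-in for your tuned model):

    import json

    def accuracy(predict, holdout) -> float:
        # Same metric, same data, for every variant you compare.
        correct = sum(predict(ex["input"]) == ex["output"] for ex in holdout)
        return correct / len(holdout)

    with open("holdout.jsonl") as f:
        holdout = [json.loads(line) for line in f]

    print("prompt-only:", accuracy(classify, holdout))
    print("fine-tuned: ", accuracy(classify_finetuned, holdout))  # hypothetical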

The Pitfalls

  • Overfitting — Small data + aggressive tuning = a model that memorizes. Use a validation set and early stopping (sketch after this list).
  • Catastrophic forgetting — Fine-tuning can hurt general ability. Monitor both task performance and base capabilities.
  • Vendor lock-in — Fine-tuning via API ties you to a provider. Fine-tune open weights if portability matters.
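
For the overfitting row, transformers has early stopping built in. A sketch; model, train_ds, and val_ds are assumed to come from your own setup:

    from transformers import Trainer, TrainingArguments, EarlyStoppingCallback

    args = TrainingArguments(
        output_dir="out",
        eval_strategy="epoch",           # score the validation split every epoch
        save_strategy="epoch",
        load_best_model_at_end=True,     # required for early stopping
        metric_for_best_model="eval_loss",
    )

    trainer = Trainer(
        model=model,             # assumed: your (LoRA-wrapped) model
        args=args,
        train_dataset=train_ds,  # assumed: tokenized training split
        eval_dataset=val_ds,     # assumed: tokenized validation split
        callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
    )
    trainer.train()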

The old workflow: collect data, train from scratch, spend months of compute, get one model per task. Foundation-model development replaces most of that with prompting and fine-tuning.

Quick Check

You need a model for a domain-specific classification task. What's the right order? Prompt first; fine-tune if prompting falls short and you have labeled examples; train from scratch only if no foundation model fits.

Do This Next

  1. Map your use cases — For each: prompt-only, fine-tune, or train? Document the decision and the threshold.
  2. Run one fine-tuning experiment — Pick a task. Get 100+ examples. Fine-tune a small model. Compare to prompt-only. Document the lift.
  3. Define your "train from scratch" criteria — When would you actually do it? List the conditions. Writing them down makes the default (don't) explicit.