[BEE-30005] Prompt Engineering vs RAG vs Fine-Tuning

INFO

There are three ways to make an LLM do what you want — prompt engineering, retrieval-augmented generation, and fine-tuning — and choosing the wrong one wastes weeks of work and tens of thousands of dollars. The right decision depends on what kind of problem you are actually solving.

Context

When GPT-3 was released in 2020, the only customization mechanism available to application developers was prompt engineering: carefully crafted instructions and examples passed in the input. RAG emerged as a pattern in 2020, formalized in the Lewis et al. paper "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" (arXiv:2005.11401, NeurIPS 2020), which showed that retrieving relevant documents at inference time dramatically improved factual accuracy on open-domain question answering without retraining the model.

Fine-tuning existed long before LLMs but became practically inaccessible for most organizations when models scaled to billions of parameters — the compute cost of updating all weights was prohibitive. Two papers changed this. Edward Hu et al.'s LoRA: Low-Rank Adaptation of Large Language Models (arXiv:2106.09685, 2021) showed that fine-tuning could be accomplished by injecting a pair of small low-rank matrices into each Transformer attention layer, reducing trainable parameters by 10,000× with no inference latency penalty. Tim Dettmers et al.'s QLoRA: Efficient Finetuning of Quantized LLMs (arXiv:2305.14314, 2023) added 4-bit quantization to LoRA, making it possible to fine-tune a 65 billion parameter model on a single consumer GPU.

These developments created three legitimate options with very different cost, complexity, and applicability profiles. The mistake practitioners make is reaching for fine-tuning first — the most expensive and time-consuming option — before exhausting the simpler ones. The correct default order is: prompt engineering first, RAG if knowledge is the bottleneck, fine-tuning only when consistent behavior at scale cannot be achieved otherwise.

Design Thinking

Each customization strategy addresses a different root cause of poor LLM performance:

Strategy	Problem it solves	What it changes
Prompt engineering	Model doesn't understand the task or format	The input to the model
RAG	Model lacks knowledge (outdated or proprietary)	The context available at inference time
Fine-tuning	Model behavior is inconsistent at scale	The model's weights

A model that produces the wrong output format is a prompt engineering problem. A model that gives outdated information or lacks proprietary domain knowledge is a RAG problem. A model that sometimes follows the task correctly and sometimes doesn't — after prompting is exhausted — is a fine-tuning candidate.

Prompt engineering is always the starting point because it requires no infrastructure, no training data, and no compute — only a well-specified task. 40–70% of LLM problems are solved at this stage.

Best Practices

Exhaust Prompt Engineering Before Anything Else

MUST attempt prompt engineering first. Spend at least two to five days iterating on the prompt before concluding that the model cannot do the task through instruction alone.

Effective techniques, in order of complexity:

Zero-shot: Describe the task precisely with output format specification
Few-shot: Provide three to five high-quality input-output examples; examples outweigh instructions
Chain-of-thought: Add "Think step by step" or show reasoning in the examples — this alone often resolves failures on reasoning tasks
Structured output constraints: Force JSON mode or specify an exact schema to eliminate format inconsistency

Prompt engineering fails when the model lacks the underlying knowledge (not solvable by instruction), when the task requires hundreds of examples that exceed the context window, or when the output must be perfectly consistent across 99.9% of requests.

SHOULD measure baseline performance on a representative test set before concluding prompt engineering is insufficient. A prompt that works for obvious inputs may fail on edge cases — collect failure examples to understand what is actually wrong before choosing the next step.

Add RAG When Knowledge Is the Bottleneck

SHOULD choose RAG over fine-tuning when the problem is that the model lacks information, not that its behavior is wrong. RAG is the right choice when:

The model needs access to information created after its training cutoff
The model needs access to proprietary or confidential data not in its training corpus
Responses must be traceable to specific source documents
The knowledge base changes frequently and retraining would be prohibitively expensive

MUST NOT use RAG as a substitute for prompt engineering for behavior problems. A model that incorrectly formats JSON output will continue to incorrectly format JSON even with retrieval augmentation — the problem is behavioral, not informational.

RAG failure modes: noisy retrieval (wrong chunks returned), missing retrieval (relevant document not in the corpus), and retrieval-generation mismatch (retrieved documents confuse rather than help the model). Diagnose retrieval and generation failures separately — the RAGAS framework (BEE-30004) measures each independently.

Fine-Tune When Behavior Must Be Consistent at Scale

SHOULD consider fine-tuning when prompt engineering and RAG have been exhausted, you have at least 100 high-quality labeled examples, and you have measured a concrete performance gap between the current best approach and the target.

Fine-tuning is the right choice when:

Output format or style must be consistent and prompt-level constraints produce occasional failures
Domain-specific terminology is required that the model does not reliably produce
Reducing inference cost at high volume matters — baking behaviors into weights reduces prompt length and therefore per-request cost
The task requires more examples than fit in the context window

MUST NOT fine-tune on fewer than 50 examples. Below this threshold, the model overfits to the training data and performs poorly on unseen inputs. 200 high-quality, diverse examples typically outperform 2,000 noisy ones.

Fine-tuning failure modes: catastrophic forgetting (model loses general capability), overfitting (excellent training accuracy, poor test accuracy), and distribution shift (fine-tuned model performs well on training distribution but fails on real production inputs). LoRA and QLoRA substantially reduce catastrophic forgetting by freezing 99.99% of model weights.

Use Parameter-Efficient Fine-Tuning (LoRA / QLoRA)

SHOULD use LoRA or QLoRA for any fine-tuning task rather than full fine-tuning. Full fine-tuning updates all model weights, requires 100+ GB of GPU memory for 7B parameter models, and risks catastrophic forgetting. LoRA injects trainable low-rank matrices into attention layers, reducing trainable parameters by 10,000× with no inference latency penalty. QLoRA adds 4-bit quantization to LoRA, enabling fine-tuning of 65B models on a single 48 GB GPU.

LoRA: Woriginal (frozen) + AB (trainable, rank r << d)
      where A ∈ R^{d×r}, B ∈ R^{r×d}, r typically 8–64

Trainable parameters: 2 × d × r per attention layer
vs. Full fine-tuning: d × d per attention layer

For most teams without GPU infrastructure, SHOULD use a managed fine-tuning service rather than operating GPU infrastructure:

Service	Models supported	Notes
OpenAI Fine-tuning API	GPT-4o mini, GPT-4o	Simplest; $25/M training tokens
Google Vertex AI	Gemini 2.5 Pro/Flash	Multimodal fine-tuning
Amazon Bedrock	Claude 3 Haiku	For Anthropic models
Together AI	Llama, Mistral, Qwen	Open-source models, managed

Evaluate Fine-Tuning by Comparing Against a Baseline

MUST compare fine-tuned model performance against the best available baseline (base model + best prompt, with or without RAG) on a held-out test set — not on the training set. Measuring improvement only on training data confirms overfitting, not learning.

A rigorous comparison:

Set A: Base model + best prompt
Set B: Base model + best prompt + RAG
Set C: Fine-tuned model + minimal prompt
Set D: Fine-tuned model + RAG

Deploy only if Set C or Set D materially outperforms the best available baseline
on the held-out test set for your target metric.

A fine-tuning project is justified when it produces a measurable, statistically significant improvement (typically 5% or more) on the metric that matters for the application.

Combine Strategies When Each Addresses a Different Problem

MAY combine prompt engineering, RAG, and fine-tuning when each addresses a genuinely distinct problem. Fine-tune for consistent behavior and output format; add RAG for real-time knowledge; use the prompt for task framing. This combination — sometimes called RAFT (Retrieval-Augmented Fine-Tuning) — is appropriate for high-volume, high-stakes applications where the investment is justified.

Do not combine strategies to compensate for a weak implementation of one. A poorly performing RAG pipeline plus a poorly fine-tuned model produces a poorly performing combined system. Fix each component before combining.

Visual

BEE-30001 -- LLM API Integration Patterns: token cost management, streaming, and semantic caching apply regardless of customization strategy
BEE-30004 -- Evaluating and Testing LLM Applications: the RAGAS metrics, golden datasets, and LLM-as-judge patterns are the evaluation tools for measuring whether each customization strategy succeeded
BEE-17004 -- Vector Search and Semantic Search: the retrieval component of RAG is a vector search problem; chunking, embedding, and top-k retrieval patterns are covered there
BEE-9001 -- Caching Fundamentals: fine-tuned model responses are more predictable and thus more cache-friendly; semantic caching applies to both RAG and non-RAG LLM calls

[BEE-30005] Prompt Engineering vs RAG vs Fine-Tuning ​

Context ​

Design Thinking ​

Best Practices ​

Exhaust Prompt Engineering Before Anything Else ​

Add RAG When Knowledge Is the Bottleneck ​

Fine-Tune When Behavior Must Be Consistent at Scale ​

Use Parameter-Efficient Fine-Tuning (LoRA / QLoRA) ​

Evaluate Fine-Tuning by Comparing Against a Baseline ​

Combine Strategies When Each Addresses a Different Problem ​

Visual ​

Related BEEs ​

References ​