In general there are currently three methods to get better outputs from LLMs.

  1. Prompt engineering
  2. Fine-tuning
  3. Retrieval-Augmented Generation (RAG)

Prompt engineering is the best place to start. It will help you identify the limitations you're encountering and which of the other methods you can use as solutions.

For example, does the model need more context? If so, RAG would be a good method to implement. Does it feel like the model isn’t consistently following the instructions? Then fine-tuning might help. OpenAI has a detailed video on this optimization process, check it out here.

a flow chart starting at prompt engineering and going to RAG and fine-tuning after a testing and evaluating phase
A typical approach when starting to work with LLMs

So when it comes to fine-tuning vs prompt engineering, we don’t believe it should be an either/or decision, and certain situations call for a mix of both.

Let's dive deep into both methods, including the latest research.

Understanding the methods

Let’s set our foundation with some quick definitions and examples.

Fine-tuning

Fine-tuning involves adapting a pre-trained model, like OpenAI’s GPT-3.5, by training it further on new data to enhance its understanding and response capabilities regarding nuances not initially covered. There are tools to help you do this, like Entry Point and OpenAI’s tools.

For example, in healthcare, fine-tuning an LLM on medical records could enable the model to generate personalized treatment plans by recognizing subtle differences in patient data.

Prompt engineering

Prompt engineering involves writing tailored inputs to get desired outputs from LLMs. It doesn’t require changing the model’s underlying structure, but involves crafting prompts that guide the model’s response accurately.

There are many prompt engineering methods, like few-shot prompting, analogical prompting, chain of density, and more. Each method tends to help combat one or a few limitations in present-day LLMs.

Fine-tuning vs prompt engineering

It really isn’t an either/or question. Both methods aim to get better outputs from LLMs in different ways. Fine-tuning is more resource-heavy, involving time and energy to generate and clean data to further train the model. But, fine-tuning does a great job in delivering highly accurate and context-aware outputs, and can also decrease cost through fewer tokens needed in your prompt.

On the other hand, prompt engineering is much more agile and adaptable, allowing for a faster start, but it may sometimes sacrifice the depth of customization that fine-tuning provides.

Here is a more visual way to think about the various methods.

Venn digram comparing prompt engineering, RAG, and Fine-tuning
Credit: Mark Hennings

Let’s take a look at two papers that put the question of fine-tuning vs. prompt engineering head-to-head.

Case studies: Medical application (MedPrompt)

Back in November 2023, a paper was released by Microsoft: Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine.

The conventional wisdom at the time was that healthcare was a great domain for fine-tuning because it requires specialized knowledge and deals with complex data that varies patient to patient.

In this paper, Microsoft researchers designed an experiment to compare the performance of a fine-tuned model against a foundational model equipped with a prompt engineering framework.

The foundational model, GPT-4, utilized Microsoft's MedPrompt framework and the fine-tuned model, Med-PaLM 2, developed by Google, was tailored specifically for the medical domain.

For more info on the framework and the experiments, check out our in-depth analysis here: Prompt Engineering vs. Fine-Tuning: The MedPrompt Breakthrough.

GPT-4, using the MedPrompt framework, was able to achieve state-of-the-art performance on every one of the nine benchmarks, outperforming Med-PaLM 2 by up to 12 absolute percentage points.

a table of results comparing the fine-tuned Med-PaLM 2 model and GPT 4 with and without Medprompt
Note the gains from Medprompt alone when compared to vanilla GPT-4

A graph of test accuracy versus various models, fine-tuned and not, over time
Does the value of fine-tuning decrease over time?

Additionally, Medprompt was tested across various datasets and performed well, proving that the prompt engineering framework can be effective across different domains.

So if Google’s fine-tuned model can be bested by a general model, has prompt engineering emerged victorious in the battle of fine-tuning vs prompt engineering? Can fine-tuning still provide value?

I believe the answer is yes, although not in situations like this one, where proprietary data is used to teach the model new things. Foundational models are so advanced and have processed so much data that even if your data contains unique nuances, the models likely have enough information to generalize and extrapolate. For a much better breakdown of this question, I’d recommend listening to this recent episode from the Ben and Marc show.

Fine-tuning is extremely helpful in demonstrating to the model how outputs should sound (tone) and be structured. It also enables you to save on prompt tokens, as you won’t need to provide as detailed instructions or examples in your prompt.

What wasn’t tested was using the fine-tuned model and the Medprompt framework. I would’ve really liked to see how that model and method combo would’ve performed.

Case studies: Code review

Just in the past month (May 2024), there was a paper out of a university in Australia that pitted fine-tuning against prompt engineering: Fine-Tuning and Prompt Engineering for Large Language Models-based Code Review Automation. Since we haven’t covered this paper in previous posts, we’ll do a little bit of a dive before getting into the results.

Experiment setup

Objective: The experiment aimed to evaluate the performance of LLMs on a code review task, comparing the effectiveness of fine-tuning versus prompt engineering. Two LLMs, GPT-3.5 and Magicoder, were tested using a few different methods.

Methods tested:

  1. Fine-Tuning: Adapting the models to the code review context by further training
  2. Prompt Engineering: Zero-shot learning, few-shot learning, and using personas

Datasets:

The datasets are comprised of code from GitHub, mirroring real-world scenarios.

Evaluation metrics:

  • Exact Match (EM): This metric assessed how closely the model-generated code matches the actual revised code in the testing datasets. For evaluation, both the generated and actual code were tokenized and compared at the token level.

Parameter settings:

  • GPT-3.5: Temperature of 0.0, top_p of 1.0, max length of 512. Fine-tuning parameters such as number of epochs and learning rate were set as per OpenAI's API defaults. For more info on all the OpenAI parameters, check out our guide here.
  • Magicoder: Utilized the same hyper-parameter settings as GPT-3.5 for direct comparison.

A zero-shot prompt template for a code improvement task
A zero-shot learning prompt with a persona

A few shot prompt template for a code improvement task
A few shot learning prompt

Fine-tuning details

  • Selection of Training Examples: Due to the high costs associated with using the entire training set, the researchers selected a random subset of examples for fine-tuning. Specifically, approximately 6% of the training examples were used to fine-tune the models at a total cost of approximately $40.00.
  • Data Composition: The training set included both the code submitted for review and the corresponding revised code, allowing the models to learn from real changes.

Experiment results

Let’s look at the results!

a table of results showing the performance of various prompt engineering methods versus fine-tuned models

Main takeaways

  • Fine-tuning GPT-3.5 with zero-shot learning outperformed all prompt engineering methods, achieving a 63.91% - 1,100% higher Exact Match (EM) than non-fine-tuned models.
  • Few-shot learning was the highest performing prompt engineering method, achieving 46.38% - 659.09% higher EM than GPT-3.5 with zero-shot learning. This comes as no surprise if you’ve read our few shot prompting guide
  • Including a persona ("pretend you’re an expert software developer in <lang>") led to worse performing prompts. GPT-3.5 achieved 1.02% - 54.17% lower EM when a persona was included.
  • The persona prompts likely introduced biases or sent the model down a wrong path that did not match the actual criteria required for optimizing and reviewing code

Wrapping up

As mentioned at the beginning of this post, prompt engineering and fine-tuning are not mutually exclusive strategies, but I hope this post gives you more insight into both methods. Each method has its own use case and is dependent on the task at hand. Since prompt engineering requires less initial effort—only needing access to a tool like PromptHub to start exploring—it's typically the best starting point.

Dan Cleary
Founder