There has been a lot of buzz about MedPrompt recently. MedPrompt is a prompt engineering framework researched and developed by Microsoft in a paper released in late November: Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine.

It’s powerful and has been in the spotlight because it was the method Microsoft used to counter Google’s claim that Gemini Ultra could outperform GPT-4.

The central question around MedPrompt is: Can a powerful generalist model (GPT-4) combined with effective prompt engineering really outperform expensive, fine-tuned models trained on billions of pieces of domain-specific data? Let's find out!

What is MedPrompt?

Developed by Microsoft, MedPrompt is a prompt engineering framework that leverages multiple components to achieve results.

The three main components are: dynamic few-shot example selection, auto-generated Chain-of-Thought (CoT), and a choice-shuffle ensemble. We’ll dive into each in the next section.

While initially developed to test against medical benchmarks, MedPrompt can be applied to any domain, and has modular components that are easy to implement for any team.

How does MedPrompt work?

MedPrompt combines a few well-known prompt engineering design principles: few-shot learning, self-generated chain-of-thought steps, and a choice-shuffle ensemble.

When faced with a task, the model runs through a basic algorithm built from those three major parts. Here’s a quick overview, and then we’ll dive into each section.

MedPrompt steps represented as an algorithm

MedPrompt framework

MedPrompt consists of the three components mentioned above and two major stages: a preprocessing phase and an inference step.
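The two stages can be sketched end to end. Everything below is a toy stand-in (my own naming, with word-overlap "embeddings" in place of text-embedding-ada-002) so the flow itself runs:

```python
def embed(text):
    # Toy stand-in for text-embedding-ada-002: represent a question as its
    # set of lowercased words, so "similarity" is just word overlap.
    return {w.strip("?.,").lower() for w in text.split()}

def preprocess(training_items):
    """Stage 1 (offline): embed each training question and store it next to
    its self-generated, answer-verified chain of thought."""
    return [
        {"question": q, "embedding": embed(q), "cot": cot, "answer": a}
        for q, cot, a in training_items
    ]

def answer(test_question, store, k=2):
    """Stage 2 (inference): retrieve the k most similar training exemplars.
    A real system would then build the few-shot prompt from them and run
    the choice-shuffle ensemble; here we just return the exemplars."""
    q_vec = embed(test_question)
    return sorted(store, key=lambda ex: -len(ex["embedding"] & q_vec))[:k]

store = preprocess([
    ("What causes scurvy?", "Scurvy follows vitamin C deficiency...", "C"),
    ("What causes rickets?", "Rickets follows vitamin D deficiency...", "D"),
])
best = answer("Which vitamin deficiency causes scurvy?", store, k=1)
print(best[0]["question"])  # What causes scurvy?
```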

In-Context Learning (ICL) / Few-shot prompting

Few-shot prompting involves providing the model with a few example question/answer pairs to help it solve the problem. It is arguably the most effective method for in-context learning.

The few-shot examples in the first version of MedPrompt were fixed. They used the same examples regardless of the question at hand. To cover the most ground, the examples spanned a broad range of information.

The examples in the second version of MedPrompt were dynamically selected, based on the question. This ensures that the examples chosen are the most relevant to the question at hand. This is done via a few steps:

  1. Using text-embedding-ada-002, the researchers created vector representations of the training and test questions.
  2. They then searched the embedding space to match each test question with the most semantically similar questions from the training set.

This all occurs in the preprocessing phase.
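The retrieval step can be sketched with plain cosine-similarity k-nearest-neighbors. The vectors here are hard-coded stand-ins for what text-embedding-ada-002 would return, and the function name is my own, not from the paper:

```python
import numpy as np

def top_k_similar(query_vec, train_vecs, k=5):
    """Return indices of the k training questions closest to the query,
    by cosine similarity in the embedding space."""
    q = query_vec / np.linalg.norm(query_vec)
    t = train_vecs / np.linalg.norm(train_vecs, axis=1, keepdims=True)
    sims = t @ q                      # cosine similarity to each training vector
    return np.argsort(sims)[::-1][:k]  # highest similarity first

# Toy stand-in vectors; in practice these come from the embedding model.
train_vecs = np.array([
    [1.0, 0.0],    # training question 0
    [0.0, 1.0],    # training question 1
    [0.9, 0.1],    # training question 2
])
query = np.array([1.0, 0.05])

print(top_k_similar(query, train_vecs, k=2))
```

The selected indices then determine which question/answer pairs get placed into the few-shot portion of the prompt.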

For more on embeddings, check out our intro article here: A Beginner's Guide on Embeddings and Their Impact on Prompts.

Going from fixed few-shot to dynamic few-shot led to a 0.8% accuracy increase on the MedQA dataset. Given that dynamic selection of examples requires some extra technical work, leaving it out of your prompt engineering stack isn’t the end of the world.

Auto-Chain-of-Thought (CoT)

Next in the prompt engineering pipeline is a Chain-of-Thought step.

CoT uses natural language to encourage the model to generate a series of reasoning steps before solving the task at hand. Breaking down complex problems into a series of smaller steps via CoT helps models generate more accurate answers.

CoT + ICL integrates the reasoning steps of CoT directly into the few-shot demonstrations.

For MedPrompt, the researchers found that they could get better outputs by allowing GPT-4 to generate the CoT step. In contrast, in the Med-Palm 2 study the CoT steps were written by highly specialized professionals with domain knowledge (surgeons and doctors).
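A minimal sketch of that self-generation step, assuming a wrapper around your LLM API (the template wording and function names below are illustrative, not the paper's exact text). The key detail is the verification filter: a generated chain is kept only if its final answer matches the ground-truth label, which screens out plausible-sounding but wrong reasoning:

```python
COT_TEMPLATE = """## Question: {question}
{choices}
## Answer
Let's think step by step.
Therefore, the answer is [final answer letter]"""

def build_cot_prompt(question, choices):
    """Format a training question into an auto-CoT generation prompt.
    `choices` is a list of (label, text) pairs."""
    formatted = "\n".join(f"{label}. {text}" for label, text in choices)
    return COT_TEMPLATE.format(question=question, choices=formatted)

def keep_if_verified(generated_cot, generated_answer, gold_answer):
    """Label verification: keep the chain only when the model's final
    answer matches the known correct answer; otherwise discard it."""
    if generated_answer == gold_answer:
        return (generated_cot, generated_answer)
    return None

prompt = build_cot_prompt(
    "Which vitamin deficiency causes scurvy?",
    [("A", "Vitamin A"), ("B", "Vitamin B12"),
     ("C", "Vitamin C"), ("D", "Vitamin D")],
)
# Send `prompt` to GPT-4, parse out the chain and final answer, then:
sample = keep_if_verified("Scurvy results from a lack of...", "C", gold_answer="C")
```

Verified (chain, answer) pairs are what get stored during preprocessing and later attached to the dynamically selected few-shot examples.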

Here are examples of a human generated CoT and a GPT-4 version.

human generated CoT and hand-crafted CoT from Med PaLM
CoT steps generated by GPT-4 are longer and provide a finer-grained step-by-step reasoning logic.

The researchers tested the expert crafted chain-of-thought step used in Med-PaLM 2 with the CoT step automatically generated by GPT-4.

Table comparing expert-crafted CoT prompts to GPT-4-generated ones
Both methods were evaluated using GPT-4 with fixed 5-shot examples

It seems that having an LLM create the CoT step is better than having specialized domain experts write it. This is a huge finding for the average developer, and it echoes findings from another research paper: Automatic Chain of Thought Prompting in Large Language Models.

Adding auto-CoT to MedPrompt led to a 3.4% accuracy increase on the MedQA dataset.


Choice-shuffle ensemble

Ensembling is a technique that compares many outputs and then comes to a final answer.

The outputs could come from prompts that have different few-shot examples, or a different ordering. Ensembling compares these outputs with functions like averaging, majority vote, etc.

Additionally, by shuffling the components of a few-shot prompt, ensembling can effectively identify and address biases related to the order sensitivity of GPT-4.

The shuffle/ensemble layer increased accuracy on the MedQA dataset by 2.1%.
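The shuffle-and-vote loop can be sketched as follows. `ask_model` stands in for a real GPT-4 call (all names here are mine, not the paper's), and it is stubbed so the example runs:

```python
import random
from collections import Counter

def choice_shuffle_ensemble(question, choices, ask_model, n_rounds=5, seed=0):
    """Run the model n_rounds times with the answer options shuffled each
    time, map each response back to the original option text, and take a
    majority vote. This averages out the model's option-position bias."""
    rng = random.Random(seed)
    votes = []
    for _ in range(n_rounds):
        shuffled = choices[:]
        rng.shuffle(shuffled)
        labels = "ABCD"[: len(shuffled)]
        picked_label = ask_model(question, list(zip(labels, shuffled)))
        # Vote on the underlying option text, not the (shuffled) letter.
        votes.append(dict(zip(labels, shuffled))[picked_label])
    return Counter(votes).most_common(1)[0][0]

# Stub model: always picks the option whose text is "Vitamin C",
# wherever it lands in the shuffled list.
def stub_model(question, labeled_choices):
    for label, text in labeled_choices:
        if text == "Vitamin C":
            return label
    return "A"

final = choice_shuffle_ensemble(
    "Which vitamin deficiency causes scurvy?",
    ["Vitamin A", "Vitamin B12", "Vitamin C", "Vitamin D"],
    stub_model,
)
print(final)  # Vitamin C
```

Because votes are tallied on option text rather than letter position, a model that favors, say, option "A" regardless of content gets its bias diluted across the shuffles.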

Experiment setup

The experiments spanned a few different medical datasets from the MultiMedQA benchmark suite. MedPrompt was used across these sets, with GPT-4 as the model:

  1. MedQA: USMLE-style questions testing medical competency in the United States.
  2. MedMCQA: Mock and historical questions from Indian medical school entrance exams.
  3. PubMedQA: Questions with 'yes,' 'no,' or 'maybe' answers, based on PubMed abstracts.
  4. MMLU: A subset of tasks from a multitask benchmark suite relevant to medicine.

In general, benchmarks are only one way to measure performance, and certain datasets (like MMLU) have been shown to be problematic at best. But that is for a different article.

Experiment results

Below are the results from the MultiMedQA datasets.

Table of results from the experiments
*sourced directly from the original Microsoft paper


  • MedPrompt used 5 dynamically selected few-shot examples and 5 ensemble shuffles
  • MedPrompt achieves state-of-the-art performance on every one of the nine benchmarks

Based on the results, it looks like a strong model plus strong prompt engineering can outperform even domain-specific models.

What is really great about this study is that the researchers went on to test MedPrompt across six additional datasets, spanning electrical engineering, machine learning, philosophy, professional accounting, professional law, and professional psychology.

Across these datasets, MedPrompt still saw an average improvement of 7.3% over baseline zero-shot prompting. On the MultiMedQA datasets, MedPrompt achieved a 7.1% improvement. This shows that the method translates beyond medicine.

MedPrompt performance, broken down into pieces

Below are 2 ways to view the effectiveness of the various prompt engineering methods used.

Bar chart showing the performance of a few models and the Medprompt method on MedQA dataset

Breakdown of the accuracy increases between the prompt engineering methods

Fixed few-shot + automated CoT seems to offer a high return relative to effort.

Does GPT-4 + MedPrompt outperform Google’s Gemini Ultra?

In Gemini’s recent launch, Google claimed to beat GPT-4 on 30 of 32 benchmarks. But there were several issues with how they compared Gemini to GPT-4. For example, they used different prompt engineering methods for each model.

Two bar charts of Gemini Ultra versus GPT-4 performance on MMLU
See "New Google Metric" in second set of columns

So Microsoft responded. They compared Gemini Ultra, using Google’s new prompting method, against GPT-4-Turbo with a variation of MedPrompt: MedPrompt+.

Introducing MedPrompt+

MedPrompt achieved 89.6% accuracy on the MMLU benchmark, falling less than 0.5% shy of Gemini Ultra. Looking at where MedPrompt fell short, the researchers found that its CoT sometimes caused the model to overthink simple questions that didn’t require extensive reasoning. This spurred them to make some changes and develop MedPrompt+.

MedPrompt+ is a two-prompt system: the first prompt is standard MedPrompt, and the second is a simplified version with a diminished auto-generated CoT.

GPT-4 is prompted to pick which strategy to use based on the question: if a scratch pad is necessary, it defaults to the prompt with advanced reasoning (standard MedPrompt); if not, it uses the version with diminished CoT steps.
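A hedged sketch of that routing logic; the classifier prompt wording and function names below are my own, not the paper's:

```python
def route_prompt(question, ask_model):
    """Ask the model whether the question needs multi-step reasoning, then
    dispatch to the full MedPrompt prompt or the diminished-CoT variant."""
    decision = ask_model(
        "Does answering the following question require step-by-step "
        f"scratch-pad reasoning? Reply YES or NO.\n\n{question}"
    )
    if decision.strip().upper().startswith("YES"):
        return "medprompt_full"       # full MedPrompt with auto-CoT
    return "medprompt_diminished"     # simplified prompt, reduced CoT

# Stub call: pretend the model decides simple recall needs no scratch pad.
print(route_prompt("What is the capital of France?", lambda p: "NO"))
```

The chosen strategy name would then select which of the two prompt templates actually gets sent to GPT-4.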

Now let's look at how the two models and methods stack up.

Two of tech’s biggest giants, pitting their strongest models against each other with their best prompt engineering.

Let’s look at the results:

bar chart comparing a few models and methods on accuracy on MMLU


  • MedPrompt+ edges out Gemini Ultra by less than 0.1%
  • Keep in mind that Gemini Ultra, Google’s best model, isn’t even fully released yet, while GPT-4-Turbo has been out for months
  • It appears that OpenAI/Microsoft is still ahead in the LLM wars

Wrapping up

We ran through a lot here. My biggest practical takeaway is that patterns are emerging across these research papers: table stakes for your production prompts are some form of ICL and CoT. If you need help implementing this or have questions, join the waitlist and get in touch!

Dan Cleary