One thing most people building in AI agree on is that evaluating prompts and outputs is still underdeveloped. The most common method seems to be some form of a “vibe check”: a human looks at the response and gives it a thumbs up or down.

But once you’re dealing with even just hundreds of requests, an automated solution becomes a necessity. This is usually when teams turn to LLMs for evaluations.

But how reliable are LLMs at evaluating LLM outputs? Can we leverage a model like GPT-4 for the bulk of our evals?

Insights from the latest research

In my deep dive into LLM-powered evaluations, I came across two notable frameworks and a meta-analysis that stood out:


GPTScore

Published in February 2023, GPTScore introduced an evaluation framework that uses conditional generation probability to score text.

For example, let’s say we want to use this framework to grade article summaries. Given an article and its corresponding summary, GPTScore evaluates the quality of the summary by determining the likelihood of a model generating that summary text, given the context.

This method relies on the idea that high-quality text would align with the patterns that the model learned during its training period.

This works well when the evaluation criteria are focused on aspects of text quality that a model is trained to recognize (coherence, grammatical correctness, relevance).
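To make the scoring idea concrete, here is a minimal sketch. It assumes you already have per-token log-probabilities for the summary, conditioned on the article and the evaluation instructions, from whatever model API you use. The `gptscore` function name, the equal-weight default, and the example numbers are my own illustration, not code from the paper.

```python
import math
from typing import Optional, Sequence


def gptscore(token_logprobs: Sequence[float],
             weights: Optional[Sequence[float]] = None) -> float:
    """Weighted sum of per-token log-probabilities of the candidate text.

    token_logprobs[t] is assumed to be log p(token_t | earlier tokens, context),
    obtained from a model API that is not shown here.
    With no weights given, every token gets weight 1/len (a plain average).
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    if weights is None:
        weights = [1.0 / len(token_logprobs)] * len(token_logprobs)
    if len(weights) != len(token_logprobs):
        raise ValueError("weights and token_logprobs must have equal length")
    return sum(w * lp for w, lp in zip(weights, token_logprobs))


# Hypothetical log-probs for two candidate summaries of the same article.
likely_summary = [math.log(0.5), math.log(0.4), math.log(0.6)]
unlikely_summary = [math.log(0.05), math.log(0.1), math.log(0.02)]

# The summary the model finds more probable gets the higher score.
print(gptscore(likely_summary) > gptscore(unlikely_summary))  # True
```

Averaging instead of summing keeps long summaries from being penalized just for having more tokens, which is one reasonable choice among several.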

But there are a few problems with this approach:

  • Potential biases in the LLM due to its training data
  • Evaluating subjective criteria that don't align with the model's learned patterns
  • Just because the evaluator LLM would probably generate the same summary doesn't mean the summary is inherently high quality (see point above)

Below are the results from the experiments:

Results from GPTScore experiments

The scores listed in the table are the Spearman correlations for different aspects (fluency, coherence, etc.) on text summarization datasets. The Spearman correlation measures how well the ranking produced by the LLM's evaluations matches the ranking given by human evaluators. The higher the correlation, the more closely the LLM's evaluations align with the human evaluators.
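If you want to run the same check on your own evals, Spearman correlation is just the Pearson correlation of the ranks. Below is a small dependency-free sketch; the function names and the example scores are made up for illustration (in practice a stats library would do the same job).

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j share one rank
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation computed on the ranks of x and y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical scores for five summaries.
human_scores = [4.5, 2.0, 3.5, 1.0, 5.0]
llm_scores = [4, 3, 2, 1, 5]

print(round(spearman(human_scores, llm_scores), 2))  # 0.9
```

Only the ordering matters here: the LLM and the humans can use different scales and still correlate perfectly, as long as they rank the summaries the same way.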


  • Providing an example (essentially using few-shot prompting) in the evaluation process led to improved outcomes (see the IST column)
  • The highest Spearman correlation for GPT models was ~47 and the average was ~43.3 (the table reports correlations multiplied by 100, so roughly 0.47 and 0.43)
  • The LLMs exhibit moderate agreement with human evaluations, so there is a lot of room for growth
  • With Spearman scores at this level, you can't rely on LLMs alone

My main question with this experiment lies in the evaluation methodology. While I was reading it, I wondered why the researchers didn’t prompt the model to output scores directly, rather than determining the likelihood of the LLM generating that summary.

Fast forward a few months and the research team at Microsoft had an answer.

G-EVAL: A GPT-4 based evaluation framework

Overall, G-EVAL is an improvement over the GPTScore framework, achieving an average Spearman correlation of 0.514 on summarization tasks.

Unlike GPTScore, G-EVAL performs the evaluation directly with a form-filling method, so the model outputs a numeric score.

For example, let’s say we are evaluating the coherence of a summary of a news article. G-EVAL compiles the prompt, evaluation criteria, news article, and the summary. It then calls the LLM, which outputs a score based on the determined evaluation criteria.

G-EVAL has two main components:

1. A prompt that contains the evaluation task and criteria

2. A set of evaluation steps generated from an auto Chain-of-Thought (CoT) prompt
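Here is a rough sketch of how those two components can be stitched into a single form-filling prompt. The prompt wording, the function names, and the `call_llm` callable are all assumptions on my part (swap in your own model client); this is not the paper's exact prompt or code.

```python
import re
from typing import Callable, List, Optional


def build_eval_prompt(aspect: str, criteria: str, steps: List[str],
                      article: str, summary: str) -> str:
    """Assemble task, criteria, evaluation steps, inputs, and a blank form."""
    numbered_steps = "\n".join(f"{i}. {step}" for i, step in enumerate(steps, 1))
    return (
        "You will be given one summary written for a news article.\n"
        f"Your task is to rate the summary on one metric: {aspect}.\n\n"
        f"Evaluation Criteria:\n{criteria}\n\n"
        f"Evaluation Steps:\n{numbered_steps}\n\n"
        f"Source Text:\n{article}\n\n"
        f"Summary:\n{summary}\n\n"
        f"Evaluation Form (scores ONLY):\n- {aspect}:"
    )


def parse_score(reply: str, low: int = 1, high: int = 5) -> Optional[int]:
    """Return the first integer in the reply if it falls within [low, high]."""
    match = re.search(r"\d+", reply)
    if match is None:
        return None
    score = int(match.group())
    return score if low <= score <= high else None


def evaluate(call_llm: Callable[[str], str], prompt: str) -> Optional[int]:
    """call_llm is a hypothetical stand-in for a real model client."""
    return parse_score(call_llm(prompt))


# Illustrative inputs; in G-EVAL the steps come from the auto CoT prompt.
steps = ["Read the article and identify its main points.",
         "Check whether the summary presents them in a logical order.",
         "Assign a coherence score from 1 to 5."]
prompt = build_eval_prompt(
    "Coherence", "The summary should be well-structured and well-organized.",
    steps, "<news article text>", "<summary text>")

fake_llm = lambda _prompt: " 4"  # stub so the example runs without an API key
score = evaluate(fake_llm, prompt)
print(score)  # 4
```

Returning `None` for replies that are out of range or contain no number makes it easy to count and retry malformed responses instead of silently scoring them.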

Below are some results from the experiments:

Results from SummEval benchmark

Results from Topical-Chat benchmark

Results from QAGS benchmark


  • You can see much higher Spearman scores (up to ~0.70 across some dimensions) compared to GPTScore
  • Adding the autogenerated CoT doesn't affect Spearman scores strongly (see first table)
  • The framework deserves some credit, but it's important to acknowledge that GPT-4 is used for all requests, compared to GPT-3 in GPTScore’s study
  • A larger model size can enhance the performance of LLM evaluation frameworks

G-EVAL made big strides, but there is still a lot of room to grow those Spearman scores. That's where our last paper steps in: A Closer Look into Automatic Evaluation Using LLMs.

Building on G-EVAL

Researchers from National Taiwan University saw the great work done with G-EVAL and wanted to dig deeper to see what changes could be made to further align LLM evaluations.

Broadly, they wanted to answer three questions:

  • Does adding a set of evaluation steps from a CoT prompt increase alignment?
  • What if we didn't force the LLM to respond with only a numeric rating, as G-EVAL does?
  • Can prompt engineering lead to more aligned evaluations?

The researchers put together four prompt variations to test across the evaluation experiments.

Prompting Methods

To address the question of whether or not forcing the LLM to output a single number affected correlation scores, the researchers tested a few prompts to extract scores from the model.

Score only

Returns only a score, as in G-EVAL.

Free Text

The prompt for Free Text depends on the attribute. Here is an example for judging coherence.

With this prompt, the model is allowed to generate more than a numeric answer, but it doesn't have to. In fact, when using this prompt, the researchers found that the LLM responded mostly with a single numeric rating, similar to its behavior when instructed with Score only.


Rate-explain

Asks the model to provide a numerical rating first and then explain its reasoning.


Analyze-rate

Asks the model to analyze the samples and evaluation criteria first and then give a rating.
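To show how little changes between the four methods, here is a sketch where only the final answer instruction differs. The instruction wording is my paraphrase for illustration, not the paper's exact text, and `build_prompt` and `extract_rating` are hypothetical helper names.

```python
import re
from typing import Optional

# Shared part of the prompt; only the answer instruction below varies.
SHARED_HEADER = (
    "Rate the coherence of the summary on a scale of 1 to 5.\n\n"
    "Source Text:\n{article}\n\nSummary:\n{summary}\n\n"
)

# Paraphrased answer instructions (assumed wording), one per method.
ANSWER_INSTRUCTIONS = {
    "score_only": "Evaluation Form (scores ONLY):\n- Coherence:",
    "free_text": "How coherent is the summary? (on a scale of 1-5)",
    "rate_explain": ('Answer by starting with "Rating:" and then explain '
                     'the rating on the next line, starting with "Rationale:".'),
    "analyze_rate": ('Answer by starting with "Analysis:" to analyze the '
                     "summary against the criteria, then give the numeric "
                     'rating on the next line, starting with "Rating:".'),
}


def build_prompt(method: str, article: str, summary: str) -> str:
    """Combine the shared header with the method-specific answer instruction."""
    if method not in ANSWER_INSTRUCTIONS:
        raise ValueError(f"unknown method: {method}")
    header = SHARED_HEADER.format(article=article, summary=summary)
    return header + ANSWER_INSTRUCTIONS[method]


def extract_rating(reply: str) -> Optional[int]:
    """Prefer an explicit 'Rating: N'; otherwise fall back to the first integer."""
    labeled = re.search(r"Rating:\s*(\d+)", reply, flags=re.IGNORECASE)
    if labeled:
        return int(labeled.group(1))
    bare = re.search(r"\d+", reply)
    return int(bare.group()) if bare else None


print(extract_rating("Rating: 4\nRationale: The ideas flow logically."))  # 4
print(extract_rating("Analysis: The 2 halves barely connect.\nRating: 3"))  # 3
print(extract_rating("5"))  # 5
```

Looking for the labeled rating first matters for Analyze-rate, where the analysis text can contain other numbers before the actual score.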

Let's move on to the experiment results.

Experiment Results

The researchers ran a variety of experiments, with GPT-3.5-turbo as their model. Below are the results from one set of benchmark tests.

A table of results from the meta-analysis of G-EVAL

The blue row shows the results from using G-EVAL with GPT-3.5, while the results in the yellow row use the new prompts described above. Bolded numbers represent figures that are statistically significantly higher than the baseline (except GPT-4).

Addressing the first question, related to CoT, you can see that the auto CoT tended to underperform compared to the other methods (more on those shortly).


  • The addition of CoT doesn't always increase alignment
  • The researchers' new prompts consistently outperformed G-EVAL
  • Outputs from the Free Text prompt usually contained only a single numeric rating (same behavior as Score only), but consistently outperformed Score only. This highlights that what the model is allowed to generate is sometimes more important than what it actually generates.
  • Rate-explain and Analyze-rate consistently outperform other methods. Again, this highlights one of the key tenets of prompt engineering: giving the model room to think.
  • Clearly, prompt engineering can lead to better, more highly correlated evaluations
  • Forcing the model to output a single number leads to lower correlation
  • The researchers tested across temperatures and found that Rate-explain and Analyze-rate consistently achieved higher correlations compared with G-EVAL. There were no significant changes in correlations for these two methods as the temperature varied.

A lot of pieces came together with this paper, including a ton of practical tips for anyone doing evals today. Aside from Spearman scores, there are other issues that arise when thinking about using LLMs for evaluations. What about bias?

Can we trust LLMs to evaluate LLM outputs?

When doing evaluations, do LLMs have a bias toward outputs generated by LLMs?

The team behind G-EVAL put this to the test. They compared the eval scores of LLM-generated summaries and (high-quality) human-written summaries. The results are below:

Bar graphs comparing human summary ratings and GPT summary ratings
GPT-3.5 was the model used in this experiment

The dataset is divided into three categories (in order):

  • When human-written summaries are rated higher than LLM summaries by human judges
  • When human-written summaries are rated lower by human judges
  • When human-written summaries and LLM summaries are rated as equally good by human judges

You can see that the LLM consistently assigns higher scores to GPT-3.5 summaries, even when human judges prefer human-written summaries.

This isn’t great, but it isn’t as bad as it looks. Here is some more context.

Evaluating outputs from high-quality Large Language Models (LLMs) is inherently challenging.

Even the human annotators providing evaluations found it difficult to reach a consensus. The inter-rater agreement was measured with Krippendorff's alpha, where 1 indicates complete agreement and 0 indicates agreement no better than chance. In this case, the score was 0.07! That means there was very little agreement amongst the human evaluators on the quality of the outputs.

The lack of consensus among human evaluators reflects the truth that agreeing on which summaries are “better” is extremely difficult.
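For a feel of how a score like that comes about, here is a sketch of Krippendorff's alpha for the simplest case: nominal labels, where ratings either match or they don't. This is my own simplified illustration with made-up data; the study rated on a numeric scale and may well have used a different variant of the metric, so don't expect this to reproduce the 0.07.

```python
from collections import Counter
from itertools import permutations
from typing import Hashable, Sequence


def krippendorff_alpha_nominal(units: Sequence[Sequence[Hashable]]) -> float:
    """units[i] holds the labels that the raters gave to item i."""
    coincidences = Counter()  # (label_a, label_b) -> weighted pair count
    for ratings in units:
        m = len(ratings)
        if m < 2:
            continue  # an item rated once cannot show (dis)agreement
        for a, b in permutations(ratings, 2):
            coincidences[(a, b)] += 1 / (m - 1)

    totals = Counter()  # how often each label appears in pairable ratings
    for (a, _b), count in coincidences.items():
        totals[a] += count
    n = sum(totals.values())

    # Observed vs. chance-expected disagreement (both scaled by n).
    observed = sum(c for (a, b), c in coincidences.items() if a != b)
    expected = sum(totals[a] * totals[b]
                   for a in totals for b in totals if a != b) / (n - 1)
    if expected == 0:
        raise ValueError("alpha is undefined when the ratings never vary")
    return 1 - observed / expected


# Two hypothetical raters labeling ten summaries as good (1) or bad (0).
rater_a = [0, 1, 0, 0, 0, 0, 0, 0, 1, 0]
rater_b = [1, 1, 1, 0, 0, 1, 0, 0, 0, 0]
alpha = krippendorff_alpha_nominal(list(zip(rater_a, rater_b)))
print(round(alpha, 3))  # 0.095
```

Notice that the two raters agree on 6 of the 10 items, yet alpha is still below 0.1. The metric corrects for agreement you would expect by chance, which is why raw agreement percentages can look much healthier than alpha does.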

Can’t rely on benchmarks alone

User preferences may not align with LLM preferences, and they can also diverge from measured model quality. The following chart, from Coatue’s AI report, illustrates this point.

2 bar graphs comparing human preference and performance across Claude 1 and Claude 2

Can we use LLMs as evaluators?

Yes and no. LLMs are incredibly efficient at processing large volumes of data, which makes them valuable for scaling the evaluation process. But the current Spearman correlation scores indicate that LLMs aren't yet reliable enough to be the sole evaluators.

The most effective strategy, and our recommendation, is a hybrid approach. By combining the computational power of LLMs with the nuanced understanding of human evaluators, you can get the best of both worlds.

At the end of the day, user feedback is king. While an LLM may prefer one prompt over another, it's the users' preferences and their feedback that should guide the final evaluation. As shown in the graph above, users' opinions sometimes differ from what benchmarks treat as the “truth”.

Dan Cleary