GPT-3.5, GPT-4-Turbo, and a physics student walk into a classroom at Durham University. Each is given the same coding assignment. Who wins?

That isn’t the opener for a bad joke, but rather the focus of a recent research paper: 'A comparison of Human, GPT-3.5, and GPT-4 Performance in a University-Level Coding Course'.

The study evaluated the performance of GPT-3.5 and GPT-4-Turbo, both with and without prompt engineering (spoiler: prompt engineering boosted performance considerably), against physics students on a university-level physics coding (Python) assignment.

In addition to judging performance, the researchers asked the evaluators to guess whether a given submission was written by AI or a human.

I really love this paper because it gives a real-world, semi-controlled case study of how humans stack up against Large Language Models (LLMs), with no fuzzy benchmarks involved.

Experiment setup

The study took place at Durham University in England, where the students were taking a 10-week coding course as part of a physics degree.

The researchers compared 50 homework submissions by students to 50 AI-generated submissions. Each submission was marked blindly by three graders, totaling 300 data points.

Limitations

There were a few limitations to the study:

  • There was no way to be 100% sure that the students didn’t use AI to help with their work
  • The researchers had to assist the LLMs by pre-processing some data. For example, the LLMs had trouble parsing tables, so the researchers converted them to Python dictionaries (a rough sketch of that idea follows below).
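
The paper doesn't show the exact pre-processing, but the idea is straightforward: instead of pasting a formatted table into the prompt, hand the model something it can parse unambiguously. Here's a rough sketch of what that might look like; the column names and values are made up for illustration, not taken from the assignment:

```python
# Hypothetical example: a data table from the assignment handout, rewritten as a
# plain Python dictionary so the LLM can read it reliably.
measurements = {
    "time_s":     [0.0, 0.5, 1.0, 1.5, 2.0],          # first column of the table
    "position_m": [0.00, 1.22, 4.91, 11.03, 19.64],   # second column of the table
}

# The dictionary, rather than the original formatted table, is what gets pasted
# into the prompt alongside the assignment text.
prompt_snippet = f"Using the data {measurements}, fit a quadratic and plot the result."
```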

With these two limitations in mind, the comparison is less “Human vs AI” and more “Human (most likely not helped by AI) vs AI (with human assistance)”. No study is perfect.

Methods

Here are the six methods tested:

  • GPT-3.5 raw: Minimal adjustments made, submitted directly to OpenAI’s API
  • GPT-3.5 with prompt engineering: The submitted text was modified to follow certain prompt engineering guidelines (see table below)
  • GPT-4 raw: Minimal adjustments made, submitted directly to OpenAI’s API using the gpt-4-1106-preview model (aka GPT-4-Turbo); a minimal example of such an API call follows the list
  • GPT-4 with prompt engineering: The submitted text was modified to follow prompt engineering guidelines (see table below)
  • Mixed: Randomly selected and combined contributions from students and GPT-4 with prompt engineering, 10 submissions in total, 5 from each category. This setup was designed to evaluate how seamlessly AI-generated content can blend with human work and to test the evaluators' ability to distinguish between the two.
  • Student: 50 student submissions, randomly selected
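
For the “raw” categories, the assignment text was sent to OpenAI’s API essentially as-is. For orientation, a minimal call might look like the sketch below; the prompt contents are placeholders, not the study’s actual prompts.

```python
# Minimal sketch of submitting assignment text to OpenAI's API.
# The model name matches the paper (gpt-4-1106-preview, i.e. GPT-4-Turbo);
# the prompt text here is a placeholder, not the study's actual prompt.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

assignment_text = "..."  # the lightly pre-processed assignment text

response = client.chat.completions.create(
    model="gpt-4-1106-preview",
    messages=[{"role": "user", "content": assignment_text}],
)

print(response.choices[0].message.content)  # the code/answer that gets graded
```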

[Table: prompt engineering steps used for the GPT-3.5 with prompt engineering, GPT-4 with prompt engineering, and Mixed categories]

For each AI method, 10 submissions were generated (each submission is a PDF containing 14 plots of data). That gives 50 AI submissions and 50 student submissions, for a total of 100.

Each submission was blindly evaluated by three graders, totaling 300 data points.

Here’s another important limitation that arose during generation.

Occasionally, the outputs from the LLMs included plain text or non-Python code, which led to errors. So, in order to get 10 usable submissions per method, the prompts were run multiple times until the target number was reached.
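
The paper doesn’t publish the regeneration script, but conceptually it’s a simple retry loop: keep a response only if it parses as Python, and re-prompt otherwise. A rough sketch, where generate_submission is a hypothetical stand-in for one API call:

```python
# Hypothetical sketch of the regeneration loop described above.
# generate_submission() is an illustrative stand-in, not a function from the paper.
import ast

def is_valid_python(code: str) -> bool:
    """Return True if the text parses as Python source."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

def collect_submissions(generate_submission, target: int = 10) -> list[str]:
    """Re-run the prompt until `target` syntactically valid submissions exist."""
    submissions = []
    while len(submissions) < target:
        candidate = generate_submission()   # one API call
        if is_valid_python(candidate):      # discard plain text / broken code
            submissions.append(candidate)
    return submissions
```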

I found this worth calling out, as it was another example of the LLMs needing some assistance along the way. It's a subtle point, but an important one.

Experiment Results

Performance results

With all the setup out of the way, let’s jump right in.

[Figure: performance results by category]

Takeaways

  • Humans stay on top! (For now.) The students scored an average mark of 91.9%, compared to 81.1% for the best-performing AI category, GPT-4 with prompt engineering
  • Prompt engineering makes a big impact. GPT-4 jumps from 71.9% to 81.1%, a ~12.8% relative gain, and GPT-3.5 goes from 30.9% to 48.8%, a ~58% relative gain

AI Detection

For the second experiment, the researchers asked the evaluators to rate each submission on a Likert scale, assessing how likely it was that the work was produced by AI. The Likert scale had four options, with accompanying scores:

  1. Definitely AI (score: 0)
  2. Probably AI (score: 1)
  3. Probably Human (score: 2)
  4. Definitely Human (score: 3)

[Figure: AI-identification results by category]

Average scores:

  • GPT-3.5: 0.033
  • GPT-3.5 with prompt engineering: 0.200
  • GPT-4: 0.467
  • GPT-4 with prompt engineering: 1.167
  • Student only: 2.367

There is a clear pattern amongst the AI-related categories: the addition of prompt engineering leads to outputs that are both more accurate and more student-sounding. Those two attributes go hand in hand here, since the student responses were the highest-performing category.

When the Likert ratings were collapsed into a binary “AI” or “Human” classification, the three evaluators had an average accuracy of ~85%.
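
The exact collapse rule isn’t spelled out above, but the natural mapping is to treat “Definitely AI” and “Probably AI” as an “AI” call and the other two options as a “Human” call. A minimal sketch under that assumption, with made-up ratings rather than the study’s data:

```python
# Minimal sketch, assuming Likert scores 0-1 collapse to "AI" and 2-3 to "Human".
# The ratings and true labels below are illustrative, not the study's data.

def to_binary(likert_score: int) -> str:
    """Map a 0-3 Likert score to a binary AI/Human call."""
    return "AI" if likert_score <= 1 else "Human"

ratings     = [0, 1, 2, 3, 1, 2]                             # evaluator scores
true_labels = ["AI", "AI", "Human", "Human", "Human", "AI"]  # actual provenance

calls    = [to_binary(r) for r in ratings]
accuracy = sum(c == t for c, t in zip(calls, true_labels)) / len(true_labels)
print(f"accuracy = {accuracy:.0%}")  # 67% on this toy example
```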

The evaluators were able to discern a pattern in the plots within the AI outputs: they tended to appear slightly misaligned, with features such as unusual font sizes and positioning. Student submissions showed bolder, more distinctive design choices, such as unconventional color palettes in their plots.

These nuanced, creative decisions by the students served as the clear differentiator from their AI counterparts. In many cases these decisions were ‘non-traditional’ or slightly unconventional, highlighting a potentially very important point about human creativity compared to the current state of LLMs.

Let's look at one final graph comparing the words used by AI vs the students.

[Figure: side-by-side bar charts comparing unique word counts and average word lengths for AI-generated and student submissions]

  • AI-generated texts tend to have a slightly higher diversity of vocabulary, as shown by a greater number of unique words compared to human-authored texts.
  • Words used by AI are, on average, longer than those used by humans, indicating a tendency toward more complex or technical language (both metrics are simple to compute; see the sketch below)
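
For reference, here is roughly how those two metrics can be computed for any piece of text. The sample sentence is just a placeholder, and the simple regex tokenizer is an assumption rather than the paper’s exact method:

```python
# Rough sketch of the two lexical metrics in the chart above:
# unique word count and average word length.
import re

def lexical_stats(text: str) -> tuple[int, float]:
    """Return (number of unique words, average word length) for a text."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    unique_words = len(set(words))
    avg_word_length = sum(len(w) for w in words) / len(words)
    return unique_words, avg_word_length

sample = "The gradient of the fitted line indicates the acceleration of the trolley."
print(lexical_stats(sample))  # prints the unique-word count and average length
```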

Conclusion

While the students were able to outperform the models for now, it’s unclear how long that trend will last, especially considering the significant jump in performance from GPT-3.5 to GPT-4.

The experiments also highlight a trend we see elsewhere in academia and in production: a little prompt engineering goes a long way in producing better and more reliable outputs.

This paper is great because it gives a snapshot in time of how humans stack up against AI on a specific type of knowledge-based task. Bookmark this one to review in a year!

Dan Cleary
Founder