GPT-3.5, GPT-4-Turbo, and a physics student walk into a classroom at Durham University. Each is given the same coding assignment. Who wins?

That isn’t the opener for a bad joke, but rather the focus of a recent research paper: 'A comparison of Human, GPT-3.5, and GPT-4 Performance in a University-Level Coding Course'.

The study evaluated the performance of GPT-3.5 and GPT-4-Turbo, both with and without prompt engineering (spoiler: prompt engineering boosted performance considerably), against physics students on a university-level physics coding assignment written in Python.

In addition to judging performance, the researchers asked the evaluators to guess whether a given submission was written by AI or by a human.

I really love this paper because it gives a real-world, semi-controlled case study of how humans stack up against Large Language Models (LLMs), with no fuzzy benchmarks.

Experiment setup

The study took place at Durham University in England, where the students were enrolled in a 10-week coding course as part of a physics degree.

The researchers compared 50 homework submissions from students against 50 AI-generated submissions. Each submission was marked blindly by three graders, for a total of 300 data points.

Limitations

There were a few limitations to the study:

  • There was no way to be 100% sure that the students didn’t use AI to help with their work (they were instructed not to)
  • The researchers had to assist the LLMs by pre-processing some data. For example, the LLMs had trouble parsing tables, so the researchers converted them to Python dictionaries (see the sketch after this list)
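
To make that second point concrete, here is a minimal sketch of the kind of table-to-dictionary pre-processing described in the paper. The variable names and values below are illustrative assumptions, not data from the actual assignment.

```python
# Original data arrives as a table in the assignment, e.g.:
#
#   Material  | Density (kg/m^3) | Young's modulus (GPa)
#   Steel     | 7850             | 200
#   Aluminium | 2700             | 69
#
# Converted to a Python dictionary so the LLM can consume it directly:
materials = {
    "Steel":     {"density_kg_m3": 7850, "youngs_modulus_GPa": 200},
    "Aluminium": {"density_kg_m3": 2700, "youngs_modulus_GPa": 69},
}

# The dictionary text can then be pasted into the prompt alongside the task description.
print(materials["Steel"]["density_kg_m3"])  # 7850
```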

With these two limitations in mind, the comparison turns from “Human vs AI” to “Human (most likely not helped by AI) vs AI (with human assistance)” - no study is perfect.

Methods

Here are the 6 methods that were tested:

  • GPT-3.5 raw: Minimal adjustments made, submitted directly to OpenAI’s API
  • GPT-3.5 with prompt engineering: The submitted text was modified to follow certain prompt engineering guidelines (see table below)
  • GPT-4 raw: Minimal adjustments made, submitted directly to OpenAI’s API using the gpt-4-1106-preview model (aka GPT-4-Turbo); see the example call after this list
  • GPT-4 with prompt engineering: The submitted text was modified to follow prompt engineering guidelines (see table below)
  • Mixed: Randomly selected and combined contributions from students and GPT-4 with prompt engineering (10 submissions in total, 5 from each category). This setup was designed to evaluate how seamlessly AI-generated content can blend with human work and to test the evaluators' ability to distinguish between the two.
  • Student: 50 student submissions, randomly selected
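
For reference, a “raw” submission amounts to a single chat completion request. The sketch below shows roughly what that call looks like with the openai Python SDK; the prompt contents and the absence of a system message are my assumptions, not details from the paper.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative placeholder: the pre-processed assignment text (tasks + data)
assignment_text = "..."

response = client.chat.completions.create(
    model="gpt-4-1106-preview",  # the GPT-4-Turbo snapshot used in the study
    messages=[{"role": "user", "content": assignment_text}],
)

print(response.choices[0].message.content)
```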

[Table: Prompt engineering steps used for the GPT-3.5 with prompt engineering, GPT-4 with prompt engineering, and Mixed categories. The steps follow fairly standard guidance: cleaning up how functions are defined, removing unnecessary information, adding a preamble for context, and using placeholders to structure the prompt.]
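
As a rough illustration of those steps, here is a hedged sketch of what the engineered prompt wrapper might look like. The preamble wording, section headers, and function name are my own assumptions, not the paper’s exact template.

```python
PREAMBLE = (
    "You are completing a university-level physics coding assignment in Python. "
    "Return only runnable Python code that produces the requested plots."
)

def build_prompt(task_description: str, data_as_dict: str) -> str:
    """Assemble the engineered prompt: a preamble for context, the cleaned-up
    task text, and the pre-processed data inserted at a clearly marked placeholder."""
    return (
        f"{PREAMBLE}\n\n"
        f"### Task\n{task_description}\n\n"
        f"### Data (as a Python dictionary)\n{data_as_dict}\n"
    )

# Example usage with illustrative inputs:
print(build_prompt("Plot the measured period against pendulum length.", "{'length_m': [0.1, 0.2]}"))
```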

For each AI method, 10 submissions were generated (each submission is a PDF containing 14 plots of data). That makes 50 AI submissions and 50 student submissions, for a total of 100.

Each submission was blindly evaluated by 3 graders, totaling 300 data points.

Here’s an important limitation that arose.

Occasionally, the outputs from the LLMs included plain text or non-Python code, which led to errors. To reach 10 submissions per method, the prompts were re-run until the target number of valid submissions was reached.

This is a subtle point, but worth noting: it’s another example of the LLMs needing some assistance along the way.
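
In practice, that re-running step amounts to a generate-validate-retry loop. Below is a minimal sketch of what such a loop could look like; the validity check (simply trying to parse the output as Python) and the function names are assumptions, not the researchers’ actual harness.

```python
import ast

def looks_like_python(output: str) -> bool:
    """Rough validity check: does the model output parse as Python source?"""
    try:
        ast.parse(output)
        return True
    except SyntaxError:
        return False

def collect_valid_submissions(generate, target: int = 10, max_attempts: int = 100) -> list[str]:
    """Re-run the prompt until `target` parseable outputs are collected.
    `generate` is any zero-argument function that returns one model output."""
    submissions: list[str] = []
    attempts = 0
    while len(submissions) < target and attempts < max_attempts:
        attempts += 1
        output = generate()
        if looks_like_python(output):
            submissions.append(output)
    return submissions
```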


Experiment Results

Performance results

We’ve got all the setup out of the way, so let’s jump right in.

[Figure: Average marks for each submission category]

Takeaways

  • Humans stay on top! (For now.) The students scored an average mark of 91.9%, compared to 81.1% for the best-performing AI category, GPT-4 with prompt engineering.
  • Prompt engineering makes a big impact. GPT-4 jumps from 71.9% to 81.1%, a ~12.8% relative gain, and GPT-3.5 goes from 30.9% to 48.8%, a ~58% relative gain (quick arithmetic check below).
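
Those relative gains are simple arithmetic over the average scores reported in the paper:

```python
# Relative gains from adding prompt engineering, using the averages above
gpt4_gain = (81.1 - 71.9) / 71.9    # ~0.128 -> ~12.8%
gpt35_gain = (48.8 - 30.9) / 30.9   # ~0.579 -> ~58%

print(f"GPT-4: {gpt4_gain:.1%}, GPT-3.5: {gpt35_gain:.1%}")
```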

AI Detection

For the second experiment, the researchers asked the evaluators to rate each submission on a Likert scale, assessing how likely it was that the work was produced by AI. The Likert scale had four options, with accompanying scores:

  1. Definitely AI (score: 0)
  2. Probably AI (score: 1)
  3. Probably Human (score: 2)
  4. Definitely Human (score: 3)

[Figure: Distribution of AI-vs-human Likert ratings for each submission category]

Average scores:

  • GPT-3.5: 0.033
  • GPT-3.5 with prompt engineering: 0.2
  • GPT-4: 0.467
  • GPT-4 with prompt engineering: 1.167
  • Student only: 2.367

There is a clear pattern amongst the AI-related categories. The addition of prompt engineering leads to outputs that are more accurate and sound more like student responses. These two attributes happen to be one and the same in this example since the student responses were the highest performing category.

When converting the Likert scale to a binary “AI” or “Human” classification, the 3 evaluators had an average accuracy rate of ~85%. A sketch of that conversion is below.
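
Here is a minimal sketch of how that Likert-to-binary conversion and accuracy calculation work. The example ratings are made up for illustration; only the 0-3 scale comes from the paper.

```python
# Made-up (true_label, grader_score) pairs; scores follow the 0-3 Likert scale above
ratings = [
    ("ai", 0), ("ai", 1), ("ai", 2),
    ("human", 3), ("human", 2), ("human", 1),
]

def to_binary(score: int) -> str:
    """Scores 0-1 ('Definitely/Probably AI') map to AI; 2-3 map to Human."""
    return "ai" if score <= 1 else "human"

correct = sum(1 for true_label, score in ratings if to_binary(score) == true_label)
accuracy = correct / len(ratings)
print(f"Accuracy: {accuracy:.0%}")  # 4 of 6 correct -> 67% for this toy example
```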

The evaluators were able to discern a pattern among the graphical plots in the AI outputs: they tended to appear slightly misaligned, with features such as unusual font sizes and positioning. Student submissions, by contrast, had bolder and more unique design choices, such as unconventional color palettes in their plots.

These nuanced, creative decisions by the students served as the clear differentiator from their AI counterparts. In many cases these decisions were ‘non-traditional’ or slightly unconventional, highlighting a potentially very important point about human creativity compared to the current state of LLMs.

Let's look at one final graph comparing the words used by AI vs the students.

[Figure: Side-by-side bar charts comparing unique word counts and average word lengths for AI vs. student submissions]

  • AI-generated texts tend to have a slightly higher diversity of vocabulary, as shown by a greater number of unique words compared to human-authored texts.
  • Words used by AI are, on average, longer than those used by humans, indicating a tendency toward more complex or technical language.
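
For anyone curious how such metrics are typically computed, here is a small sketch; the tokenization approach (a simple regex word split) is an assumption, not necessarily what the researchers used.

```python
import re

def text_stats(text: str) -> tuple[int, float]:
    """Return the two metrics compared above: unique word count and average word length."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    unique_words = len(set(words))
    avg_word_length = sum(len(w) for w in words) / len(words) if words else 0.0
    return unique_words, avg_word_length

# Illustrative usage on a made-up snippet:
print(text_stats("The measured period of the pendulum increases with the pendulum length."))
```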

Conclusion

While the students were able to outperform the models for now, it’s unclear how long that trend will last, especially considering the significant jump in performance from GPT-3.5 to GPT-4.

The experiments also highlight a trend we see elsewhere in academia and in production: a little prompt engineering goes a long way in producing better and more reliable outputs.

This paper is great because it gives a snapshot in time of how humans stack up against AI on a specific type of knowledge-based task. Bookmark this one to review in a year!

Dan Cleary
Founder