LLMs are great at generating text, which makes them pretty good at writing code. But, in their current state, using LLMs for code generation, particularly when it comes to generating complex code, isn’t straightforward. There are many failure points along the way. Some of the most common problems are:

  • Logical Errors: LLMs often misinterpret the logical requirements of a task, leading to incorrect or nonsensical code behavior.
  • Incomplete Code: Important sections of code can be left out entirely.
  • Misunderstanding Context: Models may fail to grasp the full context of the prompt, causing them to generate code that doesn't align with the intended use.

Additionally, there isn't much clear data about which types of code generation errors are most common, or whether different LLMs make the same errors.

Luckily, thanks to research teams at the University of Illinois, the University of Alberta, Purdue University, the University of Tokyo, and other institutions, we have some empirical evidence on where LLMs go wrong when generating code, from their paper, Where Do Large Language Models Fail When Generating Code?

We'll look at the most common types of errors when using LLMs for code generation, how different models face different challenges, and what you can do to generate better, more complete code with LLMs.

Experiment setup

To better understand when using LLMs for code generation falls short, the researchers tested multiple models on code generation tasks from the HumanEval dataset and analyzed the resulting errors.

The widely used dataset consists of a variety of Python programming tasks designed to test models' code generation capabilities. Here's an example task:

# [Task 146] Return the number of elements in the array that are
# greater than 10 and both first and last digits are odd.
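
For reference, here's a minimal sketch of a solution that would pass this kind of task (the function name specialFilter matches how this task typically appears in HumanEval, but treat the details as illustrative rather than as the paper's code):

def specialFilter(nums):
    # Count numbers greater than 10 whose first and last digits are both odd.
    odd_digits = {"1", "3", "5", "7", "9"}
    count = 0
    for num in nums:
        if num > 10:  # guarantees the number is positive, so str() has no sign character
            s = str(num)
            if s[0] in odd_digits and s[-1] in odd_digits:
                count += 1
    return count

print(specialFilter([15, -73, 14, -15]))         # 1
print(specialFilter([33, -2, -3, 45, 21, 109]))  # 2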

A total of 558 incorrect code snippets were identified by running the generated code against the provided unit tests.

The errors were analyzed along two dimensions – semantics and syntax:

  • Semantic Errors: High-level logical mistakes, such as missing conditions or incorrect logical directions, that reflect the LLM's misunderstanding of task requirements.
  • Syntactic Errors: Specific code errors, like incorrect function arguments or missing code blocks, that indicate issues in code structure and syntax.

Models Used:

  1. CodeGen-16B: An open-source model by Salesforce trained on a large dataset of Python code.
  2. InCoder-1.3B: Developed by Meta AI, this model uses a causal masking objective to generate code.
  3. GPT-3.5: No intro needed
  4. GPT-4: If GPT-3.5 had a much smarter sibling
  5. SantaCoder: Part of the BigCode project, trained on a diverse dataset of programming languages.
  6. StarCoder: Another BigCode project model, trained on a comprehensive dataset including multiple programming languages.

We'll dive deeper into results later, but here are the performances for each of the models.

A table with the models and their performance on the HumanEval dataset

Dataset: HumanEval

HumanEval is a dataset specifically designed to benchmark LLMs’ ability to generate correct code. It comprises 164 hand-written Python programming tasks, each with unit tests.

The tasks cover various aspects of programming (logic, algorithms, etc.).

Hey everyone, how's it going? This is Dan here. It is a humid day here in New York, a humid Saturday. We just got back from a little run, and we're going to talk about using LLMs for code generation. So if you're using LLMs for code generation, code auditing, or any kind of code-related tasks, you've probably encountered a few problems. Whether it's LLMs kind of misunderstanding the logic, misunderstanding the instructions, important pieces of the code being left out or incorrectly produced, or the context being misunderstood. Usually, the context being misunderstood can be related to the fact that you're asking it to work on a specific section of code, but it needs knowledge of the whole codebase to be effective.

There are a lot of unanswered questions about where LLMs fail in these steps when trying to get an output. What are the most common types of code generation errors? Is it syntactic or semantic? Which models fail at which points? Is there a difference? There is a lot of trial and error you can go through to uncover this. There's also a recent paper, which we’ll be diving into today, that looks at this exact topic.

What they did is basically test a bunch of different models, including GPT-4, GPT-3.5, a Salesforce model, something from Meta, and a couple of other ones. They tested them on the HumanEval dataset, which is 164 Python tasks. So, 164 tasks across models is pretty extensive. In total, there were 558 incorrect code snippets, and they generally categorized these into two areas: semantic errors, which are logical mistakes, missing conditions, incorrect logic in the code reflecting a misunderstanding of the task requirements; and syntactic errors, which are specific code errors related to the language, incorrect function arguments, and things along those lines.

In terms of methodology, it’s pretty straightforward. They sent these programming tasks to the model, got the output, and tested it against unit tests provided in the eval set. If there was an error, they classified it and did a bunch of analyses based on that, which we will take a look at.

Some semantic errors include additional errors, reference errors, bad code like unnecessary or unusable code, missing steps, and memory errors when hitting infinite loops. Here are the results. We can see that all the LLMs share a lot of overlap. Incorrect condition is pretty high up on most of the models. Smaller models like InCoder and CodeGen were more likely to generate meaningless code compared to some of the larger models. The larger models tended to make more constant value errors. GPT-4 overall performs the best; only seven of the 13 error characteristics even showed up, so there are a bunch of 0% for all these other types of semantic errors. This could be attributed to the fact that it's a better model with more parameters and more training.

Something worth noting here is that for the same Python task, different LLMs produced buggy code with varying error types. It wasn’t like there was one Python programming task where all of them produced garbage code, or all the models did this or that. Different models have different strengths and require different prompting approaches and styles. Testing is the biggest thing. If you find one of these weaker spots through your own testing, you can adjust the prompts to solve that specific problem, fine-tune to fill in the gaps, or use different models for specific tasks.

Moving on to syntactic errors, this is much more related to the actual code itself, like errors from if statements, looping errors, functional errors, passing the wrong arguments, and things along those lines. We see a similar distribution across all models. Missing code block or incorrect code block is high for all models. GPT-4's errors are more well-contained, making it much more predictable. Again, we see a bunch of 0% in many categories.

The next part of the study looked at how bad these errors are. The researchers took all the times the models produced incorrect code and looked at two metrics to see how wrong they were: Jaccard similarity and Levenshtein distance. Jaccard similarity measures the overlap between the correct output and the model's output, while Levenshtein distance measures the number of edits needed (insertions, deletions, or substitutions) to correct the incorrect code snippet. A lower Levenshtein distance is better, since it means fewer edits are needed.

Results show that when models produced incorrect code, it was often far off, with minimal overlap with the correct answer and requiring lots of edits. While GPT-3.5 and GPT-4 had the highest overall accuracy, they had the largest deviations on both metrics. They usually get it right, but when they get it wrong, it's significantly off.

Not every mistake is equal. They broke down the severity of errors into single-line, single hunk, and multi-hunk (a hunk is a code block). Most errors are single or multi-hunk, requiring more effort to fix. This reinforces that when LLMs generate incorrect code, it varies significantly.

An interesting additional analysis compared prompt length versus code quality. They found that smaller prompts generally had higher success rates, while longer prompts increased the likelihood of errors. Prompts with fewer than 50 words performed better. Longer prompts often produced meaningless or garbage code.

This is all context and use-case dependent, but it highlights that keeping prompts short and concise is crucial. Long prompts are not always better. Start with clear, concise, and specific task descriptions. Not all long prompts are bad, especially with few-shot prompting, but concise instructions are key.

That's it for today. Thanks, guys!

Methodology

Researchers used the following steps to conduct their analysis:

  1. Data Collection: Each LLM was prompted with tasks from the HumanEval dataset, and the generated code was collected.
  2. Error Identification: Incorrect code snippets were identified by running the generated code against the provided unit tests.
  3. Error Classification: The errors were classified into semantic and syntactic categories.
  4. Statistical Analysis: The correlation between different error characteristics and factors like prompt length, code length, and test-pass rate was analyzed.
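
To make step 2 concrete, here's a minimal sketch of how pass/fail can be determined for a HumanEval-style task. The field names (prompt, test, entry_point) follow the public HumanEval format; the real evaluation harness adds sandboxing and timeouts, which are omitted here for brevity:

def passes_unit_tests(task: dict, completion: str) -> bool:
    # Stitch the original prompt, the model's completion, and the test code together.
    program = task["prompt"] + completion + "\n" + task["test"]
    env = {}
    try:
        exec(program, env)                       # defines the candidate function and check()
        env["check"](env[task["entry_point"]])   # HumanEval tests expose check(candidate)
        return True
    except Exception:
        return False                             # any assertion or runtime error counts as a failure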

Experiment results

With the setup out of the way, let’s jump right into results.

Error types

The study categorized errors into semantic and syntactic types. We’ll start with the semantic errors.

Semantic Errors: High-level, logical mistakes that reflect the model's misunderstanding of task requirements. These include:

  • Condition Errors: Missing or incorrect conditions in the code.
  • Constant Value Errors: Incorrect constant values set in function arguments, assignments, or other parts of the code.
  • Reference Errors: Incorrect references to variables or functions, including undefined names or wrong methods/variables.
  • Operation/Calculation Errors: Mistakes in mathematical or logical operations.
  • Garbage Code: Unnecessary code parts that do not contribute to the intended functionality, such as meaningless snippets, only comments, or wrong logical direction.
  • Incomplete Code/Missing Steps: Absence of crucial steps needed to achieve the task.
  • Memory Errors: Infinite loops or recursions that never terminate.
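
To make a couple of these categories concrete, here's a made-up example (not from the paper) showing a condition error and a missing step for a simple counting task:

# Task: count the numbers greater than 10 whose last digit is odd.
def count_special(nums):
    count = 0
    for num in nums:
        if num > 1:      # condition error: should be num > 10
            count += 1   # missing step: never checks whether the last digit is odd
    return count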

6 horizontal bar charts displaying the semantic error type breakdown for each model
Distribution of the 13 semantic characteristics for each LLM.

Takeaways

  • All LLMs share issues like incorrect conditions and wrong logical directions, indicating they struggle with handling complex logic conditions regardless of model size and capability.
  • Smaller models (InCoder and CodeGen) were more likely to generate meaningless code and/or code that missed multiple steps.
  • Larger models (GPT-3.5 and GPT-4) tended to make more constant value errors and arithmetic operation errors.
  • Overall, GPT-4 performed the best, exhibiting only 7 of the 13 semantic characteristics, while the other, smaller models exhibited all or most of the error types. More parameters, fewer problems.

Interestingly, even for the same task, different LLMs produced buggy code with varying error types.
Given this information, if you're using LLMs for code generation, you could:

  • Test specific prompt engineering methods to overcome the issues. We’ve written before about how different models require different prompting approaches.
  • Fine-tune one of these models based on where it falls short to fill in the gaps.
  • Use an ensemble of models for different tasks in your product.

Let's move on to the syntactic errors.

Syntactic Errors: Specific code errors that indicate issues in the structure and syntax of the generated code. These include:

  • Conditional Error: Errors within 'if' statements, causing incorrect code behavior.
  • Loop Error: Mistakes in 'for' or 'while' loops (incorrect boundaries or mismanaged variables).
  • Return Error: Errors in return statements, returning wrong or incorrectly formatted values.
  • Method Call Error: Errors in function calls, including incorrect function names, wrong arguments, or incorrect method call targets.
  • Assignment Error: Errors in assignment statements.
  • Import Error: Errors in import statements.
  • Code Block Error: Multiple statements incorrectly generated or omitted.
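
Again, here's a hypothetical example (not taken from the paper) of a method call error and a code block error for the same simple task:

# Task: return the list sorted in descending order.
def sort_desc(nums):
    return nums.sort(reverse=True)   # method call error: list.sort() sorts in place and returns None

def sort_desc_v2(nums):
    # code block error: the block that should build and return the sorted copy is missing
    pass

The correct one-liner in both cases would be return sorted(nums, reverse=True).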

6 horizontal bar charts displaying the syntactic error type breakdown for each model
Distribution of the 13 syntactic characteristics of each LLM

Takeaways

  • The distribution of errors is relatively similar across all models.
  • Across all models, the most common error locations are entire code blocks and 'if' statements. This suggests that many code generation errors are significant and require a lot of work to fix.
  • GPT-4’s errors are more well-contained in a smaller number of categories. This suggests that GPT-4 has fewer and more predictable areas of difficulty compared to other models.
  • GPT-4 didn’t produce any errors in multiple categories (see chart where the value is 0%).
  • CodeGen-16B and InCoder-1.3B frequently have errors with incorrect function names. GPT-3.5, SantaCoder, and StarCoder more often encounter incorrect function arguments.
  • More than 40% of the syntactic errors made by all six LLMs could be grouped into missing code block and incorrect code block.

Repair Effort

Errors are one thing, but their severity and how long they take to fix is another.

To figure out how “wrong” the generated code was, the researchers leveraged two metrics: Jaccard similarity and Levenshtein distance.

Jaccard Similarity: Treats code as a set of tokens (for more info on tokens, check out our article here), and measures similarity by the overlap between two snippets. Lower Jaccard similarity indicates fewer common tokens between the generated code and the ground truth, which means the code is less accurate.

Levenshtein Distance: Measures the minimum number of edits (insertions, deletions, or substitutions) needed to correct incorrect code snippets, providing a direct measure of the repair effort. A lower Levenshtein distance indicates that fewer changes are needed to correct the code, which means the generated code is closer to the ground truth.
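
Here's a rough sketch of both metrics computed over whitespace-separated tokens. The paper tokenizes code more carefully, so the absolute numbers won't match theirs; this is just to show what each metric captures:

def jaccard_similarity(generated: str, ground_truth: str) -> float:
    # Overlap between the two token sets: 1.0 means identical vocabularies, 0.0 means no overlap.
    a, b = set(generated.split()), set(ground_truth.split())
    return len(a & b) / len(a | b) if (a | b) else 1.0

def levenshtein_distance(generated: str, ground_truth: str) -> int:
    # Classic dynamic-programming edit distance over tokens.
    a, b = generated.split(), ground_truth.split()
    prev = list(range(len(b) + 1))
    for i, token_a in enumerate(a, 1):
        curr = [i] + [0] * len(b)
        for j, token_b in enumerate(b, 1):
            curr[j] = min(prev[j] + 1,                         # deletion
                          curr[j - 1] + 1,                     # insertion
                          prev[j - 1] + (token_a != token_b))  # substitution
        prev = curr
    return prev[-1]

buggy   = "return sum(1 for x in nums if x > 1)"
correct = "return sum(1 for x in nums if x > 10)"
print(jaccard_similarity(buggy, correct))    # high overlap, close to 1
print(levenshtein_distance(buggy, correct))  # a single token substitution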

Below is a graph with the results for each metric.

Jaccard Similarity Scores broken down by model
Jaccard Similarity scores for each model

Levenshtein Distance scores broken down by model
Levenshtein Distance scores for each model

As you can see, the LLM-generated code is often very different from the ground truth. These aren’t just minor errors.

While GPT-3.5 and GPT-4 had the highest overall accuracy, they had the largest deviations when generating incorrect code. They had the highest median Levenshtein distances.

So, the GPT models are more accurate, but when they get it wrong, they get it really wrong.

Not every mistake is equal. A syntactical error of a colon versus a semi-colon is much different than an entire code block being incorrect. The researchers broke down the errors into three categories based on the effort required to fix them:

  1. Single-line errors
  2. Single-hunk errors
  3. Multi-hunk errors

A "hunk" refers to a contiguous block of several lines of code.
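
As a rough, made-up illustration (not from the paper): a single-line error might only need one condition changed, while a multi-hunk error needs edits in several separate places:

# Single-line repair: only one line has to change.
def count_large(nums):
    return sum(1 for x in nums if x > 1)   # fix: x > 10

# Multi-hunk repair: two separate blocks have to change.
def count_large_v2(nums):
    count = 0
    for x in nums:
        if x < 10:         # hunk 1: comparison points the wrong way
            count += 1
    return count - 1       # hunk 2: spurious off-by-one adjustment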

A bar chart breakdown of the types of errors, broken down by model
An analysis of the types of errors, broken down by model

As you can see, a majority of the errors were single-hunk or multi-hunk, which require a lot of work to repair.

Here’s the main takeaway: When the LLMs generated incorrect code, they tended to generate code that deviated significantly from the ground truth code.

Does Prompt Length Affect Code Quality?

Next up was my favorite part of the study. The researchers analyzed the relationship between prompt length and the model’s ability to generate correct code.

The average prompt length in HumanEval is approximately 67 words, and 40% of the prompts include 50 words or fewer.

distribution of pass/fail rate compared to prompt length, broken down by model

Takeaways

Effect of Prompt Length:

  • Prompts with fewer than 50 words generally led to better performance across all models.
  • Prompts exceeding 150 words significantly increased the likelihood of errors.

Types of Errors in Long Prompts:

  • Garbage Code: A large portion (64%) of errors in long prompts resulted in garbage code, where the generated code included unnecessary parts that didn't contribute to solving the task.
  • Meaningless Code Snippets: Around 37.5% of errors in long prompts were "meaningless snippets" (syntactically correct, but failing to address the task requirements).
  • Only Comments: Some long prompts resulted in code that consisted only of comments.
  • Wrong Logical Direction: Long prompts often caused the models to generate code that deviated significantly from the intended task logic.

Prompt engineering implications

  • Longer prompts ≠ better prompts.
  • Test using concise and focused prompts (a best practice anyway; see the illustration after this list).
  • Providing clear and specific task descriptions without unnecessary details can reduce the occurrence of garbage code and other errors.
  • This doesn't mean all long prompts are bad! Longer prompts can be necessary, especially when doing things like Few Shot Prompting.
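
As a made-up illustration of the difference (these prompts are not from the paper):

Verbose: "I have a list of numbers and I was wondering if you could maybe write me something in Python, ideally clean and well commented, that goes through the list, figures out for each number whether it is bigger than ten, also checks whether its first and last digits happen to be odd, and then tells me how many numbers like that there are."

Concise: "Write a Python function count_special(nums) that returns how many numbers in nums are greater than 10 and have odd first and last digits."

Both describe the same task, but the second gives the model far less room to wander.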

Wrapping up

If you're using LLMs for code generation, there are a lot of potential points of failure along the way. Hopefully this guide helps you get out in front of a few of them. Precise and clear prompt engineering is always important, but it might be even more important when using LLMs to generate code. As we always say, it's an iterative process when working with LLMs!

Headshot of PromptHub founder Dan Cleary
Dan Cleary
Founder