This one is for anyone using LLMs for summarization. A recent paper looked into how to make summaries more detailed, without being dense or hard to follow. Enter, Chain of Density (CoD).

What is Chain of Density prompting

Chain of Density is a prompt engineering method designed to generate a sequence of summaries that become progressively more detailed, without increasing in length.

An initial, entity-sparse summary is generated, containing only a few of the key nouns or noun phrases that encapsulate the article's main points. CoD then iteratively incorporates additional entities.

An entity is defined as:

  • Relevant: to the main story
  • Specific: descriptive yet concise
  • Novel: not in the previous summary
  • Faithful: present in the Article
  • Anywhere: located anywhere in the Article

Why Chain of Density

CoD was designed to address a few specific problems:

  • Oversimplification: typical LLM summaries tend to gloss over crucial details.
  • Lead bias: LLMs tend to focus primarily on the initial part of the content they summarize.
  • Fluff: keeping control over the summary's length forces the model to cut filler.

How Chain of Density works

The greatest advantage of CoD is its simplicity; the whole method is a single prompt. It follows these steps:

  1. Create an initial summary.
  2. Identify entities that were missing from the initial summary.
  3. Integrate 1-3 more entities into a new summary.
  4. Ensure the new summary is concise, while retaining all entities from the previous iteration.
  5. Repeat this process 5 times, each time incorporating more entities without extending the length of the summary.
  6. Output the final set of summaries as a list of dictionaries in JSON format. Each dictionary contains the keys "missing_entities" and "denser_summary".
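The steps above can be sketched as a single prompt plus a small parser. The prompt wording below is paraphrased from the paper's description rather than copied verbatim, and `build_cod_prompt` and `parse_cod_output` are illustrative helper names, not part of any library:

```python
import json

# Paraphrased Chain of Density prompt; the paper's exact wording differs.
COD_PROMPT_TEMPLATE = """Article: {article}

You will generate increasingly concise, entity-dense summaries of the
above Article. Repeat the following 2 steps 5 times.

Step 1: Identify 1-3 informative entities from the Article which are
missing from the previously generated summary.
Step 2: Write a new, denser summary of identical length which covers
every entity from the previous summary plus the missing entities.

Answer in JSON: a list of 5 dictionaries whose keys are
"missing_entities" and "denser_summary"."""


def build_cod_prompt(article: str) -> str:
    """Fill the article into the single CoD prompt."""
    return COD_PROMPT_TEMPLATE.format(article=article)


def parse_cod_output(raw: str) -> list[dict]:
    """Parse the model's JSON response into per-step dictionaries."""
    steps = json.loads(raw)
    for step in steps:
        # Every step must carry both keys requested by the prompt.
        assert {"missing_entities", "denser_summary"} <= step.keys()
    return steps
```

Because all five iterations happen inside one completion, you get the full sequence of denser summaries from a single API call and can pick whichever step suits your use case.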

Chain of Density template

You can try out Chain of Density right away via our template in PromptHub.

If you don't have PromptHub access but want to try it out, reply to the email that gets sent when you join the waitlist and I'll share an access code with you.

The chain of density prompt template in the PromptHub dashboard

Experiments

Setup

The researchers sampled 100 articles from CNN to generate CoD summaries. These summaries were evaluated by humans and GPT-4, and compared against human-written summaries.

Reference points

CoD summaries were compared to human-written summaries and those produced by GPT-4 using a vanilla prompt: "Write a VERY short summary of the Article. Do not exceed 70 words."

When comparing against human-written summaries, the researchers measured Entities/Tokens (how dense a summary is) and where human preference lies in relation to the CoD steps.
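The Entities/Tokens metric is straightforward to compute. A minimal sketch is below; the paper extracts entities with an NLP pipeline, whereas here the entity list is supplied by hand and tokens are just whitespace-split words, so treat it as an approximation:

```python
def entity_density(summary: str, entities: list[str]) -> float:
    """Ratio of entity mentions to whitespace tokens in the summary."""
    tokens = summary.split()
    # Count every occurrence of each known entity string.
    mentions = sum(summary.count(entity) for entity in entities)
    return mentions / len(tokens) if tokens else 0.0


# Example with a hand-picked entity list: 3 mentions over 6 tokens.
entity_density("Liverpool beat Arsenal 2-0 at Anfield.",
               ["Liverpool", "Arsenal", "Anfield"])  # -> 0.5
```

Tracking this ratio across CoD steps is what lets you compare a given step's density against human-written references.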

Human evaluation

First up was determining which of the 5 summaries generated by CoD was preferred. Four of the researchers independently chose their preferred summary for each article by evaluating them on informativeness and readability.

A table showing the scores from individuals related to various steps in CoD

  • Each researcher read through 500 summaries (100 articles * 5 CoD summaries/article)
  • The median preferred CoD step was step 3, suggesting a preference for a middle level of density.
  • As the CoD process progressed through more iterations, the content became more evenly distributed, reducing lead bias.

GPT-4 Evaluation

Next up, the researchers used GPT-4 to rate the CoD summaries along 5 dimensions: Informative, Quality, Coherence, Attributable, and Overall.

A table showing the scores from GPT-4 related to various steps in CoD and 5 different dimensions

  • Increased detail (densification) is correlated with informativeness.
  • Denser summaries scored lower on quality and coherence. Note that the quality and coherence ratings don't require the evaluator to reference the original article.
  • The Overall scores skewed toward denser summaries.
  • On average across dimensions, the first and last CoD steps were the least favored, suggesting that summaries that are either too sparse or too dense are suboptimal.
  • The correlation between GPT-4's ratings and the human evaluations was low (~0.31), indicating that AI and humans prioritize different aspects when judging summaries.

Where CoD lines up with human preferences

A major component of the research was to find the CoD step that most closely mimicked human written summaries.

The graph below shows the density, measured in Entities/Tokens, for human summaries, the vanilla GPT-4 prompt, and each CoD step. It is around step 3 that CoD most closely matches the density of human-written summaries.

Trade-offs between readability and informativeness

As we saw in the results, balancing density and readability is important when using CoD.

Here's an interesting example from the article:

2 examples of Chain of Density summaries from step 2 to 3

On the left, the summary is improved with more detail. The addition of the other team in the match (Liverpool), and their goal-scorers provide good context. The denser summary also makes some nice compressions: "a potential route back into the game" -> "a comeback".

On the right, the summary loses quality with more detail. The added details about the network, TV5Monde, seem out of place and are not relevant to the main story about the cyberattack. This is a direct result of having to tighten the previous summary and inject more entities.

Wrapping up

Chain of Density is a simple and effective method for generating summaries that rival human-written ones. The results suggest sticking to 2 or 3 steps for summaries that are in line with the density that humans prefer. The beauty of the method is its simplicity, requiring just a single prompt. Let us know if this helps your summarization prompts!

Dan Cleary
Founder