We recently launched Anthropic’s models on PromptHub. Their larger context windows allow for much more token usage, and in chatting with users, that point keeps coming up.

These conversations sparked some curiosity, so we decided to run two experiments:

  1. Benchmark average response times for OpenAI, Azure, and Anthropic models, based on output token count.
  2. Test whether input token length has any effect on response times.

LLM response time should scale linearly with output token count. Input tokens can be processed all at once, in parallel, because the model has the full prompt up front; output tokens, however, are generated one at a time, sequentially (more on that in a great article here).
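To make that intuition concrete, here is a minimal sketch of an autoregressive generation loop. The ToyModel class and its prefill/decode_step methods are invented purely for illustration (they are not any provider’s real API): the prompt is handled in one pass, while output tokens come out of a sequential loop.

```python
import random
from dataclasses import dataclass


@dataclass
class ToyModel:
    """Stand-in for a real LLM; the method names are purely illustrative."""
    eos_token: int = 0

    def prefill(self, prompt_tokens):
        # The whole prompt is processed together in one parallel pass.
        return list(prompt_tokens)

    def decode_step(self, state):
        # Each new token depends on everything generated so far, so decoding
        # runs as a sequential loop: one forward pass per output token.
        next_token = random.randint(1, 100)
        return next_token, state + [next_token]


def generate(model, prompt_tokens, max_new_tokens):
    state = model.prefill(prompt_tokens)      # parallel: one pass for the prompt
    output = []
    for _ in range(max_new_tokens):           # sequential: one pass per token
        token, state = model.decode_step(state)
        output.append(token)
        if token == model.eos_token:
            break
    return output


print(len(generate(ToyModel(), prompt_tokens=[1, 2, 3], max_new_tokens=5)))
```

The decode loop is why response time grows with output length: every additional output token costs another pass through the model.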

Latency across different models

Our goal was to understand how much latency each additional output token adds. So we ran a few prompts across 8 models at various output lengths, and here are the results.

The x-axis is the output token count, the y-axis is response time, each point represents a single response, and the line is the line of best fit.

8 graphs for different models, all showing the relationship between output tokens and response times
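If you want to run the same kind of fit on your own measurements, a minimal sketch with NumPy looks like this. The data points below are made-up placeholders, not our benchmark numbers:

```python
import numpy as np

# Hypothetical (output_tokens, response_seconds) pairs for one model,
# placeholder numbers rather than the measurements from our benchmark.
output_tokens = np.array([120, 250, 400, 650, 900, 1200])
response_secs = np.array([4.1, 7.8, 12.3, 20.0, 27.5, 36.9])

# Line of best fit: response_time ~ slope * output_tokens + intercept
slope, intercept = np.polyfit(output_tokens, response_secs, 1)
print(f"~{slope * 1000:.1f} ms per output token, {intercept:.2f} s fixed overhead")
```

The slope is the number we care about: milliseconds of latency added per output token generated.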

Here are the averages

A table of models and their average latency, in milliseconds per output token generated

Takeaways

  • Azure and OpenAI have roughly the same speed for GPT-3.5.
  • For GPT-4, Azure is three times faster than OpenAI.
  • For GPT-3.5-Instruct, Azure is 1.5 times faster than OpenAI.
  • Claude 2, Anthropic’s most capable model, is faster than OpenAI’s hosted GPT-4, but not faster than GPT-4 hosted on Azure.
  • Within OpenAI, GPT-4 is almost three times slower than GPT-3.5 and 6.3 times slower than GPT-3.5-Instruct.

Based on these values, you can estimate the response time for any API call. Factors other than output tokens, such as network conditions, can also affect response times, but this should get you in the right ballpark.

For example, a request to Claude 2 with 1000 output tokens should take roughly 31 seconds.

31 ms/token × 1000 tokens = 31,000 ms = 31 seconds
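In code, the same back-of-the-envelope estimate is just a multiplication (the helper name is ours; the 31 ms/token figure is the Claude 2 average from the table above):

```python
def estimated_response_seconds(ms_per_output_token: float, output_tokens: int) -> float:
    """Back-of-the-envelope estimate: per-token latency times expected output length."""
    return ms_per_output_token * output_tokens / 1000


# Claude 2 at roughly 31 ms per output token, 1000-token completion:
print(estimated_response_seconds(31, 1000))  # -> 31.0 seconds
```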

These numbers are certainly going to change as the models continue to develop, so we will keep them updated and send out a refreshed report monthly. If you’d like to get the latest report and monthly updates directly in your inbox, feel free to join our list here.

Is response time independent of input tokens?

As mentioned above, latency should scale linearly with output length, and for most of the models in our experiment, this holds true. The points fall very close to the line of best fit.

But when we looked at the graphs for the Claude models, we started to see some scatter.

Graph for Claude 2 showing output tokens vs response times

Graph for Claude Instant showing output tokens vs response times

When controlling for output tokens, the correlations between input tokens and latency for the Claude models are as follows (a sketch of this kind of calculation follows the list):

  • For Claude 2, the correlation is 0.749; for every additional 500 input tokens, latency increases by approximately 0.53 seconds.
  • For Claude Instant, the correlation is 0.714; for every additional 500 input tokens, latency increases by approximately 0.29 seconds.
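One common way to compute a “controlling for output tokens” number is a partial correlation on the residuals of two simple linear fits. Here’s a minimal sketch of that approach on synthetic placeholder data; it is not our measurements, and not necessarily the exact method behind the figures above:

```python
import numpy as np


def partial_corr(latency, input_tokens, output_tokens):
    """Correlation between latency and input tokens after controlling for
    output tokens: correlate the residuals of two simple linear fits."""
    def residuals(y, x):
        slope, intercept = np.polyfit(x, y, 1)
        return y - (slope * x + intercept)

    return np.corrcoef(residuals(latency, output_tokens),
                       residuals(input_tokens, output_tokens))[0, 1]


# Illustrative synthetic data, not our benchmark results:
rng = np.random.default_rng(0)
out_toks = rng.integers(100, 1500, size=50)
in_toks = rng.integers(100, 8000, size=50)
latency = 0.03 * out_toks + 0.001 * in_toks + rng.normal(0, 0.5, size=50)
print(round(partial_corr(latency, in_toks, out_toks), 3))
```

If response time really were independent of input length, this number would hover near zero; for the Claude models it clearly doesn’t.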

What does this mean? A reasonable deduction is that Anthropic’s models don’t process input tokens in parallel as efficiently as OpenAI’s. This inefficiency could have several causes, and it offers a small glimpse into the workings of these proprietary models.

Wrapping up

For those of you building in AI, here are actionable tips you can use today:

  • If you're using OpenAI, consider switching to hosting on Azure.
  • Test out the new GPT-3.5-Instruct model. It's ridiculously fast and performant.
  • Do some testing! These models change so often that it's important to continually test your prompts. PromptHub makes this easy with our side-by-side testing, batch testing, and multi-provider support. Join the waitlist today.

Got any feedback? Let us know; we want to make this as valuable as possible.

Dan Cleary
Founder