Assitants API is here!

OpenAI has released an Assitants API that is one solution that attempts to solve some of the problems that developers faced when building chatbots (memory, context windows, history etc). The methods below are still valuable to know and be aware of, as the Assitants API does have its own limitations.

I was lucky enough to recently speak and attend the first AI Engineer Summit in San Francisco this past month. There, OpenAI developer advocate Logan Kilpatrick deemed 2023 the year of the chatbot. Once ChatGPT came onto the scene,  everyone became interested in building some sort of chatbot.

I’ve seen it firsthand. Lots of teams leverage PromptHub to help in the development of their chatbots.

A decision that every chatbot developer will run into is how to manage chat history. LLMs are stateless, without any intervention, each request will be sent without knowledge of previous interactions. It’s on the developer to build “memory” into the chatbot. There are several ways to do this with varying complexity. Let's start with the most basic.

Total recall

This method involves sending the whole chat history with every new user message. Subsequently, each new user message and chatbot response are appended to this history.

An image of a few messages in the ChatGPT interface with the word history and arrows pointing to messages
Every message is logged into a history variable

A flow chat of a user message being added to chat history and then the chatbot respoding
Each additional message is added to the history, which is sent to the chatbot

Pros:

  • Full Context: The chatbot gets the complete transcript of the ongoing conversation.
  • Easy to Implement: All you need to do is create a history variable and update it with the latest messages after each request.

Cons:

  • Large Requests: As the conversation progresses, the volume of data exchanged between your backend and the user will grow quickly, leading to performance issues.
  • Context Window Issues: Some models have a context window capped at ~4k tokens. You’ll reach this limit quickly if you append the whole conversation history with each request.

When to use this method:

Ideal for short to medium-length interactions where retaining full context is crucial, such as customer support scenarios.

Summarization

As we now know, chat history can quickly fill up the context window. But not every single word or message may be critical. Summarization can be used to condense the chat history to its main points. This will help us reduce the total size of the history, while (hopefully) retaining enough information to facilitate a coherent conversation.

In practice, this involves using a prompt to summarize the conversation history and then including that summarization as context for the chatbot. So with each message from the user, the chatbot also gets a summarization of the conversation thus far.

The prompt might be something like:

Here's what the flow could look like:

Flow chart for the summarization method between user and chatbot
With each new message, the summary is updated and sent to the chatbot

Pros

  • More Efficient: Summarization should decrease the volume of data sent with each request, improving response times.
  • Maintains Relevance: The summarization prompt emphasizes the main elements of the conversation, allowing the chatbot to grasp the primary intent without being bogged down by details.
  • Enables Longer Conversations: Summarization gives the model more room in it’s context window to continue receiving user messages without reaching the context window limit.
  • Relatively Straight Forward Implementation: This method only requires a single additional API request, making it easy to understand and implement.

Cons

  • Potential Loss of Nuance: Summarization might lose out on key, subtle, details from earlier in the conversation
  • Token Usage for Summarization: Every request now has an additional API call, potentially raising the average cost per conversation.
  • Dependence on LLM’s Summarization: The success of this method hinges on the LLM's ability to accurately summarize prior messages.

When to use this method:

This approach is ideal for longer interactions where the essence of the conversation is more important than specific details, such as medical or legal consultations.

Sliding window

The sliding window technique prioritizes the “short-term memory” of your chatbot. This approach allows the model to retain a specified number of the most recent messages or tokens. This could be just the last few messages, the last 10 or whatever works best for you. As the conversation progresses, older messages will slide out of memory and newer ones will slide in.

If you’ve ever felt that ChatGPT seemed to forget an earlier part of the conversation, it might be the sliding window method in action.

A graphic showing the sliding window theory in action
With each message, older message fall out of the context shared with the chatbot

Pros:

  • Relevance: Older, potentially less important parts of the conversation are automatically discarded, giving priority to the most recent messages.
  • Efficiency: Thoughtfully limiting the number of recent messages or tokens in the history can lead to faster and more effective requests.

Cons:

  • Loss of Older Context: If a user refers to an earlier part of the conversation that falls outside the window, the chatbot will not have the necessary context.
  • Balancing Act: Determining the optimal size of your sliding window can be challenging. Too small, and you'll risk missing vital context; too large, and you might as well include the entire conversation.

When to use this method:

Imagine a trivia game chatbot. The user asks a series of questions, and the chatbot provides answers. The chatbot doesn't need to remember the first question after ten have been asked. It only needs to focus on the most recent ones to provide relevant answers and info.

Vector embeddings with RAG

All methods we’ve discussed so far involve sending portions of text from previous messages. This last method involves converting words into embeddings (numerical representations of words). These embeddings are then compared to previous message embeddings to gauge their similarity. If a user’s message is similar enough to previous message embedding, then that related embedding will be sent to the chatbot as context for the chatbot.

For more info on embeddings, check out our beginner's guide here.

Lets break this down step by step

Step 1: Transforming chat messages into embeddings

What are embeddings?
Embeddings are numerical representations of data (words, in our case). In our context an embedding captures the essence, sentiment and context of the messages in a form that machines can easily process.

For example, the word "angry" might be translated to the following embedding.

the word "angry" on the right and the vector representation of it n the left


The transformation process

Messages from your user and chatbot are first processed through a neural network. This networks convert the textual data into vectors in a high dimensional space. The position and orientation of these vectors in this space encapsulate the meaning and nuances of the conversation.

Why bother using embeddings?

As we saw in the previous examples, storing entire conversations can be inefficient and costly. Embeddings are very compact and enable chatbots to more easily maintain and access the necessary conversation context without storing a single word

Step 2: Comparing new messages with history embeddings

Vector similarity

Every new message the comes to our chatbot will be transformed into an embedding. The similarity between this new embedding and the previous ones provides insight into the context and how the new message relates to the previous ones. This comparison is extremely fast and efficient.

Contextual Understanding

By comparing embeddings, your chatbot can swiftly ascertain whether the new message is a continuation of or related to a previous topic.

Step 3: The role of Retrieval Augmented Generation (RAG)

What is Retrieval Augmented Generation?

Retrieval Augmented Generation (RAG) is a prompt engineering method designed to provide context, dynamically, based on the specific situation. Instead of requiring the model to sift through its entire training dataset, RAG fetches the most pertinent context for the model.

How RAG works with embeddings

RAG will search through previous message embeddings and compare their similarity to the latest message. This similarity is quantified as a single number, often calculated using Cosine similarity. For instance, a similarity score of 0.8 between embedding A and B indicates that they are approximately 80% alike.

Once all embeddings that are above the similarity threshold have been identified, they are sent to the chatbot as context.

The prompt might look something like this:

Here's what the flow might look like:

Flow chart of conversation flow between user and chatbot using embeddings and RAG

Pros

  • Focused Contextual Understanding: Embeddings enable us to retrieve the most relevant parts of the conversation, enhancing the chatbot’s coherence and contextual awareness..
  • Efficient Conversation Management: Instead of having to revisit the entire conversation history, RAG pinpoints the most relevant sections.
  • Dynamic Adaptability: As the conversation evolves, RAG adjusts, bringing forward the most relevant historical context, regardless if it was from message 1 or message 100.
  • Better UX: The end results is a smoother, more natural conversation flow.

Cons

  • Complexity in Implementation: This method is harder to implement compared to some of the more basic methods we looked at before.
  • Balancing the Similarity Threshold: Similar to managing the sliding window, fine-tuning the similarity threshold will take some work. A strict threshold might overlook relevant sections, while a lenient one might retrieve excessive, potentially irrelevant sections.
  • Computational Overhead: While more efficient than processing raw text, continuously comparing embeddings will introduce some computational overhead.
  • Dependency on Quality Embeddings: This whole method hinges on the quality of the embeddings.

Implementing this method isn’t all that difficult, but making sure it works well in the real-world is where the real challenge lies. Specifically, determining the appropriate threshold for matching historical embeddings and refining the grounding prompt will require iteration.

Wrapping up

Launching a chatbot is easy. Making sure it functions well isn’t. These methods hopefully provide a good starting point and give you a few ideas on how to improve your chatbot.

The deeper you go, the more questions you’ll have to answer based on your use case (should we use a sliding window with embeddings RAG search? As the conversation goes on for longer and longer, will we need de-duplicate similar embeddings?).

If you need any help, feel free to reach out to us.

Dan Cleary
Founder