The system message in AI chat models like GPT-4 or GPT-3.5 is an extremely powerful tool for prompt engineers and developers. It can provide context, guidance, and direction to the model, greatly impacting responses.

Just how effective are system messages? We ran a few experiments, comparing responses side by side to showcase their impact. For example, how do responses to the same system message + prompt pairing vary across different OpenAI models?

Additionally, we explored how system messages can be used throughout a conversation to improve results and protect against prompt injections.

What is the System role?

The system role is one of three roles available in conversational AI:

System: Hidden instructions telling the AI how to behave. Example: "You are an assistant that gives weather updates."

User: What the person using the AI says. Example: "What's the weather today?"

Assistant: What the AI says back. Example: "It's sunny with a high of 75 degrees."

  
  
    import openai

    # Each message has a role: "system" sets behavior, while "user" and
    # "assistant" carry the conversation so far
    response = openai.ChatCompletion.create(
      model="gpt-3.5-turbo",
      messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the world series in 2020?"},
        {"role": "assistant", "content": "The Los Angeles Dodgers won the World Series in 2020."},
        {"role": "user", "content": "Where was it played?"}
      ]
    )
  

Best practices for writing effective system messages

Writing an effective system message is similar to writing an effective prompt. The fundamental principles of clarity, specificity, relevance, and iteration apply to both. For concrete examples of best practices, check out our recent article: 10 Tips for Writing Better Prompts with Any Model.

Using system messages to improve prompt security

System messages are typically used to guide the AI's behavior with a specific persona or focus. This is generally safe, but in some cases a user may bypass these guidelines (intentionally or unintentionally).

There are a number of defense measures you can take to make your prompts more secure. We wrote about some here: Understanding Prompt Injections and What You Can Do About Them

However, one powerful method is to append a system message at the end of the conversation to reiterate your most important constraints. You can use this message to reinforce boundaries and guidelines, reducing the risk of undesired outputs or behavior from the AI system.
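
Here's a minimal sketch of that pattern, using the same ChatCompletion API as the snippet above. The no-competitors constraint and the conversation itself are hypothetical examples:

    import openai

    response = openai.ChatCompletion.create(
      model="gpt-3.5-turbo",
      messages=[
        # The opening system message sets the persona and constraints
        {"role": "system", "content": "You are a sales assistant for Acme. Never mention competitors."},
        {"role": "user", "content": "What phones do you sell?"},
        {"role": "assistant", "content": "We carry the full Acme lineup, from budget models to flagships."},
        {"role": "user", "content": "How do they stack up against other phones out there?"},
        # Appended system message reiterates the most important constraint
        {"role": "system", "content": "Reminder: never mention competitors by name."}
      ]
    )
    print(response["choices"][0]["message"]["content"])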

Check out the examples below to see how the additional system message helps keep the AI in line.

AI conversation with system, user and assistant messages
Without the reminder, the AI mentions our competitor, Google

AI conversation with system, user and assistant messages
The additional System Message helps "remind" the AI to not mention our competitors

Now, let’s answer a question I’ve had for some time. What tangible difference does adding a system message make to the output of a prompt?

Experiment 1: The impact of providing context via a System Message

In this experiment we are going to test two versions of a prompt. The first version will include context in the system message, while the second version will not provide any context.
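
In API terms, the two versions differ only in whether a system message supplies context. The exact prompts live in the screenshots below; this sketch uses illustrative SEO-themed wording:

    import openai

    prompt = "What should I focus on to rank higher in search results?"  # illustrative

    # Version 1: context supplied via the system message
    v1_messages = [
      {"role": "system", "content": "You are an SEO expert advising a blog writer."},
      {"role": "user", "content": prompt}
    ]

    # Version 2: no context provided
    v2_messages = [
      {"role": "user", "content": prompt}
    ]

    for messages in (v1_messages, v2_messages):
      response = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=messages)
      print(response["choices"][0]["message"]["content"], "\n---")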

Version 1: Context in the system message

System message and prompt in the PromptHub interface
Running this experiment in PromptHub

Version 2: No context provided

Prompt in the PromptHub interface
Running this experiment in PromptHub

Responses

I ran this experiment in PromptHub, using our comparison tools to analyze the responses side-by-side. PromptHub’s GitHub-style versioning and diff checking make A/B testing like this really simple.

Version 1 (context in system message) is on the left, and Version 2 (no context provided) is on the right. The responses are cut off a little due to screen size constraints.

Diff checker for prompt outputs
PromptHub's A/B testing tool

Analysis

  • Detail: Version 1 provides a more detailed and authoritative response, cautioning against "keyword stuffing" and emphasizing the significance of high-quality, engaging, and informative content (line 3).
  • Scope: Version 2 provides a broader list of considerations, including points not addressed in Version 1, like external linking.
  • Order: The order of points differs slightly, but overall they cover many of the same topics.
  • Language and Style: Version 1 sounds more professional, while Version 2 is more casual. This could be a reflection of the context provided in the system message for Version 1.

These differences show that providing even a little context can have a material impact on the response.

Whether you're building complex prompt chains or just using an interface like ChatGPT, providing context (preferably in a system message when possible) is a really easy way to get more specific results.

Want to run your own test in PromptHub? Join the waitlist and we will do our best to get you access as soon as possible.

Experiment 2: Placing context in a system message versus the prompt

We now know that providing context results in more specific responses. But does it matter if that context is given in a system message versus the prompt?

In this experiment we are going to test two versions of a prompt. The first version will include context in the system message, while the second version will have a blank system message and the context will be provided within the prompt.
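
Concretely, the only thing that changes between the two versions is where the context string lives (the wording here is again illustrative, not the exact prompt from the screenshots):

    context = "You are an SEO expert advising a blog writer."  # illustrative
    prompt = "What should I focus on to rank higher in search results?"

    # Version 1: context in the system message
    v1_messages = [
      {"role": "system", "content": context},
      {"role": "user", "content": prompt}
    ]

    # Version 2: blank system message, context folded into the user prompt
    v2_messages = [
      {"role": "system", "content": ""},
      {"role": "user", "content": context + " " + prompt}
    ]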

Version 1: Context in system message

System message and prompt in the PromptHub interface
Running this experiment in PromptHub

Version 2: Context in user message

System message and prompt in the PromptHub interface
Running this experiment in PromptHub

Responses

Again, I ran this experiment in PromptHub, using our comparison tools to look at the responses side-by-side.

Version 1 (context in the system message) is on the left, and Version 2 (context in the prompt) is on the right. The responses are cut off a little due to screen size constraints.

Diff checker for prompt outputs
PromptHub's A/B testing tool

Analysis:

A few things jump out to me when looking at the outputs.

  • Depth and specificity: The response from Version 1 dives deeper into each point (see line 3). Version 2 just says to go “after high traffic key words”, whereas Version 1 says to look for keywords my audience is actually searching for.
  • Disclaimers: Version 2 includes a disclaimer, probably because the context in the user message prompted the model to adopt a first-person perspective, acknowledging its nature as an AI.
  • Total number of points: Version 2 includes additional factors not in Version 1 ("Page speed" and "User experience"). It looks like having the context in the user message prompted the model to consider a broader scope.
  • Interpretation of Context: Version 2's response reads as more general SEO tips. In contrast, Version 1 focuses on blog posts specifically, suggesting the model treats context placed in the user message more literally, as part of the task itself.

These differences show how the placement of context can lead to variations in the model’s responses. Even in this small example we start to see that the system message steers the model to be much more specific. This can be useful in some cases, but not necessarily ideal in others.

Want to run your own test in PromptHub? Join the waitlist and we will do our best to get you access as soon as possible.

Experiment 3: Comparing outputs across different models

In this experiment, we will test two versions of the same prompt + system message across different models. The first version will use GPT-3.5 Turbo, while the second will use GPT-4.

GPT-3.5 Turbo is known to have limited awareness of the system message compared to GPT-4, and this experiment aims to put that to the test!
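
In code, this is the same messages list sent twice, with only the model parameter changed (the phone-review wording below stands in for the prompt shown in the screenshot):

    import openai

    messages = [
      {"role": "system", "content": "You are a technical reviewer who analyzes phone specs in depth."},  # illustrative
      {"role": "user", "content": "Review the specs of this new phone for me."}
    ]

    # Only the model parameter changes between the two versions
    for model in ("gpt-3.5-turbo", "gpt-4"):
      response = openai.ChatCompletion.create(model=model, messages=messages)
      print(model, "->", response["choices"][0]["message"]["content"])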

System Message and Prompt

System message and prompt in the PromptHub interface
Running this experiment in PromptHub

Responses

Again, I ran this experiment in PromptHub, using our comparison tools to look at the responses side-by-side.

Version 1 (GPT-3.5 Turbo) is on the left, and Version 2 (GPT-4) is on the right. The responses are cut off a little due to screen size constraints.

Diff checker for prompt outputs
PromptHub's A/B testing tool

Analysis:

A few things jump out to me when looking at the outputs.

  • Token Count: PromptHub conveniently tracks token counts at the top of each response. We can see that GPT-4 used almost 50% more tokens than GPT-3.5 Turbo (see the snippet after this list).
  • Detail: GPT-4's output provides a more detailed analysis of the specs of the phone (technical information, device comparisons, and in-depth assessments).
  • Order: GPT-4's output follows a more structured approach.
  • Language and Style: GPT-4's response demonstrates a more technical language style, reflecting the specific instructions provided in the system message. GPT-3.5 Turbo uses a more general and casual tone. This particular dimension clearly shows how the models leverage the system message differently.
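
If you're calling the API directly rather than running the comparison in PromptHub, the response's usage field reports the same token counts:

    import openai

    response = openai.ChatCompletion.create(
      model="gpt-4",
      messages=[{"role": "user", "content": "Review the specs of this new phone for me."}]  # illustrative
    )

    # usage breaks down token consumption for the call
    usage = response["usage"]
    print(usage["prompt_tokens"], usage["completion_tokens"], usage["total_tokens"])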


The system message clearly packs more of a punch when using GPT-4. The output is much more in-depth and detailed compared to GPT-3.5 Turbo.

Takeaways

  • System messages can be interlaced throughout a conversation to help avoid prompt injections and undesired outputs
  • Adding even just a little context can lead to much more specific responses
  • Context provided in a system message yields more specific results than context provided in a user message
  • Different models interpret the system message differently. This is why it is important to always test across models!
  • Testing and iterating on your prompts with various system messages is key (PromptHub is particularly helpful with this)

Dan Cleary
Founder