From large companies like Microsoft and OpenAI, to startups, more and more teams are building AI into their products. While the capabilities of AI models seem boundless, it is important to keep in mind that they are susceptible to both hallucinations and prompt hacking.

Prompt hacking can lead to all sorts of negative effects. From revealing customer info, to flying outside pre-built guardrails, and everything in between.

In what has felt like a 10-month sprint since ChatGPT was released, many teams have pushed these concerns aside as they focus on delivering product value for common use cases. After talking with dozens of teams that fall into this boat, I found myself repeating a lot of easy defenses that teams could implement to protect against 80% of hacking attempts.

Before we get into defenses, let’s start with the different types of prompt attacks and their implications.

Prompt Hacking Methods

There are three major prompt hacking methods: prompt injections, prompt leaking, and jailbreaking.

Prompt Injections

We’ve written a little bit about prompt injections before (see here), but let’s dive deeper.

A prompt injection is when a hacker weaves certain strings of text into a prompt that makes the model do something other than what it was designed to do.

One aspect of prompt injections that isn’t often written about, is what we call token burning attacks.

For example:

Let’s say there is a product called LinkedInBot. This bot is used to translate research papers and articles into LinkedIn posts. Here's a very basic prompt injection that a hacker could use:


This is a very basic example, but prompt injections can get much more intricate.

Try out this interactive prompt hacking challenge, built using a PromptHub form:

Here are a few more prompt injection examples:

  • Embed a prompt on your site, so that if a model crawls it you can override the model the initial prompt (see here).
  • Inject python code in your response to get the LLM to run arbitrary code.
  • A prompt designed to use an excess amount of tokens. We call this a token burning attack. Without the proper protections ,you are opening up your API key to be used and abused.

Prompt Leaking

Prompt leaking is a prompt hacking method where a hacker writes a prompt with the goal to unveil the underlying system message of the AI model. It’s an attempt to get the model to disclose its instructions.

Here’s an example put together by our friends at Synthminds. Let’s say a big retail company had a chatbot that interfaced with customers and gave out coupon codes based on certain information (time of year, purchase history etc).

The chatbot may have instructions that look something like this:

I was able to get the model to leak some of the information in its system prompt pretty quickly.

Let’s look at another example. Suppose a university uses an AI chatbot to help users with information related to courses. In this case the chatbot is designed to provide specific details based on the student’s unique ID. Here’s how a student could use a prompt to hack the chatbot into revealing its system instructions.

While there is no explicit damage done in this example, the student now knows further ways he or she could exploit the AI, and what they could access.

Jailbreaking

Jailbreaking refers to a specific type of prompt injection that aims to bypass safety and moderation instructions given to the LLM.

For example, if you ask ChatGPT how to rob a bank, it will probably tell you it’s not allowed to share that because bank robbery is illegal.

But, if you frame your prompt as narrative, you might have a better chance of getting a response from ChatGPT. A prompt like this is more likely to work.

Jailbreaking highlights that, despite how powerful they are, AI models can be hijacked. This emphasizes the need for ongoing testing and refinement of prompts.

The importance of testing for vulnerabilities

If the reasons aren’t already apparent from the examples above, testing for vulnerabilities can be the difference between a good user experience and something like this 👇

A headline that reads "My AI is sexually harassing me: Replika users say the chatbot has gotten way too horny
Don't let this be you

In general there are 4 major reasons why thoroughly testing prompts for vulnerabilities is important.

Data Integrity

Prompt hacking methods designed to manipulate the data sources accessible to the model pose extreme risk. Any prompt hacking that guides the model to disregard certain data or introduce false information has the potential to kick off a ripple effect that could impact many users at scale. This is along the same lines as a SQL injection attack, but is more accessible to hackers.

User Experience

A big part of user experience is grounded in user trust. AI is already a hot topic for skeptics, so users have their guards up by default.

If users can’t trust the AI you’ve integrated into your product, they’ll go elsewhere. In this case, prompt testing is not only important, but essential. In a world where there are many startups and companies building solutions, users have more than enough choices.

If a user is able to easily manipulate your AI, how could you expect them to continue to use your product?

Financial Implications

As we saw with our retail example above, a compromised AI model can lead to direct financial loss.

If a SaaS-specific support chatbot was manipulated through prompt hacking to provide discounts, the losses could be substantial.

Reputation and Brand Image

Going back to trust, no one wants to be known as the company who’s AI model goes off the rails. (see Replika screenshot above). Once that trust is broken, it’s hard to get back.

Prompt Hacking Defense Techniques

The best offense is a good defense. While prompt hacking techniques like prompt injections and prompt leaking can be effective, there are a lot of easy to implement defenses.

Filtering

Filtering is one of the easier techniques to prevent prompt hacking. It involves creating a list of words or phrases that should be blocked, also known as a blocklist.

This provides a good first line of defense. But as new malicious inputs are discovered, your blocklist will continue to grow. Keeping up can be like a game of whack-a-mole.

Instruction Defense

The instruction defense involves adding specific instructions in the system message to guide the model when handling user input.

Example of a prompt before and after implementing an instruction defense

Post-Prompting

LLMs have a tendency to follow the last instruction they hear. Post-prompting leverages this tendency, putting the model’s instructions after the user’s input.

Example of a prompt before and after implementing post-prompting defense

Random Sequence Enclosure

This technique involves enclosing the user input between two random sequences of characters. Enclosing the user input helps establish which part of the prompt is from the user.

Example of a prompt before and after implementing a random sequence enclosure defense

Sandwich Defense

This method involves sandwiching the user input between two prompts. The first prompt serves as the instruction, and the second serves to reiterate the same instruction. Additionally, it piggybacks off the model’s tendency to remember the last instruction it heard.

Example of a prompt before and after implementing a sandwich  defense

XML Defense

Similar to random sequence enclosure, wrapping user inputs in XML tags can help the model understand which part of the prompt is from the user.

Example of a prompt before and after implementing an XML defense

Separate LLM Evaluation

Another defense method is having a secondary LLM evaluate the user’s input before passing it to the main model.

A flow chart of a user input passing through an evaluator model before going to the main model, and finally back to the user with an output
Credit: Using GPT-Eliezer against ChatGPTJailbreaking

Here’s what the evaluator prompt might look like.

These defenses give the model a better chance of performing as designed. It’s important to continually revisit your prompts, as these things continue to change.

We recently launched our prompt security screener, which can act as an initial line of defense for your prompts. Simply input your prompt, and it will pinpoint vulnerabilities and recommend enhancements.

Make sure your prompts protect against the basic prompt hacking methods

The prompt screener is currently in beta. To access the screen either:

  • Fill out this form
  • Join our waitlist and reply to the welcome email
  • If you're currently a PromptHub user, reach out via Intercom.

How to stay one step ahead

Prompt hacking has emerged rapidly due to the surge in AI. While there will be new hacking methods uncovered in the future, we have the tools to combat some of the major threats today.

We can help too.

PromptHub offers a suite of tools designed to streamline prompt security and testing.

  • Leverage our prompt security screener to get personalized recommendations to strengthen your prompt
  • Use batch testing to ensure that your prompt works at scale
  • Here is an example test suite of prompt hacks that you can test your prompts against to get started

User experience and user trust go hand in hand. Without trust, you’ll be fighting an uphill battle. Don’t be like Replika, spend 20 minutes today doing some vulnerability testing to ensure you aren’t the next headline.

Dan Cleary
Founder