Prompt Injection - A Deep Dive

Prompt Injection

A Deep Dive

Andrea Hauser
by Andrea Hauser
on June 13, 2024
time to read: 8 minutes

Keypoints

How prompt injection works

  • With direct prompt injections, the user attacks the system directly and the effects are typically limited to the session of this user
  • With indirect prompt injection, a normal user is attacked by a third party and is sometimes unaware of this attack
  • Defence against prompt injection attacks is very difficult
  • The recommendation is to keep the impact of a possible prompt injection low by not giving LLM any sensitive/critical data and accesses

Generative Artificial Intelligence and, in particular, Large Language Models (LLM) are currently on everyone’s mind and companies want to participate in this development or, more generally, in the use of artificial intelligence. However, the use of AI chatbots and the like is not without its risks. The article Attack possibilities against Generative AI has already provided an overview of possible dangers when using Generative Artificial Intelligence. This article looks at the Prompt Injection attack type in detail.

Key terms and concepts

However, before continuing with examples of prompt injections, the most important terms should be defined.

A prompt is the query with which an end user can interact with a Large Language Model. This prompt is written in natural language.

In the area of Large Language Models, System Prompts are used, in which, roughly speaking, a rather generic model is made to fulfil a more specific functionality through the use of prompts. These prompts are defined by the developers. Prompts and their responses are represented in LLMs as tokens. This is typically not a complete word, but for example a partial word or a single character, but can also be larger than a single word. This representation is used because it is easier for computers to calculate with this representation. When a user makes a query, the prompt is converted into tokens and the model completes the tokens received.

Also important to understand is the concept that LLMs are stateless, so they do not cache queries and have no memory. So to simulate memory, a query flow could look like this:

From the user’s point of view:

LLM interactions form the user's point of view

From the LLM’s point of view, however, the queries look as follows:

LLM interactions from the LLM's point of view

The repeated lines are provided as Prompt Context so that the LLM can simulate a memory and, for example, can continue to address Alice as Alice later in the conversation. And it is precisely because of this context that is passed along that prompt injections work very well.

Prompt injections are prompts from an attacker that are designed in such a way that the LLM unknowingly executes the user’s instructions. These attacks can be carried out either directly or indirectly. In the case of direct prompt injections, the attack against the LLM is triggered by the user of the LLM session itself. In an indirect prompt injection, however, the LLM session of a normal user is attacked by malicious content placed by the attacker on a website or via another indirect route.

Well-known prompt injection examples

As described in the last article, there is an attack option for finding out the system prompt. This attack technique is a direct prompt injection. The GitHub repo leaked-system-prompts contains system prompts for many of the known models, including the questions that led to the discovery of the prompt.

However, indirect prompt injection attacks are more interesting, as the user is not the one carrying out the attack, but the user’s session is attacked by a third party. This can, for example, take the form of content on a website that is not visible to the user. A good example of this is the attack against Bing Chat where the attack is located in a div element with font size 0, which is not seen by the end user, but is nevertheless interpreted by Bing Chat and used as an instruction for the further behaviour of Bing Chat.

Another interesting possibility for indirect prompt injection attacks arises when users copy something from a website and paste it into an LLM query. This is because the website which is copied from can manipulate the text copied to the clipboard using simple JavaScript code. This is because there is the oncopy event that can be used to carry out any manipulations during copying. A complete attack is described in the article New prompt injection attack on ChatGPT web version. Reckless copy-pasting may lead to serious privacy issues in your chat by Roman Samoilenko.

Countermeasures

On a conceptual level, prompt injection is very similar to SQL injection. However, the similarities stop when you take a closer look. With SQL injection, a clear separation between user input and system input is possible, but this is not possible with LLM queries. The user’s tokens are placed in the context of the system prompt or have to be interpreted by the system and no instructions in the system prompt, no matter how good, can offer a 100 per cent guarantee that a user attack will not be interpreted and executed. However, the following recommendations can be applied to reduce the effects of a prompt injection:

Conclusion

The field of LLMs is constantly changing. Different types of attacks have only been properly developed for a few months/years and it is quite possible that further attack possibilities will be discovered. It is clear that direct and indirect prompt injection attacks will not be solved in the near future. Instead, one must be aware of these attacks when using LLMs and a system must be designed accordingly so that critical data is not brought into the vicinity of the LLM if the model is to be published publicly.

Links for further reading

About the Author

Andrea Hauser

Andrea Hauser graduated with a Bachelor of Science FHO in information technology at the University of Applied Sciences Rapperswil. She is focusing her offensive work on web application security testing and the realization of social engineering campaigns. Her research focus is creating and analyzing deepfakes. (ORCID 0000-0002-5161-8658)

Links

You want to evaluate or develop an AI?

Our experts will get in contact with you!

×
Ways of attacking Generative AI

Ways of attacking Generative AI

Andrea Hauser

XML Injection

XML Injection

Andrea Hauser

Burp Macros

Burp Macros

Andrea Hauser

WebSocket Fuzzing

WebSocket Fuzzing

Andrea Hauser

You want more?

Further articles available here

You need support in such a project?

Our experts will get in contact with you!

You want more?

Further articles available here