Ways of attacking Generative AI
Andrea Hauser
How prompt injection works
However, before continuing with examples of prompt injections, the most important terms should be defined.
A prompt is the query with which an end user can interact with a Large Language Model. This prompt is written in natural language.
In the area of Large Language Models, System Prompts are used, in which, roughly speaking, a rather generic model is made to fulfil a more specific functionality through the use of prompts. These prompts are defined by the developers. Prompts and their responses are represented in LLMs as tokens. This is typically not a complete word, but for example a partial word or a single character, but can also be larger than a single word. This representation is used because it is easier for computers to calculate with this representation. When a user makes a query, the prompt is converted into tokens and the model completes the tokens received.
Also important to understand is the concept that LLMs are stateless, so they do not cache queries and have no memory. So to simulate memory, a query flow could look like this:
From the user’s point of view:
From the LLM’s point of view, however, the queries look as follows:
The repeated lines are provided as Prompt Context so that the LLM can simulate a memory and, for example, can continue to address Alice as Alice later in the conversation. And it is precisely because of this context that is passed along that prompt injections work very well.
Prompt injections are prompts from an attacker that are designed in such a way that the LLM unknowingly executes the user’s instructions. These attacks can be carried out either directly or indirectly. In the case of direct prompt injections, the attack against the LLM is triggered by the user of the LLM session itself. In an indirect prompt injection, however, the LLM session of a normal user is attacked by malicious content placed by the attacker on a website or via another indirect route.
As described in the last article, there is an attack option for finding out the system prompt. This attack technique is a direct prompt injection. The GitHub repo leaked-system-prompts contains system prompts for many of the known models, including the questions that led to the discovery of the prompt.
However, indirect prompt injection attacks are more interesting, as the user is not the one carrying out the attack, but the user’s session is attacked by a third party. This can, for example, take the form of content on a website that is not visible to the user. A good example of this is the attack against Bing Chat where the attack is located in a div
element with font size 0, which is not seen by the end user, but is nevertheless interpreted by Bing Chat and used as an instruction for the further behaviour of Bing Chat.
Another interesting possibility for indirect prompt injection attacks arises when users copy something from a website and paste it into an LLM query. This is because the website which is copied from can manipulate the text copied to the clipboard using simple JavaScript code. This is because there is the oncopy
event that can be used to carry out any manipulations during copying. A complete attack is described in the article New prompt injection attack on ChatGPT web version. Reckless copy-pasting may lead to serious privacy issues in your chat by Roman Samoilenko.
On a conceptual level, prompt injection is very similar to SQL injection. However, the similarities stop when you take a closer look. With SQL injection, a clear separation between user input and system input is possible, but this is not possible with LLM queries. The user’s tokens are placed in the context of the system prompt or have to be interpreted by the system and no instructions in the system prompt, no matter how good, can offer a 100 per cent guarantee that a user attack will not be interpreted and executed. However, the following recommendations can be applied to reduce the effects of a prompt injection:
The field of LLMs is constantly changing. Different types of attacks have only been properly developed for a few months/years and it is quite possible that further attack possibilities will be discovered. It is clear that direct and indirect prompt injection attacks will not be solved in the near future. Instead, one must be aware of these attacks when using LLMs and a system must be designed accordingly so that critical data is not brought into the vicinity of the LLM if the model is to be published publicly.
Our experts will get in contact with you!
Andrea Hauser
Andrea Hauser
Andrea Hauser
Andrea Hauser
Our experts will get in contact with you!