Red Team Assessment, Your company from an opponent's perspective
Baseline Security Assessment, Attack Simulation Assessment, Red Team Assessment, Purple Team Assessment. Our Red Team is your partner of choice.

AI brings a whole new security paradigm
As a security analyst, I work on classic application security projects, where behaviours are reproducible and systems can be tested against known and common vulnerabilities. I realised how immature AI’s security appears as opposed to its rapid worldwide adoption: The unpredictability of AI system’s outputs and behaviours, their much more complex attack surface, the AI specific risks (data poisoning, model inversion, membership inference, hallucinations, etc), and the overlap with ethics and trust. There is still lots to discover in this complex and quickly evolving topic.
What made me even more interested, is how much of a new perspective AI security brings. It is a stimulating exercise to shift one’s mindset and grasp how aiming to secure AI systems requires a fundamentally different mindset from traditional IT security and penetration testing. AI brings a lot of helpful applications as well as new security challenges simply inherent to what it is.
Large Language Models (LLMs) are now a widely used type of AI systems. One can get a first taste of their security challenges by playing with Gandalf’s chatbot (by Lakera) and trying to social engineer the LLM at every level to have it reveal the password. It shows a singular adversarial approach to misusing this kind of systems. Diving into how each level implements a new protection mechanism to try and mitigate information leakage reveals how challenging it can be to harden such a system.
In an LLM-based application, an input, e.g. a piece of text, is prompted to the LLM that will numerically process it by converting it into vectors that capture the meaning and context (semantic) of the text input. The language model will provide a probabilistically and statistically coherent output, based on the patterns it learnt during its training on loads of data. For example, an AI-powered assistant integrated in a company’s workflow will be given access to work data so it can provide helpful answers in the company’s context. Another example: In a job application process that uses LLMs to process the candidate’s resume, it will be given data and instructions to help it provide positive or negative feedback on the application. In these new types of applications, LLMs will be given access to the environment’s data and functionalities that will help them provide a meaningful output in a specific context. This painting of LLMs, although simplified, should help us illustrate how their integration complexifies the attack surface and brings a new paradigm with their semantical interpretation data, compared to what could be expected when feeding an input to a deterministic defined function.
I experimented by implementing a chatbot (inspired by Lakera’s Gandalf), where I fed the LLM with a secret password, instructing it not to reveal it. Then, as a user, I could interact with the LLM-assistant in natural language, trying to manipulate the model into leaking the secret password, bypassing the security measures around the LLM. Adding guardrails around the LLM reduces straightforward manipulations and expected misbehaviours, to some extent. But there is always a way through. The more you understand how this specific model and its integration in the application work, the better you can craft prompts that will trigger sensitive data leakage. Just like you can find the right words to convince someone you are trustworthy – except here the LLM treats words’ meaning like numerical vectors. The LLM will provide a coherent, indeterministic answer focused on data semantic without actually understanding anything about it – because it is a computer system working with numbers, with no human sense of meaning. Even when I used another LLM as a guard to filter inputs and outputs to and from the other LLM knowing the secret, it was hard to find the right wording to instruct the guard LLM to systematically block any response that revealed potential information about the secret password. To give a concrete and simple example, if the secret is the name of a public person, you can manipulate the LLM over several interactions into revealing some traits and other elements that will help you make a guess on that person’s identity.
This manipulation of an LLM into behaviours and outputs that were not intended is the first major vulnerability of the OWASP Top 10 for LLM Applications: Prompt Injection. This Top 10 is seen as a good start for reviewing an LLM application’s architecture and uncovering vulnerable areas. However, in his blog post, Devansh reminds us it is not a once-per-year task, and other aspects of the AI threat landscape are left to be considered. The OWASP Top 10 for Agentic Applications is also a great pointer to important vulnerabilities in complex AI systems, where AI agents are used to automate LLMs’ (or other AI system) tasks without human supervision.
In a time-limited penetration test, with little to no knowledge of the model’s functioning and integration in the surrounding system, an ethical hacker can only scratch the surface of potential injections, leaving most vulnerabilities in the shade. This is even more true as AI systems are quickly evolving, constantly introducing new vulnerabilities. This article from snyk enumerates how rich the landscape and surface of prompt injections can be. As you would rather discover these vulnerabilities before an attacker does, a more in-depth and long-term approach is needed.
In traditional IT systems, behaviours are reproducible and the set of potential outputs is finite. You can use filters and input validation based on format, syntax and a finite set of expectations to mitigate malicious user’s input injections into a website, for example. With an LLM, the set of possible answers to a single prompt is infinite and effectively filtering out based on meaning is much more difficult and uncertain. In other words, the attack surface is made much bigger by adding an LLM into an application. And this attack surface increases exponentially as you consider the different affected layers of your AI-powered system, as Devansh details clearly in his blog on AI Pentest Scoping. The type of models you are using, whether you are adding fine-tuning tools and other AI-specific techniques and plugins, as well as where these and the data they use come from, add new potential vulnerabilities. All interactions of the LLM, whether internal or external to your system, add new attack vectors as well. If you complexify your system using autonomous agents to orchestrate your AI tasks, it opens up a whole range of other attack vectors.
The threat landscape of AI-powered systems is just a whole other level that cannot be fully covered by the IT security frameworks we have been using until now. There are continuous incidents and exploits, important privacy issues and content safety risks, not mentioning the whole space of adversarial machine learning – leaking data that was used during training or interactions, introduction of biases, hallucinations, etc..
A telling case is that of Antigravity’s AI-based coding software, that accidentally wiped out a user’s D: drive without having been given the permission, as the scope of actions was unfortunately badly interpreted by the AI agents.
Similarly, the Skills feature in Claude, that can be used by users to add custom code modules into the LLM, was shown to enable malicious users to deploy malwares with little effort: They simply add a short and seemingly harmless piece of code in a Skill (usable by any Claude user) that will quietly download and execute external code (e.g. malware) without triggering any warnings from Claude’s security guards.
These cases show how scope can quickly change, moving from local application vulnerabilities to potentially compromising an entire system.
There is no easy solution readily available and different standards, guides and laws are emerging at various levels, may it be from private companies or governmental organisations. Thus, there doesn’t seem to be a single global standard when it comes to securing AI systems, leaving practitioners with no clear path to assess and improve the security of their AI systems.
However, as pointed out in this article on LLM Security Frameworks, several initiatives already came up with helpful guidelines in this direction.
For instance, the EU AI Act brings a first ambitious and comprehensive legal framework for AI. It introduces a pyramid of four levels of societal risk associated with a set of rules and penalties. It is a great step towards raising awareness and regulating AI systems.
In addition, the NIST AI Risk Management Framework aims to provide a structured approach to make AI systems trustworthy. It follows four iterative core functions: Govern, Map, Measure and Manage. It helps tackling AI risks on a systematic and managerial level.
Finally, there is the Donut of Defense (also written doughnut) introduced by IBM in this explanatory video. It suggests a secure-by-design architecture forming a defensive ring around the AI. It consists in four pillars to be applied cyclically: Discover, Assess, Control and Report. It provides a good framework to protect and monitor AI systems on an operational level.
These three works together already deliver a legal setting, a process and a technical implementation towards controlling the security of an AI system.
Assessing the security posture of AI systems is a complex task and definitely not one fitting a one-time short penetration test. AI systems, by nature, are dynamic, keep on learning from data they are fed with, and their quickly evolving landscape continuously complexifies threat models. A continuous approach, with a red team finding out new vulnerabilities and a blue team responding reactively to it with adaptive countermeasures seems the most adequate way to handle any AI systems’ security.
For whoever likes to solve complex problems, this field will surely bring never-ending exciting challenges. For whoever wants to venture in the world of AI, it cannot be without considering and bringing awareness and concrete actions to the security issues coming with it.
AI introduces a shift in how we think about security and integrating LLMs into applications and workflows should be considered an ambitious security undertaking. These new systems require continuous monitoring, testing and adaption at every level and every stage of their lifecycle.
With great innovation comes great responsibility.
Our experts will get in contact with you!

Baseline Security Assessment, Attack Simulation Assessment, Red Team Assessment, Purple Team Assessment. Our Red Team is your partner of choice.

Lucie Hoffmann
Our experts will get in contact with you!