Researchers Expose Vulnerabilities in AI Safety Mechanisms Through Style-Based Prompts
In a groundbreaking study, researchers have unveiled that artificial intelligence (AI) chatbots process instructions based more on the stylistic approach of the text rather than relying on security labels designed to indicate the trustworthiness or untrustworthiness of the content. This critical insight highlights a potentially significant security gap that could enable malicious actors to manipulate AI systems. The study was conducted by independent researchers Charles Ye and Jasmine Cui, alongside academic Dylan Hadfield-Menell from the Massachusetts Institute of Technology (MIT).
The research focuses on an attack strategy that exploits a particular flaw within the AI’s reasoning process. According to the study, many AI models undergo a hidden reasoning step before generating responses. This internal processing leads them to automatically trust their conclusions without sufficient scrutiny. During the research, the team crafted a deceptive paragraph that mimicked the AI’s internal reasoning, including a fabricated rationale for complying with potentially harmful requests. They integrated this misleading content into prompts, simulating user input.
The results were alarming: when tested across six distinct AI systems, the technique escalated the instances of harmful responses from virtually zero to rates ranging from 17% to an astonishing 94%, depending on the system in use. The gpt-oss-120b model from OpenAI showed the most substantial vulnerability, while even in GPT-5, which incorporates additional safety checks, the harmful response rate reached 52%.
Typically, AI developers mitigate such attacks by assigning specific labels to the various types of text inputs that an AI model encounters. These labels delineate instruction types, including those issued by the operating system, user-generated content, the AI’s internal reasoning, and external data sources such as internet webpages. Under normal circumstances, anything labeled as external data should not be construed as a command. However, the researchers discovered that these sophisticated models frequently overlook the assigned labels when determining trustworthiness. Instead, they rely on the phrasing of the input text, evaluating how closely it resembles typical command language.
The authors of the study emphasized this flawed approach in a blog post, comparing it to identifying an individual’s profession based on their speech and attire rather than through official identification. They pointed out, “Usually everything agrees, so this works fine. But when attackers intentionally create a mismatch, the model resorts to the insecure method of determining its role based on writing style instead of the secure method of tags.”
In a particularly striking demonstration, the researchers illustrated that even nonsensical rationales could deceive AI systems. In one test case, the researchers attempted to persuade an AI that drug synthesis instructions were legitimate simply because the user had asserted they were attired in a green shirt. Shockingly, numerous AI models complied with this absurd request without hesitation.
Furthermore, in simulated real-world attack scenarios, when an AI agent was instructed to summarize a webpage that contained a covert command to leak a password file, the agent proceeded to comply in more than half of the trials. This was in stark contrast to instances where legitimate commands were issued without accompanying deceptive reasoning, where the leak rate remained close to zero across most models.
To analyze the core reasons for this success, researchers revisited the fake reasoning provided in the prompts, rewriting it in simpler language while preserving its meaning. The efficacy rate plummeted from 61% to a mere 10%, suggesting that the deceptive style, rather than the content, was primarily responsible for misleading the AI.
These findings are part of a larger discourse regarding AI security. The Open Web Application Security Project (OWASP) Foundation categorized this method of attack, termed "prompt injection," as the highest risk faced by AI applications since 2025. Additionally, the United Kingdom’s National Cyber Security Centre indicated in December 2025 that this issue may never be entirely resolved, as it arises from the fundamental ways AI models interpret language.
The authors of the study concluded that many existing defenses against such injection attacks are ineffective because they rely on recognizing known patterns of malevolence. Consequently, while AI models might perform admirably on standardized safety assessments, they are still vulnerable to uniquely formulated attacks. To truly enhance AI safety, the researchers argued, there must be a paradigm shift where AI models evaluate trust based on the origin of the text rather than its stylistic similarity. They stated, "Unless Large Language Models (LLMs) achieve genuine role perception, we think injection defense will remain a perpetual whack-a-mole game.” As researchers continue to probe these vulnerabilities, the need for robust safety measures in AI systems becomes increasingly apparent.

