A new jailbreak technique targeting large language models (LLMs), including OpenAI, has been discovered by researchers at Palo Alto Networks’ Unit 42. This technique, known as the Bad Likert Judge attack, exploits the LLM’s ability to generate responses based on a given Likert scale, ultimately resulting in the production of harmful content. The researchers found that implementing this method can significantly increase the success rate of attacks compared to traditional approaches.
The Bad Likert Judge attack involves requesting the LLM to evaluate responses on a Likert scale, a psychometric scale commonly used to measure agreement or disagreement with a statement. By providing guidelines for scoring tasks and asking the LLM to generate responses aligned with different scales, attackers can manipulate the system to generate harmful content. Tests conducted on various state-of-the-art text-generation LLMs showed a substantial increase in the attack success rate, highlighting the potential risks associated with this technique.
The categories of attacks evaluated in the research cover a wide range of inappropriate content, including those promoting hate speech, harassment, self-harm, explicit material, illegal activities, and more. The researchers observed that the jailbreak technique could also increase the likelihood of success in generating malware or leaking sensitive information from the system prompts.
This rise in jailbreak techniques targeting LLMs comes as these models are increasingly being used for personal, research, and business purposes. Researchers have identified several other jailbreak methods, such as persona persuasion, role-playing techniques like Do Anything Now, and token smuggling using encoded words. These approaches aim to bypass the guardrails put in place by LLM creators to prevent the generation of harmful or biased content.
Despite the prevalence of jailbreak techniques, most AI models are considered safe and secure when used responsibly. Security researchers suggest implementing content-filtering systems alongside LLMs to mitigate the risks associated with jailbreaks. These systems analyze both the input prompts and output responses of the models to detect potentially harmful content and reduce the likelihood of successful attacks.
In conclusion, the discovery of the Bad Likert Judge attack underscores the vulnerabilities of LLMs to manipulation by malicious actors. By understanding the methods used to exploit these models, developers and organizations can take proactive measures to enhance the security and integrity of their AI systems. The implementation of robust content-filtering systems is crucial in safeguarding against jailbreak techniques and ensuring the responsible use of large language models in various applications.

