Cybersecurity researchers have shed light on a new jailbreak technique that can be used to bypass large language model (LLM) fences and generate potentially harmful or malicious responses.
The strategy of a multi-path attack (aka multiple) has received a code name Bad Judge Likert Palo Alto Networks Unit 42 researchers Yunzhe Huang, Yang Ji, Wenjun Hu, Jay Chen, Akshata Rao, and Danny Tsechansky.
“The method requires the target LLM to act as a judge, assessing the harmfulness of a given response using Likert scalerating scale that measures the respondent’s agreement or disagreement with a statement,” Unit 42 team said.
“He then asks the LLM to create answers containing examples that fit the scale. The example that has the highest Likert scale has the potential to contain harmful content.”
The explosion in popularity of artificial intelligence in recent years has also given rise to a new class of security exploits called quick injection which is specifically designed to invoke a machine learning model ignore his alleged behavior conveying specially crafted instructions (such as prompts).
One type of instant injection is the attack method multiple escapes from prisonwhich uses LLM’s long context window and attention to creating a series of cues that gradually prompt the LLM to generate a malicious response without triggering its internal defenses. Some examples of this technique include Crescendo and Deceptive Delight.
A final approach, demonstrated by block 42, involves using the LLM as a judge to rate the harmfulness of a given response using a psychometric Likert scale, and then asking the model to provide different responses corresponding to different scores.
Tests conducted across a wide range of categories against six state-of-the-art text generation LLMs from Amazon Web Services, Google, Meta, Microsoft, OpenAI, and NVIDIA have shown that this technique can increase the attack success rate (ASR). by more than 60% compared to normal attack tips on average.
These categories include hate, stalking, self-harm, sexual content, indiscriminate weapons, illegal activity, malware creation, and fast-tracking.
“Using LLM’s understanding of malicious content and its ability to evaluate responses, this technique can significantly increase the chances of successfully bypassing the model’s security fences,” the researchers said.
“The results show that content filters can reduce ASR by an average of 89.2 percentage points for all models tested. This points to the critical role of implementing comprehensive content filtering as a best practice when deploying LLM in real-world applications.”
The development comes days after a report by The Guardian revealed that OpenAI ChatGPT search tool can be tricked into producing completely misleading summaries by asking it to summarize web pages that contain hidden content.
“These techniques can be used maliciously, for example, to force ChatGPT to return a positive rating for a product despite negative reviews on the same page,” the British newspaper wrote. said.
“The simple inclusion of hidden text by third parties without instructions can also be used to ensure a positive review, with one test involving extremely positive false reviews that affected the resumes returned by ChatGPT.”