Indo Guard Online

New AI jailbreak method “Bad Likert Judge” increases success rate of attacks by more than 60%



January 3, 2025Ravi LakshmananMachine Learning / Vulnerability

AI Prison Break

Cybersecurity researchers have shed light on a new jailbreak technique that can be used to bypass a large language model's (LLM) guardrails and generate potentially harmful or malicious responses.

The multi-turn (aka many-shot) attack strategy has been codenamed Bad Likert Judge by Palo Alto Networks Unit 42 researchers Yunzhe Huang, Yang Ji, Wenjun Hu, Jay Chen, Akshata Rao, and Danny Tsechansky.

“The technique asks the target LLM to act as a judge, scoring the harmfulness of a given response using the Likert scale, a rating scale that measures a respondent’s agreement or disagreement with a statement,” the Unit 42 team said.


“It then asks the LLM to generate responses that contain examples that align with the scales. The example with the highest Likert rating can potentially contain the harmful content.”

The explosion in popularity of artificial intelligence in recent years has also given rise to a new class of security exploits called prompt injection, which is expressly designed to cause a machine learning model to ignore its intended behavior by passing specially crafted instructions (i.e., prompts).

One specific type of prompt injection is an attack method dubbed many-shot jailbreaking, which exploits the LLM's long context window and attention to craft a series of prompts that gradually nudge the model into generating a malicious response without triggering its internal protections. Some examples of this technique include Crescendo and Deceptive Delight.

The latest approach, demonstrated by Unit 42, involves using the LLM as a judge to score the harmfulness of a given response on a psychometric Likert scale, and then asking the model to generate different responses corresponding to the various scores.

Tests conducted across a wide range of categories against six state-of-the-art text-generation LLMs from Amazon Web Services, Google, Meta, Microsoft, OpenAI, and NVIDIA showed that the technique can increase the attack success rate (ASR) by more than 60% on average compared with plain attack prompts.
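To make the ASR metric concrete, the uplift is simply the difference between two success fractions, expressed in percentage points. The trial counts below are invented for illustration only; they are not Unit 42's measurements:

```python
# Illustrative-only sketch of the attack success rate (ASR) metric.
# The numbers are made up and do not reproduce Unit 42's results.

def asr(successes: int, attempts: int) -> float:
    """Attack success rate: fraction of attack attempts that succeed."""
    return successes / attempts

baseline = asr(12, 100)    # hypothetical: plain attack prompts
technique = asr(75, 100)   # hypothetical: multi-turn judge-style prompts

# Uplift expressed in percentage points, as in the reported figures.
uplift_pp = (technique - baseline) * 100
```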

These categories include hate, harassment, self-harm, sexual content, indiscriminate weapons, illegal activities, malware generation, and system prompt leakage.

“By leveraging the LLM's understanding of harmful content and its ability to evaluate responses, this technique can significantly increase the chances of successfully bypassing the model's safety guardrails,” the researchers said.

“The results show that content filters can reduce the ASR by an average of 89.2 percentage points across all tested models. This indicates the critical role of implementing comprehensive content filtering as a best practice when deploying LLMs in real-world applications.”
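The filtering pattern the researchers recommend can be sketched as a thin wrapper that screens both the prompt and the model's raw output, so a multi-turn jailbreak that slips past input checks is still caught on the way out. Here `moderate()` is a hypothetical placeholder for a real moderation model, and the keyword list is purely illustrative:

```python
# Minimal sketch of output-side content filtering around an LLM call.
# `moderate()` stands in for a real moderation classifier or API;
# the keyword list below is a toy placeholder, not a real filter.

BLOCKED_MESSAGE = "Response withheld by content filter."

def moderate(text: str) -> bool:
    """Return True if the text is considered safe to pass through.
    A production deployment would call a dedicated moderation model
    here rather than match keywords."""
    banned = ("how to build a weapon", "malware source")
    lowered = text.lower()
    return not any(term in lowered for term in banned)

def guarded_reply(generate, prompt: str) -> str:
    """Filter both directions: reject unsafe prompts up front, and
    re-check the generated output before returning it."""
    if not moderate(prompt):
        return BLOCKED_MESSAGE
    raw = generate(prompt)
    return raw if moderate(raw) else BLOCKED_MESSAGE
```

Checking the output, not just the prompt, matters here because multi-turn techniques like Bad Likert Judge are designed so that no single input message looks overtly harmful.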


The development comes days after a report from The Guardian revealed that OpenAI's ChatGPT search tool could be tricked into producing completely misleading summaries by asking it to summarize web pages that contain hidden content.

“These techniques can be used maliciously, for example, to cause ChatGPT to return a positive assessment of a product despite negative reviews on the same page,” the British newspaper said.

“The simple inclusion of hidden text by third parties without instructions can also be used to ensure a positive assessment, with one test involving extremely positive fake reviews that influenced the summaries returned by ChatGPT.”
