Indo Guard Online

New AI jailbreak method “Bad Likert Judge” increases success rate of attacks by more than 60%



January 3, 2025Ravi LakshmananMachine Learning / Vulnerability

AI Prison Break

Cybersecurity researchers have shed light on a new jailbreak technique that can be used to bypass a large language model's (LLM) guardrails and generate potentially harmful or malicious responses.

The multi-turn (aka many-shot) attack strategy has been codenamed Bad Likert Judge by Palo Alto Networks Unit 42 researchers Yunzhe Huang, Yang Ji, Wenjun Hu, Jay Chen, Akshata Rao, and Danny Tsechansky.

“The technique asks the target LLM to act as a judge, scoring the harmfulness of a given response using the Likert scale, a rating scale that measures a respondent’s agreement or disagreement with a statement,” the Unit 42 team said.


“It then asks the LLM to generate responses that contain examples that align with the scales. The example with the highest Likert rating can potentially contain the harmful content.”

The explosion in popularity of artificial intelligence in recent years has also given rise to a new class of security exploits called prompt injection, which is expressly designed to cause a machine learning model to ignore its intended behavior by passing specially crafted instructions (i.e., prompts).

One specific type of prompt injection is an attack method dubbed many-shot jailbreaking, which exploits the LLM's long context window and attention to craft a series of prompts that gradually nudge the model into generating a malicious response without triggering its internal protections. Some examples of this technique include Crescendo and Deceptive Delight.

The latest approach, demonstrated by Unit 42, involves using the LLM as a judge to score the harmfulness of a given response on a psychometric Likert scale, and then asking the model to generate different responses corresponding to the various scores.

Tests conducted across a wide range of categories against six state-of-the-art text-generation LLMs from Amazon Web Services, Google, Meta, Microsoft, OpenAI, and NVIDIA showed that the technique can increase the attack success rate (ASR) by more than 60% on average compared with plain attack prompts.
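To make the ASR metric concrete, the uplift is simply the difference between two success fractions, expressed in percentage points. The trial counts below are invented for illustration only; they are not Unit 42's measurements:

```python
# Illustrative-only sketch of the attack success rate (ASR) metric.
# The numbers are made up and do not reproduce Unit 42's results.

def asr(successes: int, attempts: int) -> float:
    """Attack success rate: fraction of attack attempts that succeed."""
    return successes / attempts

baseline = asr(12, 100)    # hypothetical: plain attack prompts
technique = asr(75, 100)   # hypothetical: multi-turn judge-style prompts

# Uplift expressed in percentage points, as in the reported figures.
uplift_pp = (technique - baseline) * 100
```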

These categories include hate, harassment, self-harm, sexual content, indiscriminate weapons, illegal activities, malware generation, and system prompt leakage.

“By leveraging the LLM's understanding of harmful content and its ability to evaluate responses, this technique can significantly increase the chances of successfully bypassing the model's safety guardrails,” the researchers said.

“The results show that content filters can reduce the ASR by an average of 89.2 percentage points across all tested models. This indicates the critical role of implementing comprehensive content filtering as a best practice when deploying LLMs in real-world applications.”
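The filtering pattern the researchers recommend can be sketched as a thin wrapper that screens both the prompt and the model's raw output, so a multi-turn jailbreak that slips past input checks is still caught on the way out. Here `moderate()` is a hypothetical placeholder for a real moderation model, and the keyword list is purely illustrative:

```python
# Minimal sketch of output-side content filtering around an LLM call.
# `moderate()` stands in for a real moderation classifier or API;
# the keyword list below is a toy placeholder, not a real filter.

BLOCKED_MESSAGE = "Response withheld by content filter."

def moderate(text: str) -> bool:
    """Return True if the text is considered safe to pass through.
    A production deployment would call a dedicated moderation model
    here rather than match keywords."""
    banned = ("how to build a weapon", "malware source")
    lowered = text.lower()
    return not any(term in lowered for term in banned)

def guarded_reply(generate, prompt: str) -> str:
    """Filter both directions: reject unsafe prompts up front, and
    re-check the generated output before returning it."""
    if not moderate(prompt):
        return BLOCKED_MESSAGE
    raw = generate(prompt)
    return raw if moderate(raw) else BLOCKED_MESSAGE
```

Checking the output, not just the prompt, matters here because multi-turn techniques like Bad Likert Judge are designed so that no single input message looks overtly harmful.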


The development comes days after a report from The Guardian revealed that OpenAI's ChatGPT search tool could be tricked into producing completely misleading summaries by asking it to summarize web pages that contain hidden content.

“These techniques can be used maliciously, for example, to cause ChatGPT to return a positive assessment of a product despite negative reviews on the same page,” the British newspaper said.

“The simple inclusion of hidden text by third parties without instructions can also be used to ensure a positive assessment, with one test involving extremely positive fake reviews that influenced the summaries returned by ChatGPT.”
