Researchers have uncovered a “deceptive fascination” method for hacking artificial intelligence models

October 23, 2024Ravi LakshmananArtificial Intelligence / Vulnerability


Cybersecurity researchers have shed light on a new adversarial technique that can be used to jailbreak large language models (LLMs) during an interactive conversation by sneaking an undesirable instruction in between benign ones.

Codenamed Deceptive Delight, the technique has been described by Palo Alto Networks Unit 42 as both simple and effective, achieving an average attack success rate (ASR) of 64.6% within three interaction turns.

“Deceptive Delight is a multi-turn technique that engages large language models (LLMs) in an interactive conversation, gradually bypassing their safety guardrails and eliciting them to generate unsafe or harmful content,” said Unit 42’s Jay Chen and Royce Lu.

It’s also slightly different from multi-turn jailbreak (aka many-shot jailbreak) methods like Crescendo, in which unsafe or restricted topics are sandwiched between innocuous instructions, as opposed to gradually leading the model toward producing harmful output.

Recent research has also delved into the so-called Context Fusion Attack (CFA), a black-box jailbreak technique capable of bypassing an LLM’s safety guardrails.


“This method involves filtering and extracting key terms from the target, constructing contextual scenarios around those terms, dynamically integrating the target into the scenarios, replacing malicious key terms within the target, and thereby concealing the direct malicious intent,” a team of researchers from Xidian University and the 360 AI Security Lab said in a paper published in August 2024.

Deceptive Delight is designed to take advantage of an LLM’s inherent weaknesses by manipulating the context across two conversational turns, thereby tricking it into inadvertently generating unsafe content. Adding a third turn has the effect of increasing the severity and detail of the harmful output.

This involves exploiting a model’s limited attention span, which refers to its capacity to process and retain contextual awareness as it generates responses.

“When LLMs are faced with prompts that blend harmless content with potentially dangerous or harmful material, their limited attention span makes it difficult to consistently evaluate the entire context,” the researchers explained.

“In complex or lengthy passages, the model may prioritize the benign aspects while glossing over or misinterpreting the unsafe ones. This mirrors how a person might miss important but subtle warnings in a detailed report when their attention is divided.”

Unit 42 said it tested eight AI models using 40 unsafe topics across six broad categories, such as hate, harassment, self-harm, sexual, violence, and dangerous, finding that unsafe topics in the violence category tended to have the highest ASR across most models.

On top of that, the average Harm Score (HS) and Quality Score (QS) increased by 21% and 33%, respectively, from the second to the third turn, with the third turn also achieving the highest ASR of all turns.
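As an illustration of how such metrics are typically aggregated (a minimal sketch, not Unit 42’s actual scoring pipeline; the record format, score ranges, and sample numbers are assumptions), per-turn ASR, HS, and QS can be computed from judge annotations like this:

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class TurnResult:
    """One judged model response (hypothetical record format)."""
    turn: int             # conversation turn number (1, 2, 3, ...)
    harm_score: float     # judge-assigned harmfulness, e.g. on a 0-5 scale
    quality_score: float  # judge-assigned detail/relevance, e.g. on a 0-5 scale
    jailbroken: bool      # did the judge flag the response as unsafe?

def per_turn_metrics(results: list[TurnResult]) -> dict[int, dict[str, float]]:
    """Aggregate ASR, mean HS, and mean QS for each conversation turn."""
    metrics: dict[int, dict[str, float]] = {}
    for turn in sorted({r.turn for r in results}):
        rows = [r for r in results if r.turn == turn]
        metrics[turn] = {
            "ASR": sum(r.jailbroken for r in rows) / len(rows),
            "HS": mean(r.harm_score for r in rows),
            "QS": mean(r.quality_score for r in rows),
        }
    return metrics

# Made-up example: two judged attempts at turn 2 and two at turn 3.
sample = [
    TurnResult(2, 2.0, 2.5, False), TurnResult(2, 3.0, 3.0, True),
    TurnResult(3, 3.5, 3.5, True),  TurnResult(3, 4.0, 4.5, True),
]
print(per_turn_metrics(sample))
```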

To counter the risk posed by Deceptive Delight, it is recommended to adopt a robust content filtering strategy, use prompt engineering to enhance the resilience of LLMs, and explicitly define the acceptable range of inputs and outputs.
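As a rough sketch of what such a layered defense could look like in practice (the generate, is_unsafe, and topic_of callables and the allowed-topic list are placeholder assumptions, not Unit 42’s implementation), a wrapper can validate the request scope, screen the prompt, and screen the completion before anything is returned:

```python
from typing import Callable

# Example of an explicitly defined acceptable input scope (placeholder values).
ALLOWED_TOPICS = {"account security", "password hygiene", "phishing awareness"}

def guarded_completion(
    prompt: str,
    generate: Callable[[str], str],    # the underlying LLM call (assumed interface)
    is_unsafe: Callable[[str], bool],  # content-filter classifier (assumed interface)
    topic_of: Callable[[str], str],    # topic classifier (assumed interface)
) -> str:
    """Layered checks: restrict input scope, filter the prompt, then filter the output."""
    if topic_of(prompt) not in ALLOWED_TOPICS:
        return "Request refused: topic is outside the defined scope."
    if is_unsafe(prompt):
        return "Request refused: prompt was flagged by the input filter."
    reply = generate(prompt)
    if is_unsafe(reply):
        return "Response withheld: output was flagged by the content filter."
    return reply
```

The point of layering is that a multi-turn prompt that slips past the input filter can still be caught at the output stage, which is where a technique like Deceptive Delight ultimately has to surface harmful text.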

“These findings should not be seen as evidence that artificial intelligence is inherently insecure or unsafe,” the researchers said. “Rather, they emphasize the need for multi-layered defense strategies to mitigate jailbreak risks while preserving the utility and flexibility of these models.”


It is unlikely that LLMs will ever be completely immune to jailbreaks and hallucinations, especially as new studies have shown that generative AI models are susceptible to a form of “package confusion” in which they may recommend non-existent packages to developers.

This could have the unfortunate side effect of fueling software supply chain attacks, in which attackers create the hallucinated packages, seed them with malware, and push them to open-source repositories.

“The average percentage of hallucinated packages is at least 5.2% for commercial models and 21.7% for open-source models, including a staggering 205,474 unique examples of hallucinated package names, further underscoring the severity and prevalence of this threat,” the researchers said.
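On the defensive side, one practical safeguard against hallucinated dependencies is to verify that every package name an AI assistant suggests actually resolves in the package registry before installing it. Below is a minimal sketch for PyPI using its public JSON endpoint; the suggested package names are hypothetical:

```python
import urllib.error
import urllib.request

def exists_on_pypi(package: str) -> bool:
    """Return True if the package name resolves on PyPI's public JSON API."""
    url = f"https://pypi.org/pypi/{package}/json"
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status == 200
    except urllib.error.HTTPError:
        return False  # 404 means the name does not exist (possibly hallucinated)

# Hypothetical names suggested by a coding assistant; install only what resolves.
suggested = ["requests", "definitely-not-a-real-pkg-42"]
for name in suggested:
    status = "ok" if exists_on_pypi(name) else "NOT FOUND (possible hallucination)"
    print(f"{name}: {status}")
```

Note that a name resolving is necessary but not sufficient: as the supply chain scenario above shows, attackers may register a commonly hallucinated name themselves, so version pinning and dependency review still apply.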
