Cybersecurity researchers have shed light on a new adversarial technique that can be used to crack large language models (LLMs) during interactive conversation by injecting unwanted instructions between benign ones.
Codenamed Deceptive Delight, Palo Alto Networks Unit 42 described it as simple and effective, achieving an average attack success rate (ASR) of 64.6% over three rounds of engagement.
“Deceptive Delight is a multi-turn technique that engages large language models (LLMs) in an interactive conversation, gradually bypassing their protective fences and forcing them to create dangerous or harmful content,” said Unit 42’s Jay Chen and Royce Lu.
It’s also slightly different from multiple jailbreak (aka multiple jailbreak) methods like Crescendowhere dangerous or limited topics are sandwiched between innocuous instructions, as opposed to gradually leading the model to produce harmful results.
Recent research has also delved into the so-called Context Fusion Attack (CFA), a black-box jailbreak technique capable of bypassing LLM’s security network.
“This method involves filtering and extracting key terms from the target, building contextual scripts around those terms, dynamically integrating the target into the script, replacing malicious key terms into the target, and thus concealing the direct malicious intent,” the team of researchers. from Xidian University and 360 AI Security Lab said in an article published in August 2024.
Deceptive Delight is designed to take advantage of LLM’s inherent flaws by manipulating the context of the two conversations, thereby tricking it into inadvertently revealing dangerous content. Adding a third move has the effect of increasing the severity and detail of the damaging output.
This involves using a model’s limited attention span, which refers to its ability to process and maintain contextual awareness as it generates responses.
“When undergraduates are faced with prompts that combine innocuous content with potentially dangerous or harmful material, their limited attention span makes it difficult to consistently evaluate the entire context,” the researchers explained.
“In complex or long passages, the model may favor the benign aspects while glossing over or misinterpreting the dangerous ones. This reflects how a person can miss important but subtle warnings in a detailed report when their attention is divided.’
Squad 42 said this tested eight AI models using 40 dangerous topics in six broad categories such as hate, stalking, self-harm, sex, violence, and danger, finding that dangerous topics in the violence category tended to have the highest ASR in most models .
To top it all off, the average Harm Score (HS) and Quality Score (QS) increased by 21% and 33% respectively from the second to the third run, with the third run also achieving the highest ASR of all models.
To reduce the risk that Deceptive Delight carries, it is recommended to use durable content filtering strategyuse operational techniques to enhance LLM sustainability and clearly define an acceptable range of inputs and outputs.
“These findings should not be taken as evidence that artificial intelligence is inherently dangerous or dangerous,” the researchers said. “Rather, they highlight the need for multi-layered defense strategies to reduce the risk of prison escapes while maintaining the utility and flexibility of these models.”
It is unlikely that LLM will ever be completely immune to jailbreaks and hallucinations, as new research has shown that generative AI models are susceptible to a form of “packet confusion” where they could recommend non-existent packages developers.
This can have the unfortunate side effect of fueling attacks on software supply chains, where attackers create hallucinatory packages, fill them with malware, and send them to open source repositories.
“The average percentage of hallucinating packages is at least 5.2% for commercial models and 21.7% for open source models, including a staggering 205,474 unique examples of hallucinating package names, further highlighting the severity and prevalence of this threats”, – researchers. said.