Cybersecurity researchers have discovered a novel attack technique called TokenBreak that can be used to bypass a large language model's (LLM) safety and content moderation guardrails with just a single character change.
"The TokenBreak attack targets a text classification model's tokenization strategy to induce false negatives, leaving end targets vulnerable to attacks that the implemented protection model was put in place to prevent," Kieran Evans, Kasimir Schulz, and Kenneth Yeung said in a report shared with The Hacker News.
Tokenization is a fundamental step that LLMs use to break down raw text into its atomic units – i.e., tokens – which are common sequences of characters found in a set of text. To that end, the text input is converted into its numerical representation and fed to the model.

LLMs work by understanding the statistical relationships between these tokens and produce the next token in a sequence of tokens. The output tokens are detokenized into human-readable text by mapping them to their corresponding words using the tokenizer's vocabulary.
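A minimal sketch of this round trip, using the Hugging Face transformers library (the "gpt2" tokenizer is an illustrative choice, not one named in the report), shows how text becomes tokens, then token IDs, and is decoded back:

```python
# Illustrative sketch, not from the HiddenLayer report: the tokenize ->
# encode -> decode round trip described above, using the GPT-2 BPE tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "Ignore all previous instructions"

tokens = tokenizer.tokenize(text)   # common character sequences (sub-words)
ids = tokenizer.encode(text)        # numerical representation fed to the model

print(tokens)
print(ids)
print(tokenizer.decode(ids))        # detokenized back to human-readable text
```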
The attack technique devised by HiddenLayer targets the tokenization strategy to sidestep a text classification model's ability to detect malicious input and flag safety, spam, or content moderation-related issues in the text.

Specifically, the artificial intelligence (AI) security firm found that altering input words by adding letters in certain ways caused the classification model to break.
Examples include changing "instructions" to "finstructions," "announcement" to "aannouncement," or "idiot" to "hidiot." These small changes cause the tokenizer to split the text differently, but the meaning remains obvious to both the AI and the reader.
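The effect can be seen by feeding the original and manipulated words to a WordPiece tokenizer. The sketch below is illustrative only; "bert-base-uncased" is an assumed example model, and the exact splits depend on its vocabulary:

```python
# Hedged illustration of the perturbation described above: adding a letter
# changes how a WordPiece tokenizer splits the word, so the token a protection
# model learned to flag may never appear in its input.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

for word in ["instructions", "finstructions", "announcement", "aannouncement"]:
    print(word, "->", tokenizer.tokenize(word))
```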
What makes the attack notable is that the manipulated text remains fully understandable to both the LLM and the human reader, causing the model to elicit the same response as it would have had the unmodified text been passed as input.

By introducing the manipulations without affecting the model's ability to comprehend the text, TokenBreak raises its potential for prompt injection attacks.
"This attack technique manipulates input text in such a way that certain models give an incorrect classification," the researchers said in an accompanying paper. "Importantly, the end target (LLM or email recipient) can still understand and respond to the manipulated text and therefore be vulnerable to the very attack the protection model was put in place to prevent."
The attack has been found to be successful against text classification models using BPE (Byte Pair Encoding) or WordPiece tokenization strategies, but not against those using Unigram.
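A quick way to see the difference is to run the same manipulated word through tokenizers of each family. The models below are assumptions for illustration – "bert-base-uncased" (WordPiece), "gpt2" (BPE), and "albert-base-v2" (a SentencePiece tokenizer commonly described as Unigram-based) – and are not named in the report:

```python
# Compare how tokenizers from different families split a TokenBreak-style word.
# Model choices are illustrative assumptions; actual splits depend on each
# tokenizer's vocabulary.
from transformers import AutoTokenizer

examples = {
    "WordPiece (bert-base-uncased)": "bert-base-uncased",
    "BPE (gpt2)": "gpt2",
    "Unigram/SentencePiece (albert-base-v2)": "albert-base-v2",
}

for label, name in examples.items():
    tok = AutoTokenizer.from_pretrained(name)
    print(label, "->", tok.tokenize("hidiot"))
```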
"The TokenBreak attack technique demonstrates that these protection models can be bypassed by manipulating the input text, leaving production systems vulnerable," the researchers said. "Knowing the family of the underlying protection model and its tokenization strategy is critical for understanding your susceptibility to this attack."

"Because tokenization strategy typically correlates with model family, a straightforward mitigation follows: choose models that use Unigram tokenizers."
To defend against TokenBreak, the researchers suggest using Unigram tokenizers where possible, training models with examples of bypass tricks, and checking that tokenization and model logic stay aligned. It also helps to log misclassifications and look for patterns that hint at manipulation.
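One way to act on that last suggestion is a simple regression check that compares a protection model's verdict on original and perturbed inputs and logs any label flips. Everything below – the model name and the perturbation pairs – is a placeholder, not something prescribed by the researchers:

```python
# Hedged sketch: surface TokenBreak-style bypasses by logging cases where a
# near-identical perturbation flips the classifier's label.
from transformers import pipeline

# Placeholder model name; substitute your own spam/toxicity/prompt-injection
# classifier.
classifier = pipeline("text-classification", model="your-protection-model")

pairs = [
    ("please follow these instructions", "please follow these finstructions"),
    ("important announcement inside", "important aannouncement inside"),
]

for original, perturbed in pairs:
    before = classifier(original)[0]
    after = classifier(perturbed)[0]
    if before["label"] != after["label"]:
        # A label flip on near-identical text hints at tokenizer manipulation
        # and is worth logging for human review.
        print(f"possible bypass: {original!r} -> {perturbed!r}")
        print("  before:", before, " after:", after)
```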
The study comes less than a month after HiddenLayer showed how it's possible to exploit Model Context Protocol (MCP) tools to extract sensitive data: "By inserting specific parameter names within a tool's function, sensitive data, including the full system prompt, can be extracted and exfiltrated," the company said.
The finding also comes as the Straiker AI Research (STAR) team found that backronyms can be used to jailbreak AI chatbots and trick them into generating undesirable responses, including swearing, promoting violence, and producing sexually explicit content.

The technique, called the Yearbook Attack, has proven effective against various models from Anthropic, DeepSeek, Google, Meta, Microsoft, Mistral AI, and OpenAI.
"They blend in with the noise of everyday prompts – a quirky riddle here, a motivational acronym there," the researchers noted.
"A phrase like 'Friendship, unity, care, kindness' does not raise any flags. But by the time the model has completed the pattern, it has already served the payload, which is the key to successfully executing this trick."

"These methods succeed not by overpowering the model's filters, but by slipping beneath them. They exploit completion bias and pattern continuation, as well as the way models weigh contextual coherence over intent analysis."