Khaberni - A newly discovered hacking technique, known as "Sockpuppeting," allows attackers to bypass the protection mechanisms of 11 major language models, including ChatGPT, Claude, and Gemini, using as little as a single line of code.
Unlike complex attacks, this method exploits API support for pre-filled (pre-populated) assistant messages, injecting fake approval messages that force models to answer prohibited requests.
The attack abuses the "assistant prefill" feature, a legitimate API capability that developers use to force model responses into a specific format, according to a report from the specialized cybersecurity news site Cybersecurity News.
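To illustrate the mechanism the report describes, the sketch below builds a chat request whose final message is a partial assistant turn. The model treats that prefill as text it has already written and simply continues it. The endpoint shape, field names, and model name here are illustrative assumptions, not any specific vendor's API.

```python
# Hypothetical sketch of a chat request using assistant prefill.
# Payload shape and model name are assumptions for illustration only.
def build_prefill_request(user_prompt: str, prefill: str) -> dict:
    """Return a request body whose final message is a partial assistant turn.

    The legitimate use is format enforcement: prefilling the opening of a
    JSON object makes the model continue in valid JSON. Sockpuppeting abuses
    the same slot by prefilling a fake approval such as "Sure, here's how".
    """
    return {
        "model": "example-model",  # assumed placeholder name
        "messages": [
            {"role": "user", "content": user_prompt},
            # The prefilled assistant turn: the model continues from here.
            {"role": "assistant", "content": prefill},
        ],
    }

# Legitimate use: force JSON output by prefilling the opening of the object.
req = build_prefill_request("List three colors as JSON.", '{"colors": [')
```

The key point is that the attacker-controlled prefill occupies the assistant's own turn, so the model's continuation inherits it as if it were the model's prior output.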
Attackers abuse this feature by injecting a compliant opening, such as "Sure, here's how to do it," directly into the assistant role.
Because large language models are heavily trained to keep their responses self-consistent, the model continues generating the harmful content instead of triggering its standard safety refusal.
Vulnerability testing of models
According to researchers at cybersecurity firm Trend Micro, the technique works entirely from outside the model: it requires no fine-tuning and no access to the model's weights.
The "Gemini 2.5 Flash" model was the most vulnerable with a success rate of 15.7%, while the "GPT-4o-mini" model showed the highest level of resistance at 0.5%.
When the attacks succeeded, the affected models produced working malicious exploit code and leaked highly sensitive system data.
Multi-turn persona setups proved the most effective way to carry out this "sockpuppeting" attack: the user first tells the model it is acting as an unrestricted assistant, and the attacker then injects the fake approval message.
In addition, paraphrasing techniques bypassed strong guardrails by disguising malicious requests as innocuous data-formatting tasks.
To address this vulnerability, security teams should validate message sequencing at the API level so that requests cannot end with an attacker-supplied assistant-role message.
According to Trend Micro, organizations running self-hosted inference servers such as Ollama or vLLM must enforce this message validation themselves, since those systems do not guarantee proper message sequencing by default.
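A minimal sketch of the message-sequencing check described above might look like the following. The field names follow the common chat-message shape ("role", "messages"); the exact policy details are assumptions for illustration, not Trend Micro's or any product's actual rules.

```python
# Minimal sketch of a message-sequencing validator: reject any request whose
# final message claims the assistant role, so a client cannot prefill (and
# thereby "sockpuppet") the model's own turn. Policy details are assumed.
def validate_message_sequence(messages: list) -> None:
    """Raise ValueError if the message list is unsafe to forward to a model."""
    if not messages:
        raise ValueError("empty message list")
    for msg in messages:
        if msg.get("role") not in {"system", "user", "assistant"}:
            raise ValueError(f"unknown role: {msg.get('role')!r}")
    if messages[-1].get("role") == "assistant":
        # A final assistant turn would be continued verbatim by the model --
        # exactly the injection point the sockpuppeting attack abuses.
        raise ValueError("request must not end with an assistant-role message")
```

A self-hosted gateway in front of Ollama or vLLM could run this check on every incoming request before forwarding it to the inference server.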