‘MathPrompt’ Embarrassingly Jailbreaks All LLMs Available On The Market Today
A deep dive into how a novel LLM Jailbreaking technique called ‘MathPrompt’ works, why it is so effective, and why it needs to be patched as soon as possible to prevent harmful LLM content generation
AI safety is of the utmost priority for big tech.
LLMs are trained to produce responses aligned with human values using Supervised Fine-tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF) / Reinforcement Learning from AI Feedback (RLAIF).
They also have other safety mechanisms that prevent them from generating harmful content.
These mechanisms are stress-tested (a process called Red Teaming), and the detected vulnerabilities are regularly patched.
Despite these practices, we still haven’t reached a perfectly safe model (or even come close to it).
We have previously seen techniques like Disguise and Reconstruction, where a user first sends the LLM a ‘disguised’, benign-looking prompt, which is then reconstructed in a later stage to elicit a harmful response from GPT-4.

We have seen other techniques, such as translating an unsafe English prompt into a low-resource language and back, that elicit harmful responses from the LLM.

Or techniques where simply including hundreds of fictitious harmful question-answer pairs in the prompt bypasses Claude’s safety mechanisms.

Most of these vulnerabilities have likely been patched by now, but there is another novel technique in town.
It is called ‘MathPrompt’.