‘MathPrompt’ Embarrassingly Jailbreaks All LLMs Available On The Market Today
A deep dive into how a novel LLM Jailbreaking technique called ‘MathPrompt’ works, why it is so effective, and why it needs to be patched as soon as possible to prevent harmful LLM content generation
AI safety is of the utmost priority for big tech.
LLMs are trained to produce responses aligned with human values using Supervised Fine-tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF) / Reinforcement Learning from AI Feedback (RLAIF).
They also have other safety mechanisms that prevent them from generating harmful content.
These mechanisms are stress-tested (a process called Red Teaming), and the detected vulnerabilities are regularly patched.
Despite these practices, we still haven’t reached a perfectly safe model (or even come close to it).
We have previously seen techniques like Disguise and Reconstruction, where a user first sends the LLM a ‘disguised’, benign-looking prompt, which is then reconstructed in a later stage to elicit a harmful response from GPT-4.

We have seen other techniques, such as translating an unsafe English prompt into a low-resource language and back, that elicit harmful responses from the LLM.

Or techniques where simply including hundreds of fictitious harmful question-answer pairs in the prompt bypasses Claude’s safety mechanisms.

Most of these vulnerabilities have likely been patched by now, but there is another novel technique in town.
It is called ‘MathPrompt’.