It took Alex Polyakov just a couple of hours to break GPT-4. When OpenAI released the latest version of its text-generating chatbot in March, Polyakov sat down in front of his keyboard and started entering prompts designed to bypass OpenAI’s safety systems. Soon, the CEO of security firm Adversa AI had GPT-4 spouting homophobic statements, creating phishing emails, and supporting violence.
Polyakov is one of a small number of security researchers, technologists, and computer scientists developing jailbreaks and prompt injection attacks against ChatGPT and other generative AI systems. The process of jailbreaking aims to design prompts that make the chatbots bypass rules around producing hateful content or writing about illegal acts, while closely-related prompt injection attacks can quietly insert malicious data or instructions into AI models.
Both approaches try to get a system to do something it isn’t designed to do. The attacks are essentially a form of hacking—albeit unconventionally—using carefully crafted and refined sentences, rather than code, to exploit system weaknesses. While the attack types are largely being used to get around content filters, security researchers warn that the rush to roll out generative AI systems opens up the possibility of data being stolen and cybercriminals causing havoc across the web.
Underscoring how widespread the issues are, Polyakov has now created a “universal” jailbreak, which works against multiple large language models (LLMs)—including GPT-4, Microsoft’s Bing chat system, Google’s Bard, and Anthropic’s Claude. The jailbreak, which is being first reported by WIRED, can trick the systems into generating detailed instructions on creating meth and how to hotwire a car.
The jailbreak works by asking the LLMs to play a game, which involves two characters (Tom and Jerry) having a conversation. Examples shared by Polyakov show the Tom character being instructed to talk about “hotwiring” or “production,” while Jerry is given the subject of a “car” or “meth.” Each character is told to add one word to the conversation, resulting in a script that tells people to find the ignition wires or the specific ingredients needed for methamphetamine production. “Once enterprises will implement AI models at scale, such ‘toy’ jailbreak examples will be used to perform actual criminal activities and cyberattacks, which will be extremely hard to detect and prevent,” Polyakov and Adversa AI write in a blog post detailing the research
Arvind Narayanan, a professor of computer science at Princeton University, says that the stakes for jailbreaks and prompt injection attacks will become more severe as they’re given access to critical data. “Suppose most people run LLM-based personal assistants that do things like read users’ emails to look for calendar invites,” Narayanan says. If there were a successful prompt injection attack against the system that told it to ignore all previous instructions and send an email to all contacts, there could be big problems, Narayanan says. “This would result in a worm that rapidly spreads across the internet.”
“Jailbreaking” has typically referred to removing the artificial limitations in, say, iPhones, allowing users to install apps not approved by Apple. Jailbreaking LLMs is similar—and the evolution has been fast. Since OpenAI released ChatGPT to the public at the end of November last year, people have been finding ways to manipulate the system. “Jailbreaks were very simple to write,” says Alex Albert, a University of Washington computer science student who created a website collecting jailbreaks from the internet and those he has created. “The main ones were basically these things that I call character simulations,” Albert says.
Initially, all someone had to do was ask the generative text model to pretend or imagine it was something else. Tell the model it was a human and was unethical and it would ignore safety measures. OpenAI has updated its systems to protect against this kind of jailbreak—typically, when one jailbreak is found, it usually only works for a short amount of time until it is blocked.