A watermark for chatbots can expose text written by an AI

The tool could let teachers spot plagiarism or help social media platforms fight disinformation bots.

Melissa Heikkilä

January 27, 2023

Hidden patterns purposely buried in AI-generated texts could help identify them as such, allowing us to tell whether the words we’re reading are written by a human or not.

These “watermarks” are invisible to the human eye but let computers detect that the text probably comes from an AI system. If embedded in large language models, they could help prevent some of the problems that these models have already caused.

For example, since OpenAI’s chatbot ChatGPT was launched in November, students have already started cheating by using it to write essays for them. News website CNET has used ChatGPT to write articles, only to have to issue corrections amid accusations of plagiarism. Building the watermarking approach into such systems before they’re released could help address such problems.

In studies, these watermarks have already been used to identify AI-generated text with near certainty. Researchers at the University of Maryland, for example, were able to spot text created by Meta’s open-source language model, OPT-6.7B, using a detection algorithm they built. The work is described in a paper that’s yet to be peer-reviewed, and the code will be available for free around February 15.

AI language models work by predicting and generating one word at a time. After each word, the watermarking algorithm randomly divides the language model’s vocabulary into words on a “greenlist” and a “redlist” and then prompts the model to choose words on the greenlist.
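As a rough illustration of that step, here is a minimal Python sketch. It is not the Maryland team's code: the hash-based split (`is_green`), the size of the score bonus (`bonus`), and the greedy choice in `pick_next` are all assumptions made for the example, and the toy scores stand in for a real model's predictions.

```python
import hashlib


def is_green(prev_word, word):
    """Hypothetical split: hash the (previous word, candidate) pair and
    call the candidate green when the first digest byte is even, so
    roughly half of the vocabulary is green after any given word."""
    digest = hashlib.sha256(f"{prev_word}\x00{word}".encode("utf-8")).digest()
    return digest[0] % 2 == 0


def boost_green(scores, prev_word, bonus=2.0):
    """Nudge the model toward the greenlist by adding an assumed bonus
    to the score of every greenlisted candidate."""
    return {
        word: score + (bonus if is_green(prev_word, word) else 0.0)
        for word, score in scores.items()
    }


def pick_next(scores, prev_word):
    """Pick the highest-scoring candidate after the boost. Real models
    usually sample instead; greedy choice keeps the sketch short."""
    biased = boost_green(scores, prev_word)
    return max(biased, key=biased.get)


# Toy scores standing in for a language model's output after "beautiful".
toy_scores = {"flower": 1.0, "orchid": 1.0, "sunset": 0.5, "day": 0.2}
next_word = pick_next(toy_scores, "beautiful")
```

Because the split depends only on the previous word and a fixed hash, anyone who knows the rule can recompute which words were green without access to the model itself.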

The more greenlisted words in a passage, the more likely it is that the text was generated by a machine. Text written by a person tends to contain a more random mix of words. For example, after the word “beautiful,” the watermarking algorithm could classify the word “flower” as green and “orchid” as red. The AI model with the watermarking algorithm would then be more likely to use the word “flower” than “orchid,” explains Tom Goldstein, an assistant professor at the University of Maryland, who was involved in the research.
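The counting idea can be sketched in a few lines of Python. This is an illustration under assumptions, not the researchers' detector: `is_green` is a hypothetical hash-based split that marks roughly half of all words green after any given word, and the z-score is one standard way to ask whether a passage contains more green words than chance alone would predict.

```python
import hashlib
import math


def is_green(prev_word, word):
    """Hypothetical split: a word is green after prev_word when the
    first byte of the pair's SHA-256 digest is even (about half the time)."""
    digest = hashlib.sha256(f"{prev_word}\x00{word}".encode("utf-8")).digest()
    return digest[0] % 2 == 0


def watermark_z_score(words):
    """Count green words and compare the count with the 50% expected by
    chance. Larger positive scores suggest watermarked text."""
    pairs = list(zip(words, words[1:]))
    if not pairs:
        return 0.0
    n = len(pairs)
    green = sum(is_green(prev, word) for prev, word in pairs)
    # One-proportion z-test with an assumed green fraction of 0.5.
    return (green - 0.5 * n) / math.sqrt(0.25 * n)


score = watermark_z_score("the quick brown fox jumps over the lazy dog".split())
```

Under these assumptions, a passage in which all 20 of 20 word transitions land on the greenlist scores about 4.5, while text written without the watermark should hover near 0.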

ChatGPT is one of a new breed of large language models that generate text so fluent it could be mistaken for human writing. These AI models regurgitate facts confidently but are notorious for spewing falsehoods and biases. To the untrained eye, it can be almost impossible to distinguish a passage written by an AI model from one written by a human. The breathtaking speed of AI development means that new, more powerful models quickly make our existing tool kit for detecting synthetic text less effective. It’s a constant race for AI developers to build new safety tools that can keep pace with the latest generation of AI models.

“Right now, it’s the Wild West,” says John Kirchenbauer, a researcher at the University of Maryland, who was involved in the watermarking work. He hopes watermarking tools might give AI-detection efforts the edge. The tool his team has developed could be adjusted to work with any AI language model that predicts the next word, he says.

The findings are both promising and timely, says Irene Solaiman, policy director at AI startup Hugging Face, who worked on studying AI output detection in her previous role as an AI researcher at OpenAI, but was not involved in this research.

“As models are being deployed at scale, more people outside the AI community, likely without computer science training, will need to access detection methods,” says Solaiman.

There are limitations to this new method, however. Watermarking only works if it is embedded in the large language model by its creators right from the beginning. Although OpenAI is reputedly working on methods to detect AI-generated text, including watermarks, the research remains highly secretive. The company doesn’t tend to give external parties much information about how ChatGPT works or was trained, much less access to tinker with it. OpenAI didn’t immediately respond to our request for comment.

It’s also unclear how the new work will apply to other models besides Meta’s, such as ChatGPT, Solaiman says. The AI model the watermark was tested on is also smaller than popular models like ChatGPT.

More testing is needed to explore different ways someone might try to fight back against watermarking methods, but the researchers say that attackers’ options are limited. “You’d have to change about half the words in a passage of text before the watermark could be removed,” says Goldstein.

“It’s dangerous to underestimate high schoolers, so I won’t do that,” Solaiman says. “But generally the average person will likely be unable to tamper with this kind of watermark.”