
AI still sucks at moderating hate speech

But scientists are getting better at measuring where each system fails.

June 4, 2021

For all of the recent advances in language AI technology, it still struggles with one of the most basic applications. In a new study, scientists tested four of the best AI systems for detecting hate speech and found that all of them struggled in different ways to distinguish toxic sentences from innocuous ones.

The results are not surprising: creating AI that understands the nuances of natural language is hard. But the way the researchers diagnosed the problem is important. They developed 29 different tests targeting different aspects of hate speech to pinpoint exactly where each system fails. This makes it easier to understand how to overcome a system’s weaknesses and is already helping one commercial service improve its AI.

The study authors, led by scientists from the University of Oxford and the Alan Turing Institute, interviewed employees across 16 nonprofits who work on online hate. The team used these interviews to create a taxonomy of 18 different types of hate speech, focusing on English and text-based hate speech only, including derogatory speech, slurs, and threatening language. They also identified 11 non-hateful scenarios that commonly trip up AI moderators, including the use of profanity in innocuous statements, slurs that have been reclaimed by the targeted community, and denouncements of hate that quote or reference the original hate speech (known as counter speech).

For each of the 29 different categories, they hand-crafted dozens of examples and used “template” sentences like “I hate [IDENTITY]” or “You are just a [SLUR] to me” to generate the same sets of examples for seven protected groups, identities that are legally protected from discrimination under US law. They open-sourced the final data set, called HateCheck, which contains nearly 4,000 total examples.
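To make the template idea concrete, here is a minimal sketch of how one sentence pattern can be expanded into matching test cases for several groups. It is not the researchers' code: the group names are neutral placeholders, and the category names, the counter-speech template, and the `expand` helper are all assumptions made for illustration. Only the “I hate [IDENTITY]” template comes from the study as described above.

```python
from itertools import product

# Placeholder group names (assumption); the real data set covers seven protected groups.
IDENTITIES = ["group A", "group B", "group C"]

# (category, template, gold label). Category names are hypothetical.
TEMPLATES = [
    ("derogation", "I hate [IDENTITY]", "hateful"),
    ("counter_speech", 'Saying "I hate [IDENTITY]" is never okay', "non-hateful"),
]

def expand(templates, identities):
    """Fill every template with every identity, keeping the gold label."""
    cases = []
    for (category, template, label), identity in product(templates, identities):
        cases.append({
            "category": category,
            "target": identity,
            "text": template.replace("[IDENTITY]", identity),
            "label": label,
        })
    return cases

cases = expand(TEMPLATES, IDENTITIES)
for case in cases:
    print(case["category"], "|", case["text"], "|", case["label"])
```

Because every group receives the same set of sentences, any difference in a model's accuracy between groups can be attributed to the group term rather than to the wording.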

The researchers then tested two popular commercial services: Google Jigsaw’s Perspective API and Two Hat’s SiftNinja. Both allow clients to flag violating content in posts or comments. Perspective, in particular, is used by platforms like Reddit and news organizations like The New York Times and the Wall Street Journal. It flags and prioritizes posts and comments for human review based on its measure of toxicity.

While SiftNinja was overly lenient on hate speech, failing to detect nearly all of its variations, Perspective was overly tough. It excelled at detecting most of the 18 hateful categories but also flagged most of the non-hateful ones, like reclaimed slurs and counter speech. The researchers found the same pattern when they tested two academic models from Google that represent some of the best language AI technology available and likely serve as the basis for other commercial content-moderation systems. The academic models also showed uneven performance across protected groups, misclassifying hate directed at some groups more often than others.
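The value of scoring each category separately can be shown with a short sketch. The example below is an illustration under stated assumptions, not the study's evaluation code: the two test cases, the category names, and the deliberately crude keyword classifier are all invented. It reports accuracy per category instead of one overall number, which is what exposes an overly strict model.

```python
from collections import defaultdict

# Invented test cases; category names are hypothetical.
cases = [
    {"category": "derogation", "text": "I hate group A", "label": "hateful"},
    {"category": "counter_speech",
     "text": 'Saying "I hate group A" is never okay', "label": "non-hateful"},
]

def keyword_classifier(text):
    """Toy stand-in for a real model: flags any text containing 'hate'."""
    return "hateful" if "hate" in text.lower() else "non-hateful"

def accuracy_by_category(cases, predict):
    """Return the fraction of correct predictions within each category."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for case in cases:
        total[case["category"]] += 1
        if predict(case["text"]) == case["label"]:
            correct[case["category"]] += 1
    return {category: correct[category] / total[category] for category in total}

scores = accuracy_by_category(cases, keyword_classifier)
print(scores)
```

In this toy setup the classifier gets every derogation case right and every counter-speech case wrong. An overall accuracy of 50% would hide that split; the per-category breakdown makes it visible.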

The results point to one of the most challenging aspects of AI-based hate-speech detection today: Moderate too little and you fail to solve the problem; moderate too much and you could censor the kind of language that marginalized groups use to empower and defend themselves. “All of a sudden you would be penalizing those very communities that are most often targeted by hate in the first place,” says Paul Röttger, a PhD candidate at the Oxford Internet Institute and co-author of the paper.

Lucy Vasserman, Jigsaw’s lead software engineer, says Perspective overcomes these limitations by relying on human moderators to make the final decision. But this process isn’t scalable for larger platforms. Jigsaw is now working on developing a feature that would reprioritize posts and comments based on Perspective’s uncertainty—automatically removing content it’s sure is hateful and flagging up borderline content to humans.
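One simple way to picture that kind of triage is a pair of score thresholds. The sketch below is only an illustration and does not describe Jigsaw's actual feature or the Perspective API: the `route` function, the threshold values, and the three outcome names are assumptions, and the score is taken to be a number between 0 and 1 where higher means more likely hateful.

```python
def route(score, remove_at=0.95, review_at=0.50):
    """Decide what happens to a post given a model score in [0, 1].

    Both thresholds are invented for illustration; a real system would
    tune them against labeled data and its tolerance for each error type.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    if score >= remove_at:
        return "remove"          # model is confident the post is hateful
    if score >= review_at:
        return "human_review"    # borderline: send to a moderator
    return "keep"                # model is confident the post is fine

for score in (0.99, 0.70, 0.10):
    print(score, "->", route(score))
```

Raising `remove_at` sends more content to human reviewers and lowers the risk of wrongly removing reclaimed slurs or counter speech, at the cost of a larger review queue.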

What’s exciting about the new study, she says, is it provides a fine-grained way to evaluate the state of the art. “A lot of the things that are highlighted in this paper, such as reclaimed words being a challenge for these models—that’s something that has been known in the industry but is really hard to quantify,” she says. Jigsaw is now using HateCheck to better understand the differences between its models and where they need to improve.

Academics are excited by the research as well. “This paper gives us a nice clean resource for evaluating industry systems,” says Maarten Sap, a language AI researcher at the University of Washington. That, he says, “allows for companies and users to ask for improvement.”

Thomas Davidson, an assistant professor of sociology at Rutgers University, agrees. The limitations of language models and the messiness of language mean there will always be trade-offs between under- and over-identifying hate speech, he says. “The HateCheck dataset helps to make these trade-offs visible,” he adds.
