Machine-learning project takes aim at disinformation

Building a better internet means stopping propaganda and fake news before it spreads, says language processing expert Preslav Nakov. The first step is verifying and trusting news sources.

MIT Technology Review Insights

May 3, 2021

In association with Qatar Foundation

There’s nothing new about conspiracy theories, disinformation, and untruths in politics. What is new is how quickly malicious actors can spread disinformation when the world is tightly connected across social networks and internet news sites. We can give up on the problem and rely on the platforms themselves to fact-check stories or posts and screen out disinformation—or we can build new tools to help people identify disinformation as soon as it crosses their screens.

Preslav Nakov is a computer scientist at the Qatar Computing Research Institute in Doha specializing in speech and language processing. He leads a project using machine learning to assess the reliability of media sources. That allows his team to gather news articles alongside signals about their trustworthiness and political biases, all in a Google News-like format.

“You cannot possibly fact-check every single claim in the world,” Nakov explains. Instead, focus on the source. “I like to say that you can fact-check the fake news before it was even written.” His team’s tool, called the Tanbih News Aggregator, is available in Arabic and English and gathers articles in areas such as business, politics, sports, science and technology, and covid-19.

Business Lab is hosted by Laurel Ruma, editorial director of Insights, the custom publishing division of MIT Technology Review. The show is a production of MIT Technology Review, with production help from Collective Next.

This podcast was produced in partnership with the Qatar Foundation.

Show notes and links

Tanbih News Aggregator

Qatar Computing Research Institute

“Even the best AI for spotting fake news is still terrible,” MIT Technology Review, October 3, 2018

Full transcript

Laurel Ruma: From MIT Technology Review, I’m Laurel Ruma, and this is Business Lab, the show that helps business leaders make sense of new technologies coming out of the lab and into the marketplace. Our topic today is disinformation. From fake news, to propaganda, to deep fakes, it may seem like there's no defense against weaponized news. However, scientists are researching ways to quickly identify disinformation to not only help regulators and tech companies, but also citizens, as we all navigate this brave new world together.

Two words for you: spreading infodemic.

My guest is Dr. Preslav Nakov, who is a principal scientist at the Qatar Computing Research Institute. He leads the Tanbih project, which was developed in collaboration with MIT. He’s also the lead principal investigator of a QCRI MIT collaboration project on Arabic speech and language processing for cross language information search and fact verification. This episode of Business Lab is produced in association with the Qatar Foundation. Welcome, Dr. Nakov.

Preslav Nakov: Thanks for having me.

Laurel Ruma: So why are we deluged with so much online disinformation right now? This isn’t a new problem, right?

Nakov: Of course, it’s not a new problem. It’s not the case that it’s for the first time in the history of the universe that people are telling lies or media are telling lies. We had the yellow press, we had all these tabloids for years. It became a problem because of the rise of social media, when it suddenly has become possible to have a message that you can send to millions and millions of people. And not only that, you could now tell different things to different people. So, you could microprofile people and you could deliver them a specific personalized message that is designed, crafted for a specific person with a specific purpose to press a specific button on them. The main problem with fake news is not that it’s false. The main problem is that the news actually got weaponized, and this is something that Sir Tim Berners-Lee, the creator of the World Wide Web has been complaining about: that his invention was weaponized.

Laurel: Yeah, Tim Berners-Lee is obviously distraught that this has happened, and it’s not just in one country or another. It is actually around the world. So is there an actual difference between fake news, propaganda, and disinformation?

Nakov: Sure, there is. I don’t like the term “fake news.” This is the term that has picked up: it was declared “word of the year” by several dictionaries in different years, shortly after the previous presidential election in the US. The problem with fake news is that, first of all, there’s no clear definition. I have been looking into dictionaries, how they define the term. One major dictionary said, “we are not really going to define the term at all, because it’s something self-explanatory: we have ‘news,’ we have ‘fake,’ and it’s news that’s fake; it’s compositional; it was used in the 19th century; there is nothing to define.” Different people put different meanings into this. To some people, fake news is just news they don’t like, regardless of whether it is false. But the main problem with fake news is that it really misleads people, and sadly, even certain major fact-checking organizations, to only focus on one thing, whether it’s true or not.

I prefer, and most researchers working on this prefer, the term “disinformation.” And this is a term that is adopted by major organizations like the United Nations, NATO, the European Union. And disinformation is something that has a very clear definition. It has two components. First, it is something that is false, and second, it has a malicious intent: intent to do harm. And again, the vast majority of research, the vast majority of efforts, many fact-checking initiatives, focus on whether something is true or not. And it’s typically the second part that is actually important. The part whether there is malicious intent. And this is actually what Sir Tim Berners-Lee was talking about when he first talked about the weaponization of the news. The main problem with fake news—if you talk to journalists, they will tell you this—the main problem with fake news is not that it is false. The problem is that it is a political weapon.

And propaganda. What is propaganda? Propaganda is a term that is orthogonal to disinformation. Again, disinformation has two components. It’s false and it has malicious intent. Propaganda also has two components. One is, somebody is trying to convince us of something. And second, there is a predefined goal. Now, we should pay attention. Propaganda is not true; it’s not false. It’s not good; it’s not bad. That’s not part of the definition. So, if a government has a campaign to persuade the public to get vaccinated, you can argue that’s for a good purpose, or let’s say Greta Thunberg trying to scare us that hundreds of species are going extinct every day. This is a propaganda technique: appeal to fear. But you can argue that’s for a good purpose. So, propaganda is not bad; it’s not good. It’s not true; it’s not false.

Laurel: But propaganda has the goal to do something. And by forcing that goal, it is really appealing to that fear factor. So the distinction between disinformation and propaganda is the fear.

Nakov: No, fear is just one of the techniques. We have been looking into this. So, a lot of research has been focusing on binary classification. Is this true? Is this false? Is this propaganda? Is this not propaganda? We have looked a little bit deeper. We have been looking into what techniques have been used to do propaganda. And again, you can talk about propaganda, you can talk about persuasion or public relations, or mass communication. It’s basically the same thing. Different terms for about the same thing. And regarding propaganda techniques, there are two kinds. The first kind are appeals to emotions: it can be appeal to fear, it can be appeal to strong emotions, it can be appeal to patriotic feelings, and so on and so forth. And the other half are logical fallacies: things like black-and-white fallacy. For example, you’re either with us or against us. Or bandwagon. Bandwagon is like, oh, the latest poll shows that 57% are going to vote for Hillary, so we are on the right side of history, you have to join us.

There are several other propaganda techniques. There is red herring, there is intentional obfuscation. We have looked into 18 of those: half of them appeal to emotions, and half of them use certain kinds of logical fallacies, or broken logical reasoning. And we have built tools to detect those in texts, so that you can really show them to the user and make this explicit, so that people can understand how they are being manipulated.

Laurel: So in the context of the covid-19 pandemic, the director general of the World Health Organization said, and I quote, “We’re not just fighting an epidemic; we’re fighting an infodemic.” How do you define infodemic? What are some of those techniques that we can use to also avoid harmful content?

Nakov: Infodemic, this is something new. Actually, MIT Technology Review had a great article about a year ago, last February, that was talking about that. The covid-19 pandemic has given rise to the first global social media infodemic. And again, around the same time, back in February, the World Health Organization had on their website a list of top five priorities in the fight against the pandemic, and fighting the infodemic was number two in that list. So, it’s definitely a big problem. What is the infodemic? It’s a merger of a pandemic and the pre-existing disinformation that was already present in social media. It’s also a blending of political and health disinformation. Before that, the political part, and, let’s say, the anti-vaxxer movement, those were separate. Now, everything is blended together.

Laurel: And that’s a real problem. I mean, the World Health Organization’s concern should be fighting the pandemic, but then its secondary concern is fighting disinformation. Finding hope in that kind of fear is very difficult. So one of the projects that you’re working on is called Tanbih. And Tanbih is a news aggregator, right? That uncovers disinformation. So the project itself has a number of goals. One is to uncover stance, bias, and propaganda in the news. The second is to promote different viewpoints and engage users. But then the third is to limit the effect of fake news. How does Tanbih work?

Nakov: Tanbih started indeed as a news aggregator, and it has grown into something quite larger than that, into a project, which is a mega-project in the Qatar Computing Research Institute. And it spans people from several groups in the institute, and it is developed in cooperation with MIT. We started the project with the aim of developing tools that we can actually put in the hands of the final users. And we decided to do this as part of a news aggregator, think of something like Google News. And as users are reading the news, we are signaling to them when something is propagandistic, and we’re giving them background information about the source. What we are doing is we are analyzing media in advance and we are building media profiles. So we are showing, telling users to what extent the content is propagandistic. We are telling them whether the news is from a trustworthy source or not, whether it is biased: left, center, right bias. Whether it is extreme: extreme left, extreme right. Also, whether it is biased with respect to specific topics.

And this is something that is very useful. So, imagine that you are reading some article that is skeptical about global warming. If we tell you, look, this news outlet has always been very biased in the same way, then you’ll probably take it with a grain of salt. We are also showing the perspective of reporting, the framing. If you think about it, covid-19, Brexit, any major event can be reported from different perspectives. For example, let’s take covid-19. It has a health aspect, that’s for sure, but it also has an economic aspect, even a political aspect, it has a quality-of-life aspect, it has a human rights aspect, a legal aspect. Thus, we are profiling the media and we are letting users see what their perspective is.

Regarding the media profiles, we are further exposing them as a browser plugin, so that as you are visiting different websites, you can actually click on the plugin and you can get very brief background information about the website. And you can also click on a link to access a more detailed profile. And this is very important: the focus is on the source. Again, most research has been focusing on “is this claim true or not?” And is this piece of news true or not? That’s only half of the problem. The other half is actually whether it is harmful, which is typically ignored.

The other thing is that we cannot possibly fact-check every single claim in the world. Not manually, not automatically. Manually, that’s out of the question. There was a study from MIT Media Lab about two years ago, where they have done a large study on many, many tweets. And it has been shown that false information goes six times farther and spreads much faster than real information. There was another study that is much less famous, but I find it very important, which shows that 50% of the lifetime spread of some very viral fake news happens in the first 10 minutes. In the first 10 minutes! Manual fact-checking takes a day or two, sometimes a week.

Automatic fact-checking? How can we fact-check a claim? Well, if we are lucky, if the claim is that the US economy grew 10% last year, that claim we can automatically check easily, by looking into Wikipedia or some statistical table. But if they say, there was a bomb in this little town two minutes ago? Well, we cannot really fact-check it, because to fact-check it automatically, we need to have some information from somewhere. We want to see what the media are going to write about it or how users are going to react to it. And both of those take time to accumulate. So, basically we have no information to check it. What can we do? What we are proposing is to move at a higher granularity, to focus on the source. And this is what journalists are doing. Journalists are looking into: are there two independent trusted sources that are claiming this?

So we are analyzing media. Even if bad people put a claim in social media, they are probably going to put a link to a website where one can find a whole story. Yet, they cannot create a new fake news website for every fake claim that they are making. They are going to reuse them. Thus, we can monitor which websites are used most frequently, and we can analyze them in advance. And, I like to say that we can fact-check the fake news before it was even written. Because the moment when it’s written, the moment when it’s put in social media and there’s a link to a website, if we have this website in our growing database of continuously analyzed websites, we can immediately tell you whether this is a reliable website or not. Of course, reliable websites might also have poor information, good websites might sometimes be wrong as well. But we can give you an immediate idea.
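As a rough illustration of the source-level lookup described above, the following Python sketch maps a link found in a post to a media profile that was prepared in advance. It is not Tanbih's implementation: the `MEDIA_PROFILES` table, its domains, its labels, and the helper names `domain_of` and `profile_for` are all hypothetical, chosen only to show why the check can be immediate once the outlet has already been analyzed.

```python
from urllib.parse import urlparse

# Hypothetical, hand-written profiles. A real system would build these
# in advance by analyzing each outlet; the domains and labels are made up.
MEDIA_PROFILES = {
    "example-reliable.org": {"factuality": "high", "bias": "center"},
    "example-dubious.net": {"factuality": "low", "bias": "extreme right"},
}


def domain_of(url):
    """Return the lowercased host of a URL, without a leading 'www.'."""
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    return host


def profile_for(url):
    """Look up the pre-computed profile for the site a link points to.

    Returns None when the site has not been analyzed yet.
    """
    return MEDIA_PROFILES.get(domain_of(url))


if __name__ == "__main__":
    link = "https://www.example-dubious.net/story/123"
    print(domain_of(link), profile_for(link))
```

The lookup itself is a dictionary access, so all of the expensive analysis happens before any particular claim is written. A site that is missing from the table simply returns `None`, which mirrors the case where there is not yet any information to go on.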

Beyond the news aggregator, we started looking into doing analytics, but also we are developing tools for media literacy that are showing to people the fine-grained propaganda techniques highlighted in the text: the specific places where propaganda is happening and its specific type. And finally, we are building tools that can support fact-checkers in their work. And those are again problems that are typically overlooked, but extremely important for fact-checkers. Namely, what is worth fact-checking in the first place. Consider a presidential debate. There are more than 1,000 sentences that have been said. You, as a fact-checker, can check maybe 10 or 20 of those. Which ones are you going to fact-check first? What are the most interesting ones? We can help prioritize this. Or there are millions and millions of tweets about covid-19 on a daily basis. And which of those would you like to fact-check as a fact-checker?

The second problem is detecting previously fact-checked claims. One problem with fact-checking technology these days is quality, but the second part is lack of credibility. Imagine an interview with a politician. Can you put the politician on the spot? Imagine a system that automatically does speech recognition, that’s easy, and then does fact-checking. And suddenly you say, “Oh, Mr. X, my AI tells me you are now 96% likely to be lying. Can you elaborate on that? Why are you lying?” You cannot do that. Because you don’t trust the system. You cannot put the politician on the spot in real time or during a political debate. But if the system comes back and says: he just said something that has been fact-checked by this trusted fact-checking organization. And here’s the claim that he made, and here’s the claim that was fact-checked, and see, we know it’s false. Then you can put him on the spot. This is something that can potentially revolutionize journalism.
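To make the idea of detecting previously fact-checked claims more concrete, here is a minimal Python sketch that matches a new statement against a small list of claims that have already been checked. It swaps in plain word overlap (Jaccard similarity) for whatever models the team actually uses, which the interview does not describe at this level. The `FACT_CHECKED` entries, their verdict labels, the 0.5 threshold, and the function names are illustrative assumptions.

```python
import re

# Hypothetical store of previously fact-checked claims. The claims echo
# examples from the interview; the verdict labels are placeholders only.
FACT_CHECKED = [
    ("Covid-19 will disappear in the summer", "FALSE (illustrative)"),
    ("The US economy grew 10% last year", "FALSE (illustrative)"),
]


def tokens(text):
    """Lowercase a text and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Word-overlap similarity between two texts, from 0.0 to 1.0."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def best_match(claim, threshold=0.5):
    """Return (checked_claim, verdict, score) for the closest stored claim,
    or None when nothing is similar enough to count as the same claim."""
    scored = [(jaccard(claim, text), text, verdict)
              for text, verdict in FACT_CHECKED]
    score, text, verdict = max(scored)
    if score < threshold:
        return None
    return text, verdict, score


if __name__ == "__main__":
    print(best_match("covid-19 will disappear in summer"))
    print(best_match("there was a bomb in this little town"))
```

The point of the design is the one made in the interview: the system does not issue its own verdict. It only surfaces an existing fact-check, together with the matched claim, so that a person can compare the two and decide whether they really say the same thing.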

Laurel: So getting back to that point about analytics. To get into the technical details of it, how does Tanbih use artificial intelligence and deep neural networks to analyze that content, if it’s coming across so much data, so many tweets?

Nakov: Tanbih initially was not really focusing on tweets. Tanbih has been focusing primarily on mainstream media. As I said, we are analyzing entire news outlets, so that we are prepared. Because again, there’s a very strong connection between social media and websites. It’s not enough just to put a claim on the Web and spread it. It can spread, but people are going to perceive it as a rumor because there’s no source, there's no further corroboration. So, you still want to look into a website. And then, as I said, by looking into the source, you can get an idea whether you want to trust this claim among other information sources. And the other way around: when we are profiling media, we are analyzing the text of what the media publish.

So, we would say, “OK, let’s look into a few hundred or a few thousand articles by this target news outlet.” Then we would also look into how this medium self-represents in social media. Many of those websites also have social media accounts: how do people react to what they have published on Twitter, on Facebook? And then if the media have other kinds of channels, for example, if they have a YouTube channel, we will go to it and analyze that as well. So we'll look into not only what they say, but how they say it, and this is something that comes from the speech signal. If there is a lot of appeal to emotions, we can detect some of it in text, but some of it we can actually get from the tone.

We are also looking into what others write about this medium, for example, what is written about them in Wikipedia. And we are putting all this together. We are also analyzing the images that are put on this website. We are analyzing the connections between the websites. The relationship between a website and its readers, the overlap in terms of users between different websites. And then we are using different kinds of graph neural networks. So, in terms of neural networks, we're using different kinds of models. It’s primarily deep contextualized text representation based on transformers; that’s what you typically do for text these days. We are also using graph neural networks and we’re using different kinds of convolutional neural networks for image analysis. And we are also using neural networks for speech analysis.

Laurel: So what do we learn by studying this kind of disinformation region by region or by language? How can that actually help governments and healthcare organizations fight disinformation?

Nakov: We can basically give them aggregated information about what is going on, based on a schema that we have been developing for analysis of the tweets. We have designed a very comprehensive schema. We have been looking not only into whether a tweet is true or not, but also into whether it’s spreading panic, or it is promoting a bad cure, or xenophobia, racism. We are automatically detecting whether the tweet is asking an important question that maybe a certain government entity might want to answer. For example, one such question last year was: is covid-19 going to disappear in the summer? It’s something that maybe health authorities might want to answer.

Other things have been offering advice or discussing action taken, and possible cures. So we have been looking into not only negative things, things that you might act on, try to limit, things like panic or racism, xenophobia—things like “don’t eat Chinese food,” “don’t eat Italian food.” Or things like blaming the authorities for their action or inaction, which governments might want to pay attention to and see to what extent it is justified and if they want to do something about it. Also, an important thing a policy maker might want is to monitor social media and detect when there is discussion of a possible cure. And if it’s a good cure, you might want to pay attention. If it’s a bad cure, you might also want to tell people: don’t use that bad cure. And discussion of action taken, or a call for action. If there are many people that say “close the barbershops,” you might want to see why they are saying that and whether you want to listen.

Laurel: Right. Because the government wants to monitor this disinformation for the explicit purpose of helping everyone not take those bad cures, right. Not continue down the path of thinking this propaganda or disinformation is true. So is it a government action to regulate disinformation on social media? Or do you think it’s up to the tech companies to kind of sort it out themselves?

Nakov: So that’s a good question. Two years ago, I was invited by the Inter-Parliamentary Union’s Assembly. They had invited three experts and there were 800 members of parliament from countries around the world. And for three hours, they were asking us questions, basically going around the central topic: what kinds of legislation can they, the national parliaments, pass so that they get a solution to the problem of disinformation once and for all. And, of course, the consensus at the end was that that’s a complex problem and there’s no easy solution.

Certain kinds of legislation definitely play a role. In many countries, certain kinds of hate speech are illegal. And in many countries, there are certain kinds of regulations when it comes to elections and advertisements at election time that apply to regular media and also extend to the web space. And there have been a lot of recent calls for regulations in the UK, in the European Union, even in the US. And that’s a very heated debate, but this is a complex problem, and there’s no easy solution. And there are important players there and those players have to work together.

So certain legislation? Yes. But you also need the cooperation of the social media companies, because the disinformation is happening on their platforms. And they’re in a very good position, the best position actually, to limit the spread or to do something. Or to teach their users, to educate them, that probably they should not spread everything that they read. And then the non-government organizations, journalists, all the fact-checking efforts, this is also very important. And I hope that the efforts that we as researchers are putting into building such tools would also be helpful in that respect.

One thing that we need to pay attention to is that when it comes to regulation through legislation, we should not think necessarily what can we do about this or that specific company. We should think more in the long term. And we should be careful to protect free speech. So it’s kind of a delicate balance.

In terms of fake news, disinformation: the only case where somebody has declared victory, and the only solution that we have seen actually work, is the case of Finland. Back in May 2019, Finland officially declared that they had won the war on fake news. It took them five years. They started working on that after the events in Crimea; they felt threatened and they started a very ambitious media literacy campaign. They focused primarily on schools, but also targeted universities and all levels of society. But, of course, primarily schools. They were teaching students how to tell whether something is fishy. If it makes you too angry, maybe something is not correct. How to do, let’s say, reverse image search to check whether this image that is shown is actually from this event or from somewhere else. And in five years, they declared victory.

So, to me, media literacy is the best long-term solution. And that’s why I’m particularly proud of our tool for fine-grained propaganda analysis, because it really shows the users how they are being manipulated. And I can tell you that my hope is that after people have interacted a little bit with a platform like this, they’ll learn those techniques. And next time they are going to recognize them by themselves. They will not need the platform. And it happened to me and several other researchers who have worked on this problem, it happened to us, and now I cannot read the news properly anymore. Each time I read the news, I spot these techniques because I know them and I can recognize them. If more people can get to that level, that will be good.

Maybe social media companies can do something like that: when a user registers on their platform, they could ask the new user to take a short digital literacy course, and then pass something like an exam. And then, of course, maybe we should have government programs like that. The case of Finland shows that, if the government intervenes and puts in place the right programs, fake news is a problem that can be solved. I hope that fake news is going to go the way of spam. It’s not going to be eradicated. Spam is still there, but it’s not the kind of problem that it was 20 years ago.

Laurel: And that’s media literacy. And even if it does take five years to eradicate this kind of disinformation or just improve society’s understanding of media literacy and what is disinformation, elections happen fairly frequently. And so that would be a great place to start thinking about how to stop this problem. Like you said, if it becomes like spam, it becomes something that you deal with every day, but you don’t actually think about or worry about anymore. And it’s not going to completely turn over democracy. That seems to me a very attainable goal.

Laurel: Dr. Nakov, thank you so much for joining us today on what’s been a fantastic conversation on the Business Lab.

Nakov: Thanks for having me.

Laurel: That was Dr. Preslav Nakov, a principal scientist at the Qatar Computing Research Institute, who I spoke with from Cambridge, Massachusetts, the home of MIT and MIT Technology Review, overlooking the Charles River.

That’s it for this episode of Business Lab. I’m your host, Laurel Ruma. I’m the Director of Insights, the custom publishing division of MIT Technology Review. We were founded in 1899 at the Massachusetts Institute of Technology. And you can find us in print, on the web, and at events each year around the world. For information about us and the show, please check out our website at technologyreview.com.

The show is available wherever you get your podcasts.

If you enjoyed this podcast, we hope that you’ll take a moment to rate and review us. Business Lab is a production of MIT Technology Review. This episode was produced by Collective Next.

This podcast episode was produced by Insights, the custom content arm of MIT Technology Review. It was not produced by MIT Technology Review’s editorial staff.
