“The advent of the internet and the subsequent information explosion has made it increasingly challenging for journalists to produce news accurately and swiftly.” So begin the research and development team at the global news agency Reuters in a paper on the arXiv this week.
For Reuters, the problem has been made more acute by the emergence of fake news as an important factor in distorting the perception of events.
Nevertheless, news agencies such as the Associated Press have moved ahead with automated news writing services. These report standard announcements such as financial news and certain sports results by pasting the data into pre-written templates: “X reported profit of Y million in Q3, in results that beat Wall Street forecasts ... ”
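To make the template idea concrete, here is a minimal sketch in Python. It is not the Associated Press's actual system: the template wording simply mirrors the example above, and the function name, the company, and the figures are all hypothetical.

```python
from string import Template

# Hypothetical template; the wording mirrors the example quoted in the article.
EARNINGS_TEMPLATE = Template(
    "$company reported profit of $profit million in $quarter, "
    "in results that $verdict Wall Street forecasts."
)


def write_earnings_line(company, profit_millions, forecast_millions, quarter):
    """Fill the template from structured data; the verb depends on the forecast."""
    if profit_millions > forecast_millions:
        verdict = "beat"
    elif profit_millions < forecast_millions:
        verdict = "missed"
    else:
        verdict = "matched"
    return EARNINGS_TEMPLATE.substitute(
        company=company,
        profit=f"{profit_millions:g}",  # 120 -> "120", 95.5 -> "95.5"
        quarter=quarter,
        verdict=verdict,
    )


# Made-up company and numbers, for illustration only.
print(write_earnings_line("Acme Corp", 120, 100, "Q3"))
```

The point of the design is that the prose never changes; only the data slots and one conditional word do, which is why this approach suits routine announcements and little else.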
So there is significant pressure on other news agencies to automate news production. And today, Reuters outlines how it has almost entirely automated the identification of breaking news stories. Xiaomo Liu and pals at Reuters Research and Development and Alibaba say the new system performs well. Indeed, it has the potential to revolutionize the news business. But it also raises concerns about how such a system could be gamed by malicious actors.
The new system is called Reuters Tracer. It uses Twitter as a kind of global sensor that records news events as they are happening. The system then uses various kinds of data mining and machine learning to pick out the most relevant events, determine their topic, rank their priority, and write a headline and a summary. The news is then distributed around the company’s global news wire.
The first step in the process is to siphon the Twitter data stream. Tracer examines about 12 million tweets a day, 2 percent of the total. Half of these are sampled at random; the other half come from a list of Twitter accounts curated by Reuters’s human journalists. They include the accounts of other news organizations, significant companies, influential individuals, and so on.
The next stage is to determine when a news event has occurred. Tracer does this by assuming that an event has occurred if several people start talking about it at once. So it uses a clustering algorithm to find these conversations.
Of course, these clusters include spam, advertisements, ordinary chat, and so on. Only some of them refer to newsworthy events.
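The "several people talking at once" idea can be sketched in a few lines of Python. This is a deliberately crude stand-in, not the clustering algorithm Tracer uses: it assumes tweets arrive as (timestamp, user, text) tuples, groups them by hashtag only, and uses a made-up window and user threshold.

```python
from collections import defaultdict


def detect_bursts(tweets, window_seconds=600, min_users=3):
    """Return hashtags used by at least `min_users` distinct users
    within some span of `window_seconds`.

    `tweets` is an iterable of (timestamp_seconds, user, text) tuples.
    The hashtag grouping and both thresholds are illustrative assumptions.
    """
    mentions = defaultdict(list)  # hashtag -> [(timestamp, user), ...]
    for ts, user, text in tweets:
        for word in set(text.lower().split()):
            if word.startswith("#") and len(word) > 1:
                mentions[word].append((ts, user))

    bursts = set()
    for hashtag, events in mentions.items():
        events.sort()
        start = 0
        for end in range(len(events)):
            # Slide the left edge forward until the window is short enough.
            while events[end][0] - events[start][0] > window_seconds:
                start += 1
            users = {user for _, user in events[start:end + 1]}
            if len(users) >= min_users:
                bursts.add(hashtag)
                break
    return bursts


# Made-up sample data: three different users mention #fire within two minutes,
# while a single account repeats #deal over several hours.
tweets = [
    (0, "a", "Smoke downtown #fire"),
    (60, "b", "#fire near the station"),
    (120, "c", "Anyone else see the #fire ?"),
    (130, "d", "Buy cheap watches #deal"),
    (5000, "d", "Buy cheap watches #deal"),
    (9000, "d", "#deal again"),
]
print(detect_bursts(tweets))
```

Counting distinct users rather than raw tweets means one account repeating itself does not form a cluster, although coordinated spam from many accounts still would, which is why a later filtering stage is needed.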
So the next stage is to classify and prioritize the events. Tracer uses a number of algorithms to do this. The first identifies the topic of the conversation. It then compares this with a database of topics that the Reuters team has gathered from tweets produced by 31 official news accounts, such as @CNN, @BBCBreaking, and @nytimes as well as news aggregators like @BreakingNews.
At this stage, the algorithm also determines the location of the event using a database of cities and location-based keywords.
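A keyword lookup of this kind might look something like the following Python sketch. The tiny gazetteer, the voting rule, and the function name are assumptions made for illustration; the article does not describe how Tracer's location database is structured.

```python
# Hypothetical mini-gazetteer: lowercase keyword -> canonical place name.
GAZETTEER = {
    "las vegas": "Las Vegas, US",
    "vegas": "Las Vegas, US",
    "london": "London, UK",
    "paris": "Paris, FR",
}


def guess_location(texts):
    """Return the place whose keywords appear in the most tweets, or None."""
    votes = {}
    for text in texts:
        lowered = text.lower()
        # A set ensures each place gets at most one vote per tweet.
        places = {place for key, place in GAZETTEER.items() if key in lowered}
        for place in places:
            votes[place] = votes.get(place, 0) + 1
    if not votes:
        return None
    return max(votes, key=votes.get)


print(guess_location([
    "Shots fired in Las Vegas",
    "Praying for Vegas",
    "Rain in London today",
]))
```

Plain substring matching is the weak point here: it cannot tell Paris, France, from Paris, Texas, and it will match keywords buried inside longer words, so a real system would need more context than this.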
Once a conversation or rumor has been identified as potential news, an important consideration is its veracity. To determine this, Tracer looks for the source by identifying the earliest tweet in the conversation that mentions the topic and any sites it points to. It then consults a database listing known producers of fake news, such as the National Report, or satirical news sites such as The Onion.
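The source check described above can be sketched in Python as follows. This is a guess at the general shape, not Tracer's code: it assumes each cluster is a list of (timestamp, text) tuples, that any shortened links have already been expanded, and that the blocklist is a simple set of domains. The second blocklist entry is a placeholder.

```python
import re
from urllib.parse import urlparse

# Illustrative blocklist; a real one would be curated and far longer.
UNRELIABLE_DOMAINS = {"theonion.com", "hoax-news.example"}

URL_PATTERN = re.compile(r"https?://\S+")


def linked_domains(text):
    """Extract the host names of any URLs in a tweet, without a leading 'www.'."""
    domains = set()
    for url in URL_PATTERN.findall(text):
        host = (urlparse(url).hostname or "").rstrip(".")
        if host.startswith("www."):
            host = host[4:]
        if host:
            domains.add(host)
    return domains


def check_source(cluster):
    """Find the earliest tweet in a cluster and flag it if it links to a listed domain.

    `cluster` is a non-empty list of (timestamp, text) tuples.
    """
    earliest_ts, earliest_text = min(cluster, key=lambda tweet: tweet[0])
    flagged = linked_domains(earliest_text) & UNRELIABLE_DOMAINS
    return {
        "earliest": earliest_ts,
        "suspect": bool(flagged),
        "domains": sorted(flagged),
    }


# Made-up cluster: the earliest tweet (timestamp 100) links to a satirical site.
cluster = [
    (200, "Is this real? https://www.theonion.com/story-1"),
    (100, "Wow https://www.theonion.com/story-1"),
    (300, "lol"),
]
print(check_source(cluster))
```

A lookup like this only catches sources that are already on the list, so a brand-new fake site would pass straight through.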
Finally, the system writes a headline and summary and distributes the news throughout the Reuters organization.
During trials, the Reuters team say, the system has performed well. “Tracer is able to achieve competitive precision, recall, timeliness, and veracity on news detection and delivery,” they say.
And they have stats to back this up. The system processes 12 million tweets every day, rejecting almost 80 percent of them as noise. The rest fall into about 6,000 clusters that the system categorizes as different types of news events. That’s all done by 13 servers running 10 different algorithms.
By comparison, Reuters employs some 2,500 journalists around the world who together generate about 3,000 news alerts every day, using a variety of sources, including Twitter. Of these, around 250 are written up as news stories.
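It is worth doing the arithmetic on these figures. The short Python calculation below uses only numbers quoted in this article, with one assumption: "almost 80 percent" is treated as exactly 80 percent, so the derived values are rough.

```python
# Figures quoted in the article.
tweets_examined = 12_000_000   # tweets Tracer examines per day
sample_percent = 2             # "2 percent of the total"
noise_percent = 80             # "almost 80 percent"; assumed to be exactly 80 here
clusters = 6_000
journalists = 2_500
alerts = 3_000
stories = 250

# Derived values (integer arithmetic where possible to avoid rounding noise).
total_tweets = tweets_examined * 100 // sample_percent          # implied daily Twitter volume
kept = tweets_examined * (100 - noise_percent) // 100           # tweets that survive the noise filter
tweets_per_cluster = kept / clusters
alerts_per_journalist = alerts / journalists
story_rate = stories / alerts

print(f"Implied total tweets per day: {total_tweets:,}")
print(f"Tweets kept after filtering:  {kept:,}")
print(f"Average tweets per cluster:   {tweets_per_cluster:.0f}")
print(f"Alerts per journalist a day:  {alerts_per_journalist:.1f}")
print(f"Alerts written up as stories: {story_rate:.1%}")
```

On these assumptions, the sample implies about 600 million tweets a day in total, roughly 2.4 million of Tracer's tweets survive filtering, and the average cluster holds about 400 of them. On the human side, that is 1.2 alerts per journalist per day, of which about 8 percent become stories.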
Reuters compared the stories that Tracer identifies with those that appear in the news feeds of organizations like the BBC and CNN. “The results indicate Tracer can cover about 70 percent of news stories with 2 percent of Twitter data,” say Liu and co.
And the system certainly works quickly. The team highlight the example of the Las Vegas shooting in October 2017, which left 58 people dead. A witness reported the incident at 1:22 a.m., which triggered a Tracer cluster. However, the cluster did not meet the system’s criteria for an event to be included in the news feed until 1:39 a.m. “Reuters reported the incident at 1:49 a.m.,” say Liu and co.
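The gaps between those three timestamps are easy to compute. The Python snippet below assumes all three times fall on the same night and in the same time zone, which the article implies but does not state; the variable names are ours.

```python
from datetime import datetime


def minutes_between(start, end):
    """Whole minutes from `start` to `end`, both given as 24-hour 'HH:MM' strings."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Times quoted in the article, written in 24-hour form.
first_witness_tweet = "01:22"
tracer_alert = "01:39"
reuters_report = "01:49"

detection_delay = minutes_between(first_witness_tweet, tracer_alert)
lead_over_report = minutes_between(tracer_alert, reuters_report)

print(f"Witness tweet to Tracer alert: {detection_delay} minutes")
print(f"Tracer alert to Reuters report: {lead_over_report} minutes")
```

So Tracer took 17 minutes to become confident enough to flag the event, and its alert came 10 minutes before the Reuters report.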
That’s interesting work that raises a number of questions, especially about how easy the system is to manipulate. It’s not hard to imagine malicious actors designing Twitter feeds with the specific intent of fooling Tracer.
But whether this system will be easier to game than the current one, in which humans are regularly tricked, is hard to say.
Then there is the role of humans in the news business. The future of news is clearly one of increasing automation. How humans fit in is yet to be determined.
Ref: arxiv.org/abs/1711.04068 : Reuters Tracer: Toward Automated News Production Using Large Scale Social Media Data