Undercover Researchers Expose Chinese Internet Water Army

An undercover team of computer scientists reveals the practices of people who are paid to post on websites.

Emerging Technology from the arXiv

November 22, 2011

In China, paid posters are known as the Internet Water Army because they are ready and willing to ‘flood’ the internet for whoever is willing to pay. The flood can consist of comments, gossip and information (or disinformation) and there seems to be plenty of demand for this army’s services.

This is an insidious tide. Positive recommendations can make a huge difference to a product’s sales but can equally drive a competitor out of the market. When companies spend millions launching new goods and services, it’s easy to understand why they might want to use every tool at their disposal to achieve success.

The loser in all this is the consumer who is conned into making a purchase decision based on false premises. And for the moment, consumers have little legal redress or even ways to spot the practice.

Today, Cheng Chen at the University of Victoria in Canada and a few pals describe how Cheng worked undercover as a paid poster on Chinese websites to understand how the Internet Water Army works. He and his friends then used what he learnt to create software that can spot paid posters automatically.

Paid posting is a well-managed activity involving thousands of individuals and tens of thousands of different online IDs. The posters are usually given a task to register on a website and then to start generating content in the form of posts, articles, links to websites and videos, even carrying out Q&A sessions.

Often, this content is pre-prepared or the posters receive detailed instructions on the type of things they can say. And there is even a quality control team who check that the posts meet a certain ‘quality’ threshold. A post is not validated if it is deleted by the host or is composed of garbled words, for example.

Having worked undercover to find out how the system worked, Cheng and co then studied the pattern of posts that appeared on a couple of big Chinese websites: Sina.com and Sohu.com. In particular, they studied the comments on several news stories about two companies that they suspected of paying posters and who were involved in a public spat over each other’s services.

The Sina dataset consisted of over 500 users making more than 20,000 comments; the Sohu dataset involved over 200 users and more than 1000 comments.

Cheng and co went through all the posts, manually identifying those they believed were from paid posters, and then set about looking for patterns in their behaviour that can differentiate them from legitimate users. (Just how accurate their initial impressions were is a potential problem, they admit, but it is the same one that spam filters have to deal with.)

They discovered that paid posters tend to post more new comments than replies to other comments. They also post more often, with 50 per cent of them posting every 2.5 minutes on average. And they move on from a discussion more quickly than legitimate users, discarding their IDs and never using them again.
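To make these behavioural signals concrete, here is a minimal Python sketch of how they might be computed per user. It is an illustration only, not the researchers' code: the `Comment` record, its field names and the `user_features` function are all hypothetical, and timestamps are assumed to be in seconds.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Comment:
    # Hypothetical record layout; the real datasets may be structured differently.
    user: str
    timestamp: float  # seconds since an arbitrary reference point
    is_reply: bool    # True if the comment replies to another comment


def user_features(comments):
    """Compute simple per-user behavioural features from a list of comments."""
    by_user = {}
    for c in comments:
        by_user.setdefault(c.user, []).append(c)

    features = {}
    for user, items in by_user.items():
        times = sorted(c.timestamp for c in items)
        gaps = [later - earlier for earlier, later in zip(times, times[1:])]
        features[user] = {
            # Share of comments that are replies (the article suggests this is low for paid posters).
            "reply_ratio": sum(c.is_reply for c in items) / len(items),
            # Typical time between consecutive posts; None if the user posted only once.
            "median_gap_s": median(gaps) if gaps else None,
            # Time between first and last post, a rough proxy for how long the ID stayed active.
            "active_span_s": times[-1] - times[0],
            "n_comments": len(items),
        }
    return features
```

A user with a low reply ratio, a median gap of a few minutes and a short active span would match the profile described above, although any cut-off values would have to be tuned on labelled data.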

What’s more, the content they post is measurably different. These workers are paid by the volume and so often take shortcuts, cutting and pasting the same content many times. This would normally invalidate their posts but only if it is spotted by the quality control team.
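One simple way to spot this kind of cut-and-paste behaviour is to compare a user's posts with each other. The Python sketch below uses Jaccard similarity over character bigrams, which also works for Chinese text where words are not separated by spaces. This is a stand-in technique chosen for illustration; the article does not say which similarity measure the researchers used, and the function names and the 0.8 threshold are assumptions.

```python
def char_ngrams(text, n=2):
    """Return the set of character n-grams in text, ignoring whitespace."""
    cleaned = "".join(text.split())
    if len(cleaned) < n:
        return {cleaned} if cleaned else set()
    return {cleaned[i:i + n] for i in range(len(cleaned) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; defined here as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def duplicate_fraction(posts, threshold=0.8):
    """Fraction of posts that are near-duplicates of at least one other post.

    The threshold is an illustrative guess, not a value from the paper.
    """
    if not posts:
        return 0.0
    grams = [char_ngrams(p) for p in posts]
    flagged = 0
    for i, g in enumerate(grams):
        others = (h for j, h in enumerate(grams) if j != i)
        if any(jaccard(g, h) >= threshold for h in others):
            flagged += 1
    return flagged / len(posts)
```

A user whose `duplicate_fraction` is close to 1 is recycling almost everything they post. The pairwise comparison is quadratic in the number of posts, which is fine for a few hundred comments per user but would need something smarter at larger scale.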

So Cheng and co built some software to look for repetitions and similarities in messages as well as the other behaviours they’d identified. They then tested it on the dataset they’d downloaded from Sina and Sohu and found it to be remarkably good, with an accuracy of 88 per cent in spotting paid posters. “Our test results with real-world datasets show a very promising performance,” they say.

That’s an impressive piece of work and a good first step towards combating this problem, although they’ll need to test it on a much wider range of datasets. Nevertheless, these guys have the basis of a software package that will weed out a significant fraction of paid posters, provided these people conform to the stereotype that Cheng and co have measured.

And therein lies the rub. As soon as the first version of the software hits the market, paid posters will learn to modify their behaviour in a way that games the system. What Cheng and co have started is a cat and mouse game just like those that plague the antivirus and spam filtering industries.

And that means the battle ahead with the Internet Water Army will be long and hard.

Ref: arxiv.org/abs/1111.4297: Battling the Internet Water Army: Detection of Hidden Paid Posters
