One of the interesting features of modern news coverage is the way that stories spread across the web. Many research teams have attempted to model this process, likening it, among other things, to the spread of flu, fashions and forest fires.
One of the fundamental insights that these studies have produced is why these phenomena spread in similar ways. These phenomena do not share similar physical properties: a flu virus is not much like a burning leaf or a designer dress.
What these things have in common is the networks on which they spread. It is their environment and the way it is linked together that determines how these events cascade.
The key point is that the network of links between trees is similar to the network of contacts between people and the network of links between websites. Because of this, the properties of one network can reasonably be assumed to apply to the others.
Today, Felix Biessmann at the Berlin Institute of Technology in Germany and a few pals study the problem of trend setting among news sites. The question they attempt to answer is which websites lead the news coverage and which ones merely follow.
Their approach is essentially to take a snapshot of the words generated by a group of websites at any instant in time and compare it to the words generated by one of these websites at an earlier time.
This allows them to calculate whether the content of this single website is a good predictor of future content on other websites. In other words, whether it is a trend setter. They then rank the websites according to this metric.
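To make the idea concrete, here is a minimal Python sketch of a lagged-similarity score. It is a simplification for illustration only and should not be read as the authors' actual algorithm. The array layout, the function names `cosine` and `trend_setter_scores`, the `lag` parameter and the toy data are all assumptions introduced here.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between two word-count vectors (0.0 if either is empty)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


def trend_setter_scores(counts, lag=1):
    """Score each site by how well its earlier words match everyone else's later words.

    counts: array of shape (n_sites, n_steps, n_words) holding word counts
            per site and time step (a hypothetical layout, not the paper's).
    lag:    how many time steps the single site is looked at in advance.
    """
    n_sites, n_steps, _ = counts.shape
    scores = np.zeros(n_sites)
    for s in range(n_sites):
        # Pool the word counts of all the other sites at each time step.
        others = np.delete(counts, s, axis=0).sum(axis=0)
        # Compare this site's words at time t - lag with the pool at time t.
        sims = [cosine(counts[s, t - lag], others[t]) for t in range(lag, n_steps)]
        scores[s] = np.mean(sims)
    return scores


# Toy data: 3 sites, 4 time steps, a 5-word vocabulary.
# Site 0 uses word t at time t; sites 1 and 2 use the same word one step later.
n_sites, n_steps, n_words = 3, 4, 5
counts = np.zeros((n_sites, n_steps, n_words))
for t in range(n_steps):
    counts[0, t, t] = 1.0
    counts[1, t, (t - 1) % n_words] = 1.0
    counts[2, t, (t - 1) % n_words] = 1.0

scores = trend_setter_scores(counts, lag=1)
ranking = np.argsort(-scores)  # site indices, best predictor first
print(scores, ranking)
```

In this toy setup site 0 comes out on top, because its words at one step are exactly what the other two sites publish at the next. Real data would need stop-word removal, a choice of time window and some way of handling several lags at once, none of which this sketch attempts.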
The results are unsurprising. They monitored 96 technology news websites throughout 2011, a process that generated data on some 100,000 words (after common words had been removed).
Their list of the top trend setters in technology news coverage is clearly a list of the biggest and most popular technology news sites on the web.
One problem with this approach is that it fails to differentiate between news generated by current events, such as an earthquake, a product launch or the death of a well-known figure like Steve Jobs, and news generated by old-fashioned journalistic legwork, like an investigation into child labour abuses or financial irregularities.
The difference is that a big earthquake would get significant media coverage whether or not a particular website covered it, whereas a journalistic exposé only gets coverage because of the legwork performed by a particular website.
This is related to another problem. One possibility is that the real trend setters may lie outside the group of 96 websites that these guys have monitored. For example, wire services such as the Associated Press and Reuters have a huge impact on the spread of news, and many of the bigger websites will have subscriptions to these services.
In this case, the trend setters are simply the ones who post the wire stories first or who post so many of them that they are first often enough to seem like trend setters.
Clearly there’s more work to be done in identifying trend setters. What’s interesting, however, is that the techniques these researchers and others have developed will have wider application. Trend setters in technology news play a similar role to the first victims in an epidemic who spread the disease, the match or lightning strike that triggers a forest fire and the fashionistas who set clothing trends.
Much work has gone into identifying these too. Perhaps a similar process might help identify the websites that act as ‘matches’ on the web, those sites that trigger each new wave of news.
Ref: arxiv.org/abs/1206.6388: Canonical Trends: Detecting Trend Setters in Web Data