One of the defining features of social content is the way pictures, video and text is shared among many users. Inevitably, some content becomes more popular than others and this leads to cascades in which the number of reshares can be huge. While most pieces of media have only a few shares, some are reshared many millions of times.
So there is much interest in finding out how to predict something that is likely to be popular compared to something that is not. On the face of it, it’s easy to think that predicting the popularity of content is almost impossible. That’s because it depends on so many factors that are difficult to measure, such as the nature of the content and the connectivity of the people who see it.
Nevertheless, various teams have claimed to have found ways of predicting a post’s eventual popularity by analysing its popularity shortly after it is published. But given the absence of any reliable way of doing this on the web, you can judge for yourself just how well these mechanisms must work.
Today, we get a different take on the topic of predictability thanks to the work of Justin Cheng at Stanford University in California as well as a few pals at Facebook and Cornell University. These guys show why popularity is so difficult to predict using the conventional approach of studying the early stages of popularity.
But at the same time, they show that various characteristics of a cascade can be predicted with remarkable accuracy and that this can be used to make successful judgments about the future behaviour of cascades once they have started. The result is a much deeper insight into the nature of cascades than might initially be thought possible.
Cheng and co come to their conclusions by analysing the way photographs were shared on Facebook over a 28 day period following their initial upload in June 2013. The looked over 150,000 photos which were together reshared over 9 million times. The data told them which people (nodes) reshared each photograph and at what time and this allowed them to exactly reconstruct the networks through which the reshares occurred.
In the past, researchers have looked at how large cascades begin and then attempted to use that information to spot large cascades in the future, with mixed results.
Cheng and co take a different approach. They start with a photo that has been reshared a certain number of times, say k. They then determine the likelihood that this photo will be shared twice as many times. In other words, their task is to predict whether the cascade will double in size.
That’s a good choice of question because the distribution of cascade size follows a certain kind of power law. This law ensures that for cascades of a given size, half will more than double in size while the other half will not. So in deciding whether a given cascade will double, a random guess will get the right answer about half of the time.
The question is whether it’s possible to pick out features from the dataset that allow a machine learning algorithm to do better than this. So Cheng and pals use a portion of their data to train a machine learning algorithm to search for features of cascades that make them predictable.
These features include the type of image, whether a close-up or outdoors or having a caption and so on; the number of followers the original poster has; the shape of the cascade that forms, whether a simple star graph or more complex structures; and finally how quickly the cascade takes place, its speed.
Having trained their algorithm, they used it to see whether it could make predictions about other cascades. They started with images that had been shared only five times, so the question was whether they would eventually be shared more than 10 times.
It turns out that this is surprisingly predictable. “For this task, random guessing would obtain a performance of 0.5, while our method achieves surprisingly strong performance: classification accuracy of 0.795,” they say.
And some features of the cascade a much better predictors and others. In fact, temporal performance of the cascade, how fast it spreads, is the best indicator of all. So something spreads quickly to start with, it is likely to spread more.
Another important factor is the topics mentioned in the caption associated with a picture, for example whether newsworthy or associated with a current meme.
Cheng and co also say that it becomes easier to make a prediction as the number of re-shares increases. “This demonstrates that more information is always better: the greater the number of observed reshares, the better the prediction,” they say.
And that’s why previous efforts have largely failed–they always start with too little information.
There are limitations to the work, of course. The most obvious is that it was done only with photos shared entirely within Facebook. It may be that reshares on Facebook are somehow different from those that occur elsewhere on the web and that photos are treated differently from story links, for example.
But Cheng and co are confident that much of what they found will be useful elsewhere. “Despite these limitations, we believe the results give general insights which will be useful in other settings,” they say.
And it leaves much of interest for other researchers to pursue. Cheng and co have stumbled onto a rich vein of insight into the nature of cascades on social networks. And there’s more gold them thar hills.
Ref: arxiv.org/abs/1403.4608 Can Cascades be Predicted?