One of the optional extras that Twitter allows is for each tweet to be tagged with the user’s location data. That’s useful if you want people to know where you are or so that you can later remember where certain events took place. It also gives researchers a valuable tool for studying the geographical distribution of tweets in various ways.
But it also raises privacy issues, particularly when users are unaware, or forget, that their tweets are geotagged. Various celebrities are thought to have given away their home locations in this way. And in 2007, four Apache helicopters belonging to the US Army were destroyed by mortars in Iraq when insurgents worked out their location using geotagged images published by American soldiers.
Perhaps these kinds of concerns are the reason why so few tweets are geotagged. Several studies have shown that less than one per cent of tweets contain location metadata.
But the absence of geotagging data does not mean your location is secret. Today, Jalal Mahmud and a couple of pals at IBM Research in Almaden, California, say they’ve developed an algorithm that can analyse anybody’s last 200 tweets and determine their home city location with an accuracy of almost 70 per cent.
That could be useful for researchers, journalists, marketers and so on wanting to identify where tweets originate. But it also raises privacy issues for those who would rather their home location remained private.
Mahmud and co’s method is relatively straightforward. Between July and August 2011, they filtered the Twitter firehose for tweets that were geotagged with any of the biggest 100 cities in the US until they had found 100 different users in each location.
They then downloaded the last 200 tweets posted by each user, rejecting those who posted privately. That left them with over 1.5 million geotagged tweets from almost 10,000 people.
They then divided this data set in two, using 90 per cent of the tweets to train their algorithm and the remaining 10 per cent to test it.
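A 90/10 split of this kind can be sketched in a few lines of Python. This is only an illustration, not the team's code: the function name `split_train_test`, the random shuffle and the fixed seed are all assumptions, since the article does not say how the split was drawn.

```python
import random

def split_train_test(items, train_fraction=0.9, seed=0):
    """Shuffle the items and cut them into a training and a test portion.

    The shuffle and the seed are assumptions made for this sketch.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # reproducible shuffle
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Example: 100 placeholder items split into 90 for training and 10 for testing.
train, test = split_train_test(range(100))
```

The important property is that the two portions do not overlap, so the test tweets are ones the algorithm has never seen.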
The basic idea behind their algorithm is that tweets contain important information about the probable location of the user. For example, over 100,000 tweets in the dataset were generated by the location-based social networking site Foursquare and so contained a link that gave the exact location. And almost 300,000 tweets contained the names of cities listed in the US Geological Survey gazetteer.
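Spotting city names in a tweet is the simplest of these signals to picture. The sketch below is an assumption-laden illustration, not the researchers' method: the four-entry `GAZETTEER` list stands in for the real gazetteer, and the helper `find_city_mentions` is a hypothetical name.

```python
import re

# Tiny stand-in for a real gazetteer (illustrative entries only).
GAZETTEER = ["Boston", "New York", "San Jose", "Chicago"]

# Longest names first so that multi-word names are matched whole.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(c) for c in sorted(GAZETTEER, key=len, reverse=True)) + r")\b",
    re.IGNORECASE,
)
_CANONICAL = {city.lower(): city for city in GAZETTEER}

def find_city_mentions(tweet):
    """Return the gazetteer cities named in the tweet, in order of appearance."""
    return [_CANONICAL[m.group(1).lower()] for m in _PATTERN.finditer(tweet)]

mentions = find_city_mentions("Flying from boston to New York tonight")
```

A real system would also have to cope with ambiguous names, since many US cities share a name with a city in another state.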
Other tweets contained clues to their location in phrases such as “Let’s Go Red Sox”, a reference to the Boston-based baseball team. And Mahmud and co say that the distribution of tweets throughout the day is roughly constant across the US, shifted by time zone. So a user’s pattern of tweets throughout the day can give a good indication of which time zone they’re in.
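The time zone idea can be illustrated by sliding a typical daily activity curve against a user's hourly tweet counts and picking the shift that fits best. Everything in this sketch is assumed for illustration: the numbers in `REFERENCE_PROFILE` are invented, daylight saving time is ignored, and the dot-product score is just one simple way to compare two curves, not necessarily what the paper does.

```python
# Hypothetical relative tweet volume for each local hour (0 = midnight).
# These numbers are invented for the sketch, not taken from the paper.
REFERENCE_PROFILE = [2, 1, 1, 1, 1, 2, 3, 5, 6, 6, 6, 7,
                     8, 7, 6, 6, 7, 8, 9, 10, 11, 10, 7, 4]

# Standard-time UTC offsets for the four contiguous US time zones.
US_UTC_OFFSETS = {-8: "Pacific", -7: "Mountain", -6: "Central", -5: "Eastern"}

def estimate_utc_offset(utc_histogram, candidates=tuple(US_UTC_OFFSETS)):
    """Return the candidate UTC offset whose shifted profile best fits the user.

    utc_histogram[h] is the number of tweets the user posted in UTC hour h.
    """
    def score(offset):
        # Local hour = UTC hour + offset; compare counts with the profile there.
        return sum(count * REFERENCE_PROFILE[(hour + offset) % 24]
                   for hour, count in enumerate(utc_histogram))
    return max(candidates, key=score)

# A user whose activity follows the profile exactly, seen from Pacific time.
pacific_user = [REFERENCE_PROFILE[(hour - 8) % 24] for hour in range(24)]
zone = US_UTC_OFFSETS[estimate_utc_offset(pacific_user)]
```

Real users are noisier than this idealised one, which is why the timing of tweets is better treated as one clue among several than as a verdict on its own.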
So the question these guys set out to answer was whether it was possible to use this information to predict a user’s home location, a result they could test by matching it against the user’s geotagged metadata.
Mahmud and co used an algorithm known as a Naive Bayes Multinomial to do the number crunching. They trained it by feeding it the training dataset along with the geolocation data.
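For readers curious about what a multinomial Naive Bayes classifier does, here is a minimal sketch in plain Python. It is a toy under stated assumptions, not the team's implementation: the class name, the whitespace tokeniser, the add-one smoothing and the four example tweets are all invented for illustration.

```python
import math
from collections import Counter, defaultdict

class MultinomialNaiveBayes:
    """Toy multinomial Naive Bayes text classifier with additive smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                       # smoothing strength (assumed)
        self.class_doc_counts = Counter()        # documents seen per city
        self.word_counts = defaultdict(Counter)  # word counts per city
        self.vocabulary = set()

    def fit(self, documents, labels):
        for text, label in zip(documents, labels):
            self.class_doc_counts[label] += 1
            for word in text.lower().split():
                self.word_counts[label][word] += 1
                self.vocabulary.add(word)
        return self

    def predict(self, text):
        # Words never seen in training carry no evidence, so drop them.
        words = [w for w in text.lower().split() if w in self.vocabulary]
        total_docs = sum(self.class_doc_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_doc_counts.items():
            # Log prior plus the sum of smoothed log word likelihoods.
            score = math.log(doc_count / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in words:
                count = self.word_counts[label][word]
                score += math.log((count + self.alpha)
                                  / (total_words + self.alpha * vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Invented training tweets, labelled with a home city.
model = MultinomialNaiveBayes().fit(
    ["lets go red sox", "fenway park tonight red sox",
     "go yankees", "bronx subway yankees game"],
    ["Boston", "Boston", "New York", "New York"],
)
guess = model.predict("red sox win")
```

The classifier simply asks which city's tweets make the observed words most probable, which is why team names and place names carry so much weight.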
They then tested the algorithm on the remaining 10 per cent of the data to see whether it could predict the geolocation.
The results are interesting. They say that when they exclude people who are obviously travelling, their algorithm correctly predicts people’s home cities 68 per cent of the time, their home state 70 per cent of the time and their time zone 80 per cent of the time. And they say their algorithm takes less than a second to do this for any individual.
That could be a useful tool. Journalists, for example, could use it to determine which tweets were coming from a region involved in a crisis, such as an earthquake, and those that were just commenting from afar. Marketers might use it to work out the popularity of their products in certain cities.
And it also suggests ways that people can improve their privacy: by not mentioning their home location, of course.
Mahmud and co say their algorithm could do better in future. For example, they think they can get more fine-grained detail by searching tweets for mentions of local landmarks that can be pinpointed more accurately. Whether that turns out to be possible, we’ll have to wait and see.
An interesting corollary to all this is that our notion of privacy is more fragile than most of us realise. Just how we can strengthen and protect it should be the subject of considerable public debate.
Ref: arxiv.org/abs/1403.2345 Home Location Identification of Twitter Users