How to Use Twitter for Personal Data Mining

A free text editor and a backup tool is all we need to query our 140-character oeuvre.

Christopher Mimsarchive page

October 13, 2010

Your Twitter stream (and, soon, your history of Facebook wall posts) constitutes a rich source of information about you, or, considering that most Tweets are public, just about anyone else. It includes everything from your speech patterns and the topics you obsess over to the identity of your “real” friends–at least in the tweet-o-sphere. The trick is unlocking it.

The first step to jacking into your subconscious is to download a representative sample of your tweets. Fortunately, Twitter stores the last 3,200 of them, which is probably more than you’ve ever spouted, unless you’re especially prolific.

One easy way to download all those tweets is to sign up for a service like BackupMy.Net, which will grab all available tweets (up to that 3,200 limit) and allow you to download them in a number of different formats.

Step two, get yourself a free copy of TextWrangler (if you’re on a Mac) or its Windows equivalent. If you’re on Linux, vi or emacs can do everything I’m about to describe.

Step three, start filtering your corpus of tweets. Think of it as a mound of wet clay. You can ask it any questions you like. Here’s what I asked mine; if you have other ideas, leave them in the comments.

i) Delete every other line in order to eliminate the noise of all those time and date stamps. TextWrangler has a powerful “Process Lines Containing…” feature under the “Text” menu. It allows you to, in this case, delete every line containing the string “+0000” which I’m guessing is the blank variable where geo-location data would normally go. If your Tweets are geo-located, just have it search for every year in which you tweeted, assuming you don’t often drop years into your tweets.

This:

Becomes this:

Now paste the resulting text into Wordle, the word cloud generator, because word clouds are just pretty ways to visualize word frequency.

For a first approximation, it’s not bad - some of your interests and friends can be seen trying to peek out around garbage words we all use on Twitter, like “RT”

ii) Using a simple find and replace command, get rid of RT, via and any other words that don’t tell you anything. Now you can see what you’re obsessed with. I happened to be obsessed with climate change, news buzzwords (“new,” “now”) and, apparently, signaling sarcasm and ironic distance by beginning sentence fragments with the word “apparently.”

iii) For a purer distillate of your own speech patterns and neuroses, use the “Process Lines Containing…” command or its equivalent to remove all lines that are retweets.

iv) If you want to find out who your “real friends” are on Twitter (and, to a lesser extent, what you talk to them about) extract all the lines that have an @ in them.

If you want to know what you say when you’re talking to your friends, simply delete every word beginning with an @. Or sort the document so you can grab only every conversation you’ve ever had with a particular person, and turn *that* into a word cloud.

v) If you want to know you talk when you’re talking about yourself, simply extract every line with the (case-sensitive) string “I “

There are countless tools on the web for analyzing your twitter stream, from frequency analysis to emotional content, but so far as I know, downloading all your tweets and parsing them yourself is the only way to visualize their actual content with this level of specificity. I’m sure there are dozens of queries I haven’t even thought of that are possible with this method - feel free to leave your ideas in the comments.

Follow Mims on Twitter or contact him via email.

Keep Reading

Large language models can do jaw-dropping things. But nobody knows exactly why.

And that's a problem. Figuring it out is one of the biggest scientific puzzles of our time and a crucial step towards controlling more powerful future models.