Mmmm, data science. (cc Evan Blaser)
(Below, I’ve included the whole of an email interview I conducted with Chen, which you might want to skip to if you’re looking for a general overview of his work. He reveals, among other things, that he’s considered mining Twitter data to see whether or not people eat fast food when they’re sad.)
Data science is so new that there are no textbooks on the subject, and no university curricula designed to turn out data scientists. Yet it’s integral to everything from quantitative trading on Wall Street to ad targeting on the web and the optimization of real-world supply chains.
Before he was mining terabytes of tweets for insights that could be turned into interactive visualizations, Chen honed his skills studying linguistics and pure mathematics at MIT. That’s typically atypical for data scientists, who tend to have backgrounds in mathematically rigorous disciplines, whatever those disciplines happen to be. (At Twitter, for example, all data scientists must have at least a Master’s in a related field.)
Here’s one of the wackier examples of the versatility of data science, from Chen’s own blog. In a post with the rousing title “Infinite Mixture Models with Nonparametric Bayes and the Dirichlet Process,” Chen delves into the problem of clustering. That is, how do you take a mass of data and sort it into groups of related items? It’s a tough problem – how many groups should there be? what are the criteria for sorting them? – and the details of how he tackles it are beyond readers who don’t have a background in this kind of analysis.
For the rest of us, Chen provides a concrete and accessible example: McDonald’s.
By dumping the entire menu of McDonald’s into his mathemagical sorting box, Chen discovers, for example, that not all McDonald’s sauces are created equal. Hot Mustard and Spicy Buffalo do not fall into the same cluster as Creamy Ranch, which has more in common with McDonald’s Iced Coffee with Sugar Free Vanilla Syrup than it does with Newman’s Own Low Fat Balsamic Vinaigrette.
Other clusters appear, including all the burger-y items, breakfast foods and sugary drinks. So far, not so surprising, until you get to the one cluster on McDonald’s menu that contains only one item.
What’s so special about McDonald’s Fruit & Maple Oatmeal? It’s probably its fiber content, relatively (I stress relatively) high levels of nutrients and lower levels of sugar, trans fat and cholesterol.
In other words, when one of Twitter’s newest data scientists applies his craft to McDonald’s menu, his algorithm automatically extracts the only food on it that any of us should probably even consider eating. Oatmeal: at McDonald’s it’s truly in a class of its own.
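For readers curious about how an algorithm can settle on the number of groups by itself, here is a minimal sketch of the Chinese restaurant process, the standard way of picturing the Dirichlet process named in the title of Chen’s post. It is not Chen’s code, and it covers only half of the story: it shows how items can be assigned to an open-ended number of clusters, while a full mixture model would also weigh how well each item (say, a menu entry’s nutrition facts) fits each cluster. The function name, the `alpha` parameter name and the seed are illustrative assumptions.

```python
import random
from collections import Counter


def chinese_restaurant_process(n_items, alpha, seed=0):
    """Assign n_items to clusters without fixing the cluster count up front.

    Item number i (counting from 0) joins existing cluster k with
    probability size_k / (i + alpha) and opens a brand-new cluster with
    probability alpha / (i + alpha). alpha must be positive; larger
    values favour more clusters.
    """
    rng = random.Random(seed)  # seeded so a run can be reproduced
    assignments = []           # assignments[i] = cluster label of item i
    sizes = []                 # sizes[k] = items currently in cluster k
    for _ in range(n_items):
        # One weight per existing cluster, plus alpha for "new cluster".
        weights = sizes + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(sizes):
            sizes.append(1)    # the item starts a new cluster
        else:
            sizes[k] += 1      # the item joins an existing cluster
        assignments.append(k)
    return assignments


if __name__ == "__main__":
    for alpha in (0.5, 2.0, 10.0):
        labels = chinese_restaurant_process(100, alpha, seed=42)
        # Report how many clusters emerged and the three largest ones.
        print(alpha, len(set(labels)), Counter(labels).most_common(3))
```

Because popular clusters attract new members in proportion to their size, a few large groups tend to form alongside some very small ones, which is at least consistent with a menu that yields a cluster containing a single item.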
Here’s the full interview with Chen:
1. How long have you been a data scientist at Twitter?
I’ve been at Twitter for about four months.
2. What does a data scientist at Twitter do?
We work on everything from building machine learning models and improving our large-scale data processing frameworks, to creating data visualizations, running statistical analyses, and finding better ways to understand our users and the Twitter graph. There’s a lot of variety, and it really depends on each person’s skills and interests.
At any given time, for example, I’m likely to be experimenting with new ad targeting algorithms, writing MapReduce jobs to mine terabytes of tweets (using Scalding, our in-house MapReduce language), building interactive visualizations to surface insights in all the data we gather, writing a report to explain some new findings, running an experiment on Mechanical Turk, and lots more.
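(To give a flavor of the MapReduce pattern Chen mentions, here is a toy sketch in plain Python rather than Scalding. It is not Twitter’s code: the food vocabulary, the function names and the in-memory “shuffle” step are all assumptions made for illustration, and a real job would run distributed across many machines.)

```python
from collections import defaultdict

# Hypothetical food vocabulary, chosen only for illustration.
FOOD_WORDS = {"burger", "fries", "oatmeal", "salad"}


def map_tweet(tweet):
    """Map step: emit a (food_word, 1) pair for each food word in a tweet."""
    for word in tweet.lower().split():
        word = word.strip(".,!?#@")  # drop common punctuation and tags
        if word in FOOD_WORDS:
            yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values under their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key, values):
    """Reduce step: collapse the values for one key into a total."""
    return key, sum(values)


def count_food_mentions(tweets):
    """Run map, shuffle and reduce in sequence over a list of tweets."""
    pairs = (pair for tweet in tweets for pair in map_tweet(tweet))
    return dict(reduce_counts(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    # Made-up sample tweets, not real data.
    sample = ["Burger and fries!", "More FRIES please", "Just coffee today"]
    print(count_food_mentions(sample))
```

The appeal of the pattern is that the map and reduce steps look only at one tweet or one key at a time, so the same logic can be spread over terabytes of input.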
3. Was your latest post (on clustering) inspired by something you’re working on at Twitter (that you can discuss)?
I’ve been doing some work on clustering our users and advertisers, automatically inferring topic categories in text, and thinking about what we can learn from food on Twitter (for example, do men and women, or San Franciscans and New Yorkers, differ in what they eat? is there any relationship between what people eat and what they tweet, e.g., are people more likely to eat junk food when they’re sad?). So while the post wasn’t directly inspired by what I’m working on at Twitter, it’s definitely related.
4. Data science is a thing now, but (I’ve been told) the field is “so new” that there are no textbooks or university courses specific to it. Do you agree / disagree?
I agree – but it depends on your definition of data science (which many people disagree on!). For me, data science is a mix of three things: quantitative analysis (for the rigor necessary to understand your data), programming (so that you can process your data and act on your insights), and storytelling (to help others understand what the data means). So useful skills for a data scientist to have could include:
* Statistics, machine learning (on the quantitative analysis side). For example, it’s impossible to extract meaning from your data if you don’t know how to distinguish your signals from noise. (I’ll stress, though, that I believe any kind of strong quantitative ability is fine – my own background was originally in pure math and linguistics, and many of the other folks here come from fields like physics and chemistry. You can always pick up the specific tools you’ll need.)
* General programming ability, plus knowledge of specific areas like MapReduce/Hadoop and databases. For example, a common pattern for me is that I’ll code a MapReduce job in Scala, do some simple command-line munging on the results, pass the data into Python or R for further analysis, pull from a database to grab some extra fields, and so on, often integrating what I find into some machine learning models in the end.
* Web programming, data visualization (on the storytelling side). For example, I find it extremely useful to be able to throw up a quick web app or dashboard that allows other people (myself included!) to interact with data – when communicating with both technical and non-technical folks, a good data visualization is often a lot more helpful and insightful than an abstract number.
While there aren’t many textbooks or courses that cover all three areas (one exception may be Jeff Hammerbacher and Mike Franklin’s course at Berkeley: http://datascienc.es/), there are of course resources that cover each skill alone. (Data visualization seems to continue to be an underappreciated skill, though, so classes in that area are more rare.)