TR: What happens when you add new features to the site?
MS: Adding or changing a feature can pretty dramatically affect the behavior of the user, which in turn has dramatic implications for the system architecture. I’ll give a very simple example. We added the “Like” feature in February of this year. It’s a single-button thumbs up so the user can say, “I like this thing.” There was a long debate internally about whether the “Like” feature was going to cannibalize commenting. It turned out to be additive; the commenting rate stayed the same and “Like” became one of the most common actions in the system.
This sounds really trivial, but one of the challenges of building complex, scalable systems is always that [it’s easier to retrieve data from a database than to store it there]. Every time I click on that “Like” button, we have to record that somewhere persistently. If [we built the system assuming that we’d be mostly retrieving data], we just blew that assumption by changing the features of the product. I think we try pretty hard to not be too set on any of those assumptions and be ready to revisit them as we change the core product. That’s pretty critical.
TR: And how about hooking these new features into the existing architecture?
MS: I think one of the most interesting things is that we can turn a feature on. Going from zero users to 300 million users in an afternoon for a brand-new feature is pretty crazy. And we can do that because, generally speaking, we share all of the infrastructure. You can turn it on and have it go from 1 percent adoption to 100 percent adoption in a day with little or no perceived downtime.
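Going from 1 percent to 100 percent of users is commonly handled with a percentage gate: each user is hashed into a stable bucket, and the feature is on for users whose bucket falls below the current threshold. The interview does not describe Facebook’s actual mechanism, so the sketch below is an assumption for illustration only; the function name `in_rollout`, the feature key `"like_button"`, and the SHA-256 bucketing scheme are all hypothetical.

```python
import hashlib


def in_rollout(feature: str, user_id: int, percent: float) -> bool:
    """Return True if user_id falls inside the first `percent` of buckets.

    Hypothetical sketch. Hashing the feature name together with the user ID
    gives each user a stable bucket per feature, so raising `percent` only
    adds users; nobody who already has the feature loses it.
    """
    key = f"{feature}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Map the first 8 bytes of the hash onto 10,000 buckets (0.01% steps).
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < percent * 100


if __name__ == "__main__":
    users = range(100_000)
    for pct in (1, 10, 50, 100):
        enabled = sum(in_rollout("like_button", u, pct) for u in users)
        print(f"{pct:>3}% target -> {enabled} of {len(users)} users enabled")
```

Because the bucket depends only on the feature and the user, turning the dial from 1 to 100 is a configuration change rather than a deploy, which fits the “turn it on in an afternoon” description above.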
TR: But you don’t just have a problem with change and complexity; there’s also the issue of storage. Facebook serves tons of photos. Was that system always built to scale?
MS: Now especially, with camera phones and direct integration via [smartphone applications], there’s just a tremendous wealth of photos being uploaded and shared on the site. We built the first version of our photo storage using off-the-shelf network-attached storage devices with Web servers in front of them. That was functional but not functional enough, and it was also expensive. We did some tuning on that system to improve the performance and got it five or six times faster than the original version. Then we went and built our own storage system called Haystack that’s completely built on top of commodity hardware. It’s all SATA drives and an Intel box with a custom stack on top of it that allows us to store and then serve the photos from the storage tier. That’s significantly faster than the off-the-shelf solutions and also significantly cheaper. We’ve invested a lot of energy in storing photos because the scale is just astounding.
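The interview doesn’t go into Haystack’s internals. As a hedged illustration of one way a custom stack on commodity disks can beat file-per-photo storage, the toy below packs many photos into a single append-only file and keeps an in-memory map from photo ID to byte offset, so a read costs one seek and one read rather than a chain of filesystem metadata lookups. Facebook’s later public descriptions of Haystack follow a broadly similar idea, but the class name `AppendOnlyPhotoStore` and its layout are hypothetical; this is not Facebook’s code, and it omits durability of the index, deletes, and replication.

```python
import os
import tempfile
from typing import Dict, Tuple


class AppendOnlyPhotoStore:
    """Toy store: many photos in one large file plus an in-memory index.

    Hypothetical sketch. Each photo is appended to a single data file, and
    a dict maps photo_id -> (offset, length). The index lives only in
    memory here; a real system would persist or rebuild it on restart.
    """

    def __init__(self, path: str) -> None:
        self._index: Dict[int, Tuple[int, int]] = {}
        # "a+b": create if missing, writes always append, reads allowed.
        self._f = open(path, "a+b")

    def put(self, photo_id: int, data: bytes) -> None:
        # Record where this photo starts (current end of file), then append.
        self._f.seek(0, os.SEEK_END)
        offset = self._f.tell()
        self._f.write(data)
        self._f.flush()
        self._index[photo_id] = (offset, len(data))

    def get(self, photo_id: int) -> bytes:
        # One index lookup, one seek, one read; KeyError if the ID is unknown.
        offset, length = self._index[photo_id]
        self._f.seek(offset)
        return self._f.read(length)

    def close(self) -> None:
        self._f.close()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        store = AppendOnlyPhotoStore(os.path.join(tmp, "volume.dat"))
        store.put(1, b"\xff\xd8fake-jpeg-1")
        store.put(2, b"\xff\xd8fake-jpeg-2")
        print(store.get(2))
        store.close()
```

The design trade-off is that appends are cheap and reads are a single disk access, while deleting or rewriting a photo requires extra bookkeeping (for example, marking entries dead and compacting later).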
TR: Do you always know that you’re going to be able to pull off the changes you try to make to the architecture?
MS: There have been a couple of cases where we’ve taken on a project where we weren’t actually sure we could do it; there’s one I can’t talk about because we’ll announce it later in the year. There are cases where we’re going to try to do something that lots of other people have tried before, but we think we can do it better. I think the courage and the willingness to make the investment are actually the most critical parts of this, because without that, all the great planning in the world isn’t going to get you there.