How Facebook Copes with 300 Million Users

VP of Engineering Mike Schroepfer reveals the tricks that keep the world’s biggest social network going.

Erica Naone

September 22, 2009

Last week, the world’s biggest social network, Facebook, announced that it had reached 300 million users and was making enough money to cover its costs.

The challenge of dealing with such a huge number of users has been highlighted by hiccups suffered by some other social-networking sites. Twitter was beleaguered with scaling problems for some time and became infamous for its “Fail Whale”–the image that appears when the microblogging site’s services are unavailable.

In contrast, Facebook’s efforts to scale have gone remarkably smoothly. The site handles about a billion chat messages each day and, at peak times, serves about 1.2 million photos every second.

Facebook vice president of engineering Mike Schroepfer will appear on Wednesday at Technology Review’s EmTech@MIT conference in Cambridge, MA. He spoke with assistant editor Erica Naone about how the company has handled a constant flow of new users and new features.

Technology Review: What makes scaling a social network different from, say, scaling a news website?

Mike Schroepfer: Almost every view on the site is a logged-in, customized page view, and that’s not true for most sites. So what you see is very different than what I see, and is also different than what your sister sees. This is true not just on the home page, but on everything you look at throughout the site. Your view of the site is modified by who you are and who’s in your social graph, and it means we have to do a lot more computation to get these things done.

TR: What happens when I start taking actions on the site? It seems like that would make things even more complex.

MS: If you’re a friend of mine and you become a fan of the Green Day page, for example, that’s going to show up in my homepage, maybe in the highlights, maybe in the “stream.” If it shows me that, it’ll also say three of [my] other friends are fans. Just rendering that home page requires us to query this really rich interconnected dataset–we call it the graph–in real time and serve it up to the users in just a few seconds or hopefully under a second. We do that several billion times a day.

TR: How do you handle that? Most sites deal with having lots of users by caching–calculating a page once and storing it to show many times. It doesn’t seem like that would work for you.

MS: Your best weapon in most computer science problems is caching. But if, like the Facebook home page, it’s basically updating every minute or less than a minute, then pretty much every time I load it, it’s a new page, or at least has new content. That kind of throws the whole caching idea out the window. Doing things in or near real time puts a lot of pressure on the system because the live-ness or freshness of the data requires you to query more in real time.

We’ve built a couple systems behind that. One of them is a custom in-memory database that keeps track of what’s happening in your friends network and is able to return the core set of results very quickly, much more quickly than having to go and touch a database, for example. And then we have a lot of novel system architecture around how to shard and split out all of this data. There’s too much data updated too fast to stick it in a big central database. That doesn’t work. So we have to separate it out, split it out, to thousands of databases, and then be able to query those databases at high speed.
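The idea of splitting data across many databases and then querying them together can be illustrated with a small sketch. This is not Facebook's code: the modulo hashing scheme, the `ShardedActivityStore` class, and the shard count are all hypothetical, and in-memory dictionaries stand in for real databases.

```python
from collections import defaultdict

NUM_SHARDS = 8  # hypothetical; the interview mentions thousands of databases


def shard_for(user_id: int, num_shards: int = NUM_SHARDS) -> int:
    """Map a user ID to a shard index with simple modulo hashing (an assumption)."""
    return user_id % num_shards


class ShardedActivityStore:
    """Toy stand-in for many databases: one in-memory dict per shard."""

    def __init__(self, num_shards: int = NUM_SHARDS) -> None:
        self.num_shards = num_shards
        self.shards = [defaultdict(list) for _ in range(num_shards)]

    def record(self, user_id: int, timestamp: int, action: str) -> None:
        """Write an event to the single shard that owns this user."""
        shard = self.shards[shard_for(user_id, self.num_shards)]
        shard[user_id].append((timestamp, action))

    def recent_for_friends(self, friend_ids, limit: int = 10):
        """Gather events for a set of friends, newest first."""
        # Group friend IDs by shard so each shard is visited only once.
        by_shard = defaultdict(list)
        for fid in friend_ids:
            by_shard[shard_for(fid, self.num_shards)].append(fid)

        events = []
        for idx, ids in by_shard.items():
            shard = self.shards[idx]
            for fid in ids:
                events.extend((ts, fid, action) for ts, action in shard.get(fid, []))

        # Merge the per-shard results into one feed ordered by timestamp.
        events.sort(reverse=True)
        return events[:limit]
```

Each write touches one shard, while a home-page read fans out to every shard that holds one of the viewer's friends. That fan-out is why reads across thousands of databases have to be fast.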

TR: What happens when you add new features to the site?

MS: Adding or changing a feature can pretty dramatically affect the behavior of the user, which has pretty dramatic implications on the system architecture. I’ll give a very simple example. We added the “Like” feature in February of this year. It’s a single-button thumbs up so the user can say, “I like this thing.” There was a long debate internally about whether the “Like” feature was going to cannibalize commenting. It turned out to be additive; the commenting rate stayed the same and “Like” became one of the most common actions in the system.

This sounds really trivial, but one of the challenges of building complex, scalable systems is always that [it’s easier to retrieve data from a database than to store it there]. Every time I click on that “Like” button, we have to record that somewhere persistently. If [we built the system assuming that we’d be mostly retrieving data], we just blew that assumption by changing the features of the product. I think we try pretty hard to not be too set on any of those assumptions and be ready to revisit them as we change the core product. That’s pretty critical.

TR: And how about hooking these new features into the existing architecture?

MS: I think one of the most interesting things is that we can turn a feature on. Going from zero users to 300 million users in an afternoon for a brand-new feature is pretty crazy. And we can do that because, generally speaking, we share all of the infrastructure. You can turn it on and have it go from 1 percent adoption to 100 percent adoption in a day without much or any perceived downtime.
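One common way to move a feature from 1 percent to 100 percent of users is a percentage gate, sketched below. The interview does not say how Facebook implements this; the function names, the feature name, and the SHA-256 bucketing are assumptions made for illustration.

```python
import hashlib


def rollout_bucket(feature: str, user_id: int) -> int:
    """Deterministically place a user in one of 100 buckets for a feature."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def is_enabled(feature: str, user_id: int, percent: int) -> bool:
    """Return True if the user falls inside the current rollout percentage."""
    return rollout_bucket(feature, user_id) < percent


# Hypothetical usage: raise the percentage over the course of a day.
for percent in (1, 10, 50, 100):
    enabled = sum(is_enabled("like_button", uid, percent) for uid in range(10_000))
    print(f"{percent:>3}% rollout -> {enabled} of 10000 sample users enabled")
```

Because each user's bucket is fixed, anyone who sees the feature at 1 percent keeps seeing it as the percentage rises.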

TR: But you don’t just have a problem with change and complexity–there’s also the issue of storage. Facebook serves tons of photos. Was that system always built to scale?

MS: Now especially–with camera phones and direct integration via [smartphone applications]–there’s just a tremendous wealth of photos being uploaded and shared on the site. We built the first version of our photo storage using off-the-shelf network-attached storage devices with Web servers in front of them. That was functional but not functional enough, and it was also expensive. We did some tuning on that system to improve the performance and got it five or six times faster than the original version. Then we went and built our own storage system called Haystack that’s completely built on top of commodity hardware. It’s all SATA drives and an Intel box with a custom stack on top of it that allows us to store and then serve the photos from the storage tier. That’s significantly faster than the off-the-shelf solutions and also significantly cheaper. We’ve invested a lot of energy in storing photos because the scale is just astounding.
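The interview does not describe how Haystack works internally. As a rough illustration of one general technique for serving many small files cheaply, the sketch below appends photos to a single large blob and keeps a small in-memory index of where each one lives. The class and method names are hypothetical, and an in-memory buffer stands in for a file on a commodity disk.

```python
import io


class AppendOnlyPhotoStore:
    """Toy photo store: one big blob plus an in-memory (offset, size) index."""

    def __init__(self) -> None:
        self._blob = io.BytesIO()  # stand-in for a large file on disk
        self._index = {}           # photo_id -> (offset, size)

    def put(self, photo_id: str, data: bytes) -> None:
        """Append the photo bytes to the end of the blob and index them."""
        offset = self._blob.seek(0, io.SEEK_END)
        self._blob.write(data)
        self._index[photo_id] = (offset, len(data))

    def get(self, photo_id: str) -> bytes:
        """Look up the location in memory, then do a single seek and read."""
        offset, size = self._index[photo_id]
        self._blob.seek(offset)
        return self._blob.read(size)
```

The design choice being illustrated is that a read needs only one index lookup in memory and one positioned read, rather than a separate file-system lookup per photo.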

TR: Do you always know that you’re going to be able to pull off the changes you try to make to the architecture?

MS: There’s been a couple cases where we’ve taken on a project where we weren’t actually sure we could do it–there’s one I can’t talk about because we’ll announce it later in the year. There are cases where we’re going to try to do something that lots of other people have tried before, but we think we can do it better. I think the courage and the willingness to make the investment are actually the most critical parts of this, because without that, all the great planning in the world isn’t going to get you there.
