Working with Szeliski and a University of Washington professor named Steve Seitz, Snavely was intent on coding a way forward through a computationally forbidding challenge: how to get photos to merge, on the basis of their similarities, into a coherent 3-D model that human eyes could recognize as part of an authentic, real-world landscape. Moreover, the model should be one that users could navigate and experience spatially. Existing photo-stitching software used in electronic devices such as digital cameras knew how to infer relationships between images from the sequence in which they'd been taken. But Snavely was trying to develop software capable of making such assessments in a totally different way. He devised a two-step process: "In the first step, we identify salient points in all the 2-D images," he says. "Then we try and figure out which points in different images correspond to the same point in 3-D."
“The process,” Snavely says, “is called ‘structure from motion.’ Basically, a moving camera can infer 3-D structure. It’s the same idea as when you move your head back and forth and can get a better sense of the 3-D structure of what you’re looking at. Try closing one eye and moving your head from side to side: you see that different points at different distances will move differently. This is the basic idea behind structure from motion.”
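The head-moving intuition Snavely describes can be sketched in a few lines of Python. This is purely illustrative, not Photo Tourism's code: a pinhole camera projects two points at different depths, the camera slides sideways, and the nearer point shifts more in the image. That depth-dependent shift is the parallax signal that structure from motion exploits.

```python
# Illustrative sketch, not the actual Photo Tourism code: parallax under
# camera translation, the cue behind structure from motion.
def project(point, cam_x, focal=1.0):
    """Pinhole projection of a 3-D point, seen by a camera at (cam_x, 0, 0)."""
    x, y, z = point
    return focal * (x - cam_x) / z

near = (0.0, 0.0, 2.0)    # a point two units away
far = (0.0, 0.0, 10.0)    # a point ten units away
# Slide the camera half a unit sideways, like moving your head.
shift_near = project(near, 0.0) - project(near, 0.5)
shift_far = project(far, 0.0) - project(far, 0.5)
# The nearer point moves farther in the image: shift = focal * baseline / depth.
print(shift_near, shift_far)  # 0.25 versus 0.05
```

Because the image shift depends only on depth (for a sideways-moving camera), observing how much each point moves tells the software how far away it is.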
Computer vision, as Agüera y Arcas explains, benefits from a simple assurance: all spatial data is quantifiable. “Each point in space has only three degrees of freedom: x, y, and z,” he says.
Attributes shared by certain photos, he adds, help mark them as similar: a distinctively shaped paving stone, say, may appear repeatedly. When the software recognizes resemblances–the stone in this photo also appears in that one–it knows to seek further resemblances. The process of grouping together images on the basis of matching visual elements thus gathers steam until a whole path can be re-created from those paving stones. The more images the system starts with, the more realistic the result, especially if the original pictures were taken from a variety of angles and perspectives.
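The matching step Agüera y Arcas describes can be caricatured with toy data. In practice systems like this summarize each salient point as a numeric descriptor vector and pair up points whose descriptors are closest; the sketch below (hypothetical two-number descriptors, not Photosynth's matcher) shows the nearest-neighbor idea.

```python
# Toy sketch with made-up descriptors, not Photosynth's actual matcher:
# pair up salient points across two photos by descriptor similarity.
def match(descs_a, descs_b):
    """For each descriptor in photo A, find the closest descriptor in photo B."""
    def dist(d1, d2):
        return sum((a - b) ** 2 for a, b in zip(d1, d2))
    matches = []
    for i, da in enumerate(descs_a):
        j = min(range(len(descs_b)), key=lambda j: dist(da, descs_b[j]))
        matches.append((i, j))
    return matches

# Two photos of the same scene yield similar descriptors for the same feature.
photo_a = [(0.9, 0.1), (0.2, 0.8)]      # say, the paving stone, then a window
photo_b = [(0.25, 0.75), (0.85, 0.15)]  # the same two features, reordered
print(match(photo_a, photo_b))  # [(0, 1), (1, 0)]
```

Each confirmed pair, like the recurring paving stone, is a vote that the two photos depict the same place, which is what lets the grouping gather steam.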
That’s because the second computational exercise, Snavely says, is to compare images in which shared features are depicted from different angles. “It turns out that the first process aids the second, giving us information about where the cameras must be. We’re able to recover the viewpoint from which each photo was taken, and when the user selects a photo, they are taken to that viewpoint.” By positing a viewpoint for each image–calculating where the camera must have been when the picture was taken–the software can mimic the way binocular vision works, producing a 3-D effect.
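The payoff of knowing each camera's viewpoint can be shown in a stripped-down, two-dimensional sketch (an illustration under simplifying assumptions, not the real multi-view math): once two camera positions are recovered, the sight lines through a matched feature intersect at that feature's position in space.

```python
# Minimal 2-D sketch, not the production algorithm: with two known camera
# viewpoints, intersecting the rays through a matched feature recovers
# the feature's position in space.
def triangulate(cam_a, dir_a, cam_b, dir_b):
    """Intersect two rays cam + t * dir in the plane; return the meeting point."""
    (ax, ay), (dax, day) = cam_a, dir_a
    (bx, by), (dbx, dby) = cam_b, dir_b
    # Solve ax + t*dax = bx + s*dbx and ay + t*day = by + s*dby for t.
    denom = dax * dby - day * dbx
    t = ((bx - ax) * dby - (by - ay) * dbx) / denom
    return (ax + t * dax, ay + t * day)

# Two cameras one unit apart, each sighting the same feature.
point = triangulate((0.0, 0.0), (0.5, 1.0), (1.0, 0.0), (-0.5, 1.0))
print(point)  # (0.5, 1.0)
```

Repeated across thousands of matched features, this kind of intersection yields the cloud of 3-D points, and the recovered camera positions are what let the software fly the user to the exact viewpoint of any selected photo.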
As Szeliski knew, however, the human eye is the most fickle of critics. So he and his two colleagues sought to do more than just piece smaller parts into a larger whole; they also worked on transition effects intended to let images meet as seamlessly as possible. The techniques they refined include dissolves, or fades, the characteristic method by which film and video editors blend images.
In a demo that showed the Trevi Fountain in Italy, Photo Tourism achieved a stilted, rudimentary version of what Photosynth would produce: a point cloud assembled from images that represent different perspectives on a single place. More impressive was the software’s ability to chug through banks of images downloaded from Flickr based on descriptive tags–photos that, of course, hadn’t been taken for the purpose of producing a model. The result, Szeliski remembers, was “surprising and fresh” even to his veteran’s eyes.
“What we had was a new way to visualize a photo collection, an interactive slide show,” Szeliski says. “I think Photo Tourism was surprising for different reasons to insiders and outsiders. The insiders were bewildered by the compelling ease of the experience.” The outsiders, he says, could hardly believe it was possible at all.