A New Dimension for Your Photos
Web service Fotowoosh wants to be the Flickr of 3-D.
Looking at the photo prints from your Washington, D.C., vacation can prompt memories of being at real, three-dimensional places like the Lincoln Memorial. But what if you could actually walk into your photograph and stand at Lincoln’s feet all over again–or at least zoom inside a 3-D version of your image on a computer screen? A new Web service called Fotowoosh promises to deliver such an experience, courtesy of computer-vision researchers at Carnegie Mellon University, in Pittsburgh.
Derek Hoiem, a doctoral candidate at Carnegie Mellon’s Robotics Institute, has spent the past year and a half figuring out how to get software to convert flat images into 3-D virtual-reality models that can be manipulated on-screen. Working with faculty members Alexei Efros and Martial Hebert, Hoiem came up with a machine-learning system that identifies various surfaces and their orientations based on what it has learned from examining previous photos. In essence, Fotowoosh frees the person viewing a photograph from the photographer’s point of view so that he or she can explore perspectives other than the one the camera actually captured.
Now Freewebs, a Silver Spring, Maryland, company that hosts 14 million personal websites, is about to launch a consumer version of Hoiem’s software on the Web. Freewebs president Shervin Pishevar says that he hopes Web users will upload thousands of photographs to Fotowoosh and share the 3-D versions with other visitors, making the service into what he calls “a 3-D Flickr.” Flickr is, of course, Yahoo’s highly popular photo-sharing and social-networking site.
A test version of the Fotowoosh system will be launched in May, Pishevar says. The system works best on outdoor images. Converted photos look a bit like the illustrations in children’s pop-up books: there’s an obvious “ground” corresponding to the flat page in a pop-up book, and vertical surfaces stand at right angles to the ground, representing objects such as walls, trees, and vehicles. The images appear inside a Web page loaded with a special viewer with controls for zooming, panning, and rotating the 3-D model. While the software literally adds a new dimension to old tourist photos, in the future it could also be used for purposes such as robot navigation or building photorealistic 3-D virtual worlds.
Hoiem says the software mimics some of the tricks our brains use to give depth to the two-dimensional images constantly landing on our retinas. Traditional (nonstereoscopic) cameras have only one “eye,” compared with our two. That means they can’t take advantage of parallax–the phenomenon in which our right and left eyes see nearby objects in slightly different positions relative to objects farther away–to get a stereo image. In fact, it’s mathematically impossible for software to compute the shape of a 3-D scene from a single two-dimensional image with 100 percent confidence, since the objects in the scene could theoretically be any distance away. “But people can do it,” notes Hoiem. “It’s just that there is not a simple algebraic solution.”
In fact, parallax isn’t strictly required for 3-D vision: if you shut one eye, the world doesn’t go flat. The brain infers depth using all sorts of cues such as shading, color, motion, and our learned experience about the spatial relationships between floors and walls, or between streets and buildings. “It turns out that using a fairly simple model–thinking of the world in terms of a ground surface, vertical surfaces that stick up out of it, and the sky–you can create pretty compelling 3-D models,” says Hoiem.
The software that he, Efros, and Hebert developed begins the conversion by trying to assign each pixel of the two-dimensional image to one of these classes. Sky is usually the easiest–it’s blue or white. Most photos are shot with the camera held roughly level, so the horizon runs parallel to the frame’s top and bottom edges, which helps the software locate the ground plane. And the windows of a multistory building are often arranged in parallel lines that share a common vanishing point–a strong indication of a vertical surface.
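The vanishing-point cue comes down to simple coordinate geometry: two parallel building edges, photographed in perspective, project to image lines that meet at a single point. The sketch below, with made-up coordinates, just intersects two exact lines; real detectors vote over many noisy line segments.

```python
# Toy illustration of the vanishing-point cue: two parallel building
# edges (e.g., window rows) project to image lines that converge on a
# common vanishing point. Intersecting the lines recovers that point.

def line_through(p, q):
    """Implicit line a*x + b*y = c through two points."""
    (x1, y1), (x2, y2) = p, q
    a, b = y2 - y1, x1 - x2
    return a, b, a * x1 + b * y1

def intersect(l1, l2):
    """Intersection of two implicit lines (assumes they are not parallel)."""
    a1, b1, c1 = l1
    a2, b2, c2 = l2
    det = a1 * b2 - a2 * b1
    return ((c1 * b2 - c2 * b1) / det, (a1 * c2 - a2 * c1) / det)

# Two window rows of a facade, converging toward the right of the image:
top = line_through((0.0, 100.0), (200.0, 140.0))     # slopes upward
bottom = line_through((0.0, 300.0), (200.0, 260.0))  # slopes downward
print(intersect(top, bottom))  # -> (500.0, 200.0), the vanishing point
```

A cluster of line segments all intersecting near one point is strong evidence that they lie on a single vertical surface seen in perspective.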
But Hoiem didn’t explicitly teach the software these rules. The system is based on machine-learning algorithms, meaning that it figures out its own rules of thumb by recognizing statistical patterns in hundreds of images in which the ground, sky, and vertical surfaces have been prelabeled by humans.
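The Carnegie Mellon system uses far richer color, texture, and geometric features than anything shown here, but the learn-from-labeled-examples idea can be sketched with a toy nearest-centroid classifier. All pixel values, features, and labels below are invented for illustration.

```python
# Toy sketch of learning surface labels ("sky", "ground", "vertical")
# from hand-labeled example pixels. This is an illustration of the idea
# only, not the actual Fotowoosh algorithm.

def features(r, g, b, y_frac):
    """Feature vector for one pixel: blueness, brightness, and vertical
    position in the image (0 = top of frame, 1 = bottom)."""
    brightness = (r + g + b) / 3.0
    blueness = b - (r + g) / 2.0
    return (blueness, brightness, y_frac)

def train_centroids(labeled_pixels):
    """Learn one mean feature vector ("centroid") per class from
    labeled examples -- nearest-centroid classification."""
    sums, counts = {}, {}
    for label, feat in labeled_pixels:
        acc = sums.setdefault(label, [0.0] * len(feat))
        for i, v in enumerate(feat):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {lbl: tuple(v / counts[lbl] for v in acc)
            for lbl, acc in sums.items()}

def classify(feat, centroids):
    """Assign the class whose centroid is nearest in feature space."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda lbl: dist2(feat, centroids[lbl]))

# Hand-labeled training pixels: (label, feature vector).
training = [
    ("sky",      features(120, 160, 240, 0.10)),
    ("sky",      features(200, 210, 255, 0.20)),
    ("ground",   features(110, 100,  80, 0.90)),
    ("ground",   features( 90,  85,  70, 0.95)),
    ("vertical", features(150, 140, 130, 0.50)),
    ("vertical", features( 80,  70,  60, 0.40)),
]
centroids = train_centroids(training)
print(classify(features(130, 170, 250, 0.15), centroids))  # prints: sky
```

The real system works at the level of image regions rather than single pixels and combines many more cues, but the training loop has the same shape: humans label example images, and the statistics of those labels become the classifier.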
“We didn’t have to start completely from scratch, fortunately,” says Hoiem. “There’s been a lot of work on how we represent color and texture and structure. There is an existing algorithm for recognizing the vanishing point of a group of lines. And people have worked a lot on recognizing objects like people or cars. But nobody had thought that maybe you can combine all of these and learn to recognize the actual geometry of a scene.”
Once Fotowoosh has identified the major surfaces in a scene, it joins them into a 3-D model in the Virtual Reality Modeling Language (VRML) file format. The software then peels off the corresponding parts of the two-dimensional image and pastes them onto the appropriate surfaces in the model, a process called texture mapping.
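To give a sense of what such a “pop-up” model looks like on disk, the sketch below emits a minimal VRML 2.0 scene containing one texture-mapped ground quad and one wall quad. The texture file names and coordinates are hypothetical; the actual pipeline derives both the geometry and the image crops automatically.

```python
# Minimal sketch of a "pop-up book" VRML 2.0 scene: a horizontal ground
# quad and a vertical wall quad, each texture-mapped with a (hypothetical)
# crop of the source photograph.

def textured_quad(corners, texture_file):
    """Return a VRML Shape node: one quad with an image texture."""
    coords = ", ".join("%g %g %g" % c for c in corners)
    return (
        "Shape {\n"
        "  appearance Appearance {\n"
        '    texture ImageTexture { url "%s" }\n'
        "  }\n"
        "  geometry IndexedFaceSet {\n"
        "    coord Coordinate { point [ %s ] }\n"
        "    coordIndex [ 0, 1, 2, 3, -1 ]\n"
        "    texCoord TextureCoordinate { point [ 0 0, 1 0, 1 1, 0 1 ] }\n"
        "    texCoordIndex [ 0, 1, 2, 3, -1 ]\n"
        "  }\n"
        "}\n"
    ) % (texture_file, coords)

# Ground plane lying flat (y = 0), wall standing upright behind it:
ground = textured_quad(
    [(0, 0, 0), (10, 0, 0), (10, 0, 10), (0, 0, 10)], "ground_crop.png")
wall = textured_quad(
    [(0, 0, 0), (10, 0, 0), (10, 5, 0), (0, 5, 0)], "wall_crop.png")
scene = "#VRML V2.0 utf8\n\n" + ground + wall
print(scene)
```

The `texCoord` entries are what "texture mapping" refers to: they pin the corners of each image crop to the corners of the corresponding 3-D surface.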
Currently, the finished models can only be viewed inside a Web browser equipped with a special extension for viewing VRML files. But in the beta version of Fotowoosh, due next month, the models will be displayed using the more common Flash format already included in most browsers, according to Pishevar. (The Fotowoosh home page includes a video demonstrating the end product for several sample images.)
Right now, the system isn’t very good at separating discrete objects that should be in the foreground, such as pedestrians in a street scene, from background surfaces, such as walls. But Hoiem is working on that. “In a year or possibly less, you’ll be able to take a photo of an alley with all sorts of cars and people, and create a 3-D model where those are all seen as separate 3-D foreground objects,” he says.
Some 17,000 people have joined the waiting list for the beta software, according to Pishevar. Visitors will be able to upload and convert their own photos into 3-D models and store them in a gallery area, and Freewebs members will be able to embed the models in their own Web pages.
Pishevar says that Freewebs approached Hoiem about commercializing his work because it “aligns very much with our vision of transforming the kinds of personal media that can live on a visual Web page. It’s one of those technologies that changes the way you see things in the world and what you think is possible.”
Eventually, Hoiem’s work could change the way robots use computer vision to navigate their way through obstacle-strewn environments. Hoiem says that he and his colleagues are also working on ways to create more-complicated 3-D models by processing multiple photographs of the same area. In addition, they’re working on the idea of animating 3-D scenes such as busy streets by predicting the directions pedestrians and cars would have moved in the several seconds following the click of the photographer’s shutter.
Because the Fotowoosh models adhere to a standard 3-D format, VRML, they could easily be imported into other 3-D applications, such as modeling software; immersive virtual worlds, such as Linden Lab’s Second Life; and “virtual globe” systems, such as Google Earth and Microsoft Virtual Earth. The ability to create texture-mapped 3-D buildings inside these worlds from a few two-dimensional photos would be a big advance over current methods, which involve constructing a 3-D model from blueprints or other data, then manually pasting photographs onto each side. “That’s just not scalable,” says Pishevar. “But if you have people uploading billions of 3-D pictures from around the world … then you could get to the point of building applications” that use the models to automatically fill out landscapes of virtual Earths.
Pishevar says that Freewebs will eventually provide an application programming interface, or API, that software developers can use to create just such “mashups.” Asked whether his company is already in discussions with the likes of Google or Linden Lab, Pishevar is coy: “We can’t actually comment on that right now.”