Researchers at Stanford University have developed a Web service called Make3D that lets users turn a single two-dimensional image of an outdoor scene into an immersive 3-D model. This gives users an easy way to create a more realistic visual representation of a photo, one that lets viewers fly around the scene.
To convert the still images into 3-D visualizations, Andrew Ng, an assistant professor of computer science, and Ashutosh Saxena, a doctoral student in computer science, developed a machine-learning algorithm that associates visual cues, such as color, texture, and size, with certain depth values, based on what it has learned from studying two-dimensional photos paired with 3-D data. For example, says Ng, grass has a distinctive texture that makes it look very different close up than it does from far away. The algorithm learns that the progressive change in texture gives clues to the distance of a patch of grass.
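The learning step the researchers describe can be sketched in miniature: fit a model that maps local visual features of an image patch to a depth value, using examples where the true depth is known. This is a hypothetical illustration, not the Stanford code; the feature names, synthetic data, and the choice of a simple linear least-squares model are all assumptions made for clarity.

```python
# Minimal sketch (not the authors' code): learning to map local visual
# cues -- stand-ins for texture, color, and apparent size -- to depth,
# given patches paired with ground-truth depth measurements.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training data: each row holds three feature values for one
# image patch; each target is that patch's measured depth in meters.
features = rng.random((500, 3))
true_weights = np.array([4.0, -1.5, 7.0])          # unknown to the learner
depths = features @ true_weights + rng.normal(0, 0.1, 500)

# Fit a linear model by least squares -- the simplest stand-in for the
# learned feature-to-depth mapping the article describes.
weights, *_ = np.linalg.lstsq(features, depths, rcond=None)

# Predict depth for a new patch from its visual cues alone.
new_patch = np.array([0.5, 0.2, 0.8])
predicted_depth = new_patch @ weights
```

The real system uses far richer features and a more sophisticated model, but the shape of the idea is the same: depth is never measured at test time, only inferred from appearance.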
Larry Davis, a professor and chair of the computer-science department at the University of Maryland, in College Park, says that turning a single image into a 3-D model has been a hard and mathematically complicated problem in computer vision, and that even though Make3D gets things wrong, it often produces remarkable results.
Make3D is not the first site to extract a 3-D model from a single image. Researchers at Carnegie Mellon University (CMU) launched Fotowoosh in May 2007. (See “A New Dimension for Your Photos.”) But Fotowoosh’s algorithm is limited because it labels the orientation of surfaces as either horizontal or vertical, without taking into account such things as mountain slopes, rooftops, or even staircases, says Ng. Based on these restrictive assumptions, Fotowoosh infers the depth of a scene to reconstruct the image.
Derek Hoiem, who built Fotowoosh’s algorithm, admits that unlike Make3D, his work does not give a good estimation of depth. “I can say this point [in an image] is a lot further [away from the foreground] than that point, but I can’t say that this point is five meters away,” he explains. Hoiem developed Fotowoosh with CMU faculty members Alexei Efros and Martial Hebert. (Hoiem is currently a postdoctoral fellow at the Beckman Institute at the University of Illinois at Urbana-Champaign.)
Saxena says that by knowing the depth values of objects in a scene, Make3D creates a higher-quality 3-D model. In a survey performed by the Stanford researchers in 2006, users preferred Make3D's images twice as often as images created using the algorithm behind Fotowoosh.
Hoiem says that he has been trying to extend his work to deal with arbitrary angles, but he has yet to come up with a solution and is “impressed” with the Stanford researchers’ work.
To build Make3D’s algorithm, the Stanford researchers used a laser scanner to estimate the distance of every pixel or point in a two-dimensional image. That 3-D information was coupled with the image and reviewed by the algorithm so that it could learn to correlate visual properties in the image with depth values. For example, it will learn that a large blue patch is probably part of the sky and farther away, says Saxena. There are thousands of such visual properties that humans unconsciously use to determine depth. The Make3D algorithm learns these kinds of rules and processes images accordingly, he says.
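The data-collection step described above amounts to pairing each pixel of an image with a laser-measured depth, producing (feature, depth) training examples. The sketch below shows that pairing under stated assumptions: the image and depth map are random toy arrays, and the "feature" is just a pixel's color plus its row position, since height in the frame is itself a depth cue.

```python
# Minimal sketch (assumptions, not the Stanford pipeline): pairing a
# laser-scanned depth map with an image to build one training example
# per pixel.
import numpy as np

rng = np.random.default_rng(1)

h, w = 4, 6
image = rng.random((h, w, 3))            # toy RGB image
depth_map = rng.uniform(1, 80, (h, w))   # laser-measured depth per pixel (m)

training_set = []
for y in range(h):
    for x in range(w):
        # A real system would use rich texture/color/position features;
        # here the feature is the pixel's color plus its normalized row.
        feature = np.concatenate([image[y, x], [y / h]])
        training_set.append((feature, depth_map[y, x]))
```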
To process an image, the algorithm divides the still image into tiny pieces or segments, says Ng. “It tries to take each of these small pieces and simultaneously figure out their 3-D position, angle, and orientation in the image.”
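Ng's description of the inference step can be illustrated by dividing a depth estimate into small segments and fitting each one a plane, whose coefficients encode the segment's position and tilt. This is a hypothetical sketch, not Make3D's actual solver: the grid segmentation, the toy depth values, and the plane model depth = a·x + b·y + c are all assumptions made for illustration.

```python
# Minimal sketch (hypothetical, not Make3D's solver): split an image into
# small segments and fit each a 3-D plane, giving its position and
# orientation.
import numpy as np

rng = np.random.default_rng(2)

h, w, seg = 8, 8, 4                 # 8x8 depth estimate, 4x4-pixel segments
depth = rng.uniform(1, 50, (h, w))  # toy per-pixel depth estimates (m)

planes = {}
for sy in range(0, h, seg):
    for sx in range(0, w, seg):
        ys, xs = np.mgrid[sy:sy + seg, sx:sx + seg]
        # Fit depth ~ a*x + b*y + c by least squares; (a, b) encode the
        # segment's tilt (its angle), c its offset (its position).
        A = np.column_stack([xs.ravel(), ys.ravel(), np.ones(seg * seg)])
        coeffs, *_ = np.linalg.lstsq(
            A, depth[sy:sy + seg, sx:sx + seg].ravel(), rcond=None)
        planes[(sy, sx)] = coeffs
```

The real algorithm also enforces consistency between neighboring segments, which is what lets it recover slopes, rooftops, and staircases rather than treating every surface as flat-on.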
When a new image is uploaded to the site, it takes only a couple of minutes for the algorithm to reconstruct it as a 3-D model and make a movie of the scene. However, the website is not yet optimized, so it takes about an hour for the user to receive an e-mail message indicating that her visualizations are ready. A user can store images and movies in a personal gallery on the site. The researchers are working to connect their site to photo-sharing sites like Photobucket and Flickr, says Saxena.
Make3D can also take two or three images of the same location to create a 3-D model similar to Microsoft’s Photosynth application. (See “Microsoft’s Shiny New Toy.”) But Photosynth is a more expansive project that uses hundreds of images to reconstruct a scene, and when there are that many images to work with, computing the depth of scenes is not as mathematically complicated and is more accurate, says Hoiem. Make3D’s focus is on processing single images for the general consumer, who might only take one image of a scene, says Ng.
Alex Daley, the group product manager for Microsoft Live Labs, says that there is a complementary relationship between single-image processors and multiple-image processors: improving single-image processing will ultimately make it easier for other systems to match multiple photos together. “Mixing and matching these for the right set of images will provide the best set of results,” Daley adds. (He says that Microsoft is open to working with applications such as Make3D, but the company has not yet spoken with the Stanford researchers.)
Make3D’s current algorithm works only on outdoor scenes or landscapes and a few kinds of indoor scenes, such as those that focus on staircases, and it’s meant to help users share experiences or relive their own. The researchers are working to extend the algorithm to a broader range of settings so that it can recognize things like humans and coffee mugs and be used to create real-life environments for gaming and virtual worlds. Saxena is also working to incorporate the technology into robots to improve navigation and help them carry out tasks such as unloading a dishwasher.
CMU’s Efros says that the work provides a new perspective on the computer-vision problem and will hopefully result in a deeper understanding of how human vision functions.