U.S. researchers have released a new online program for automatically tagging images according to their content. In its first real-world test, the program processed thousands of publicly accessible images available on the photo-sharing site Flickr. At least one accurate tag was generated for 98 percent of all the pictures analyzed.
The new software, called ALIPR (Automatic Linguistic Indexing of Pictures), uses a combination of statistical techniques to process an image and assign it a batch of 15 words, arranged in order of perceived relevance. These words may refer to a specific object within the picture, such as a “person” or “car,” or to a more general theme, such as “outdoors” or “manmade.”
For humans, deciphering an image is deceptively simple. And yet for computers, which can sort through millions of text documents with blistering speed and accuracy, identifying the content of an image remains a devilishly difficult task.
“Recognizing what an image is about semantically is one of the most difficult problems in AI,” says Jia Li, a mathematician at Pennsylvania State University, in State College, who created the software with colleague James Wang, a member of the College of Information Sciences and Technology. “Objects in the real world are 3-D,” Li explains. “When showing up in an image, they can vary vastly in color, shape, gesture, size, and position, and a computer usually has no prior knowledge about the variations.”
Because a complex understanding of the world remains beyond the ability of computers, more-efficient vision-processing algorithms are needed to help them mimic human vision and intelligence.
ALIPR analyzes an image pixel by pixel and applies a novel statistical method to calculate the probability that a particular word may describe its content. This involves examining the distribution of color and texture within the image and comparing these features with a stored database of words and images. Li and Wang trained their program using a commercial database containing around 50,000 images that had already been tagged.
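The general idea can be sketched in a few lines of code. What follows is a deliberately toy illustration, not ALIPR's actual algorithm: it stands in a single made-up color/texture feature per image and a hypothetical Gaussian model per word (the `WORD_MODELS` table and its numbers are invented for the example), then ranks candidate words by how well each model explains the feature.

```python
import math

# Hypothetical per-word models learned from tagged training images:
# each word maps to the (mean, variance) of one image feature.
# Real systems like ALIPR model rich color and texture distributions.
WORD_MODELS = {
    "outdoors": (0.7, 0.05),
    "person":   (0.5, 0.10),
    "car":      (0.3, 0.08),
}

def gaussian_log_likelihood(x, mean, var):
    """Log-likelihood of feature value x under a 1-D Gaussian word model."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def rank_tags(feature, models, top_k=15):
    """Rank candidate words by how likely each describes the image feature."""
    scored = sorted(
        models,
        key=lambda w: gaussian_log_likelihood(feature, *models[w]),
        reverse=True,
    )
    return scored[:top_k]

# An image whose extracted feature is 0.65 looks most like "outdoors".
print(rank_tags(0.65, WORD_MODELS))
```

In the real system, the ordering produced by this kind of ranking is what yields the batch of 15 words arranged by perceived relevance.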
Recently, they tested ALIPR on 5,411 previously unseen images available on the popular picture-sharing site Flickr. For 51 percent of these images, the first word generated by ALIPR appeared in users' tags. The program also produced at least one accurate word 98 percent of the time. The researchers used only images that Flickr users had made public and that were also accessible through Flickr's own application programming interface.
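The two figures reported above correspond to two standard metrics: top-1 accuracy (the highest-ranked word matches a user tag) and any-match accuracy (at least one of the generated words matches). A minimal sketch of how such an evaluation could be scored, with invented example data:

```python
def evaluate(predictions, user_tags):
    """Score ranked tag predictions against human-assigned tags.

    predictions: one ranked list of predicted words per image
    user_tags:   one set of user-assigned tags per image
    Returns (top-1 accuracy, any-match accuracy).
    """
    top1 = sum(preds[0] in tags for preds, tags in zip(predictions, user_tags))
    any_hit = sum(bool(set(preds) & tags) for preds, tags in zip(predictions, user_tags))
    n = len(predictions)
    return top1 / n, any_hit / n

# Toy data: three images, two predicted words each.
preds = [["people", "sky"], ["car", "road"], ["dog", "grass"]]
tags = [{"people", "portrait"}, {"street", "road"}, {"cat", "sofa"}]
print(evaluate(preds, tags))
```

With 15 words per image, the any-match criterion is far easier to satisfy than the top-1 criterion, which is why the two reported numbers (51 percent versus 98 percent) differ so widely.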
Better image-recognition software could have a range of applications, Li says. It could, for example, improve Internet search engines or automatically tag digital-image collections. Li believes it might also help scientists sort through large amounts of visual information: “Image classification is sometimes a need in scientific study. Without computer assistance, researchers have to manually classify images, and this process can be slow and fall behind the high throughput of new images.”
The underlying algorithms could perhaps lend themselves to various other difficult computing tasks. “Similar approaches can be applied to video analysis and possibly other problems,” Li adds.
Luis von Ahn, an assistant professor of computer science at Carnegie Mellon University, in Pittsburgh, PA, says the research is a "step in the right direction" but that the software's accuracy rate must be improved. He notes that images on sites like Flickr often contain very similar material. "The truth of the matter is that these images are largely all about the same thing: people mostly take pictures of other people," he says. "So just using the word 'people' already tags a large percentage of the images correctly."
Von Ahn also believes that humans could play a greater role in training vision-recognition algorithms. He runs a site called Peekaboom that turns tagging images into a game for two online players. As an image is slowly revealed, each player must race to find the right tag for it. This helps train von Ahn’s software to identify images by focusing on key portions. So far, approximately 100,000 individual images have been classified using Peekaboom, von Ahn says.
Alexander Berg, a computer-vision expert at the University of California, Berkeley, agrees that humans could help computers understand complex data better. He suggests that the tags that appear on sites like Flickr and YouTube, as well as on many blogs and news websites, could prove crucial to this endeavor in the future. "In general, image and video search is an area due for major strides," Berg says. "More and more data is online with some amount of human labeling."
It’s an idea that is welcomed by Li: “The more reliable data we can access and use, the better.”