
The Revolutionary Technique That Quietly Changed Machine Vision Forever

Machines are now almost as good as humans at object recognition, and the turning point occurred in 2012, say computer scientists.

In space exploration, there is the Google Lunar X Prize for placing a rover on the lunar surface. In medicine, there is the Qualcomm Tricorder X Prize for developing a Star Trek-like device for diagnosing disease. There is even an incipient Artificial Intelligence X Prize for developing an AI system capable of delivering a captivating TED talk.

In the world of machine vision, the equivalent goal is to win the ImageNet Large-Scale Visual Recognition Challenge. This is a competition that has run every year since 2010 to evaluate image recognition algorithms. (It is designed to follow on from a similar project called PASCAL VOC, which ran from 2005 until 2012.)

Contestants in this competition have two simple tasks. Presented with an image of some kind, the first task is to decide whether it contains a particular type of object or not. For example, a contestant might decide that there are cars in this image but no tigers. The second task is to find a particular object and draw a box around it. For example, a contestant might decide that there is a screwdriver at a certain position with a width of 50 pixels and a height of 30 pixels.
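For a sense of how that second, localization task is judged: a predicted box is scored by how much it overlaps the hand-drawn ground truth, conventionally measured as intersection-over-union. Here is a minimal sketch in Python (the screwdriver coordinates are invented for illustration):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x, y, width, height)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Width and height of the overlapping rectangle (non-positive if disjoint)
    iw = min(ax + aw, bx + bw) - max(ax, bx)
    ih = min(ay + ah, by + bh) - max(ay, by)
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union

# A 50 x 30 pixel screwdriver box, as in the example above
predicted = (100, 80, 50, 30)
ground_truth = (105, 82, 50, 30)
print(iou(predicted, ground_truth))  # about 0.72, a hit at the usual 0.5 threshold
```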

Oh, and one other thing: there are 1,000 different categories of objects ranging from abacus to zucchini, and contestants have to scour a database of over 1 million images to find every instance of each object. Tricky!

Computers have always had trouble identifying objects in real images, so it is not hard to believe that the winners of these competitions have always performed poorly compared with humans.

But all that changed in 2012, when a team from the University of Toronto in Canada entered an algorithm called SuperVision, which wiped the floor with the opposition.

Today, Olga Russakovsky at Stanford University in California and a few pals review the history of this competition and say that in retrospect, SuperVision’s comprehensive victory was a turning point for machine vision. Since then, they say, machine vision has improved at such a rapid pace that today it rivals human accuracy for the first time.

So what happened in 2012 that changed the world of machine vision? The answer is a technique called deep convolutional neural networks, which the SuperVision algorithm used to classify the 1.2 million high-resolution images in the dataset into 1,000 different classes.

This was the first time that a deep convolutional neural network had won the competition, and it was a clear victory. In 2010, the winning entry had an error rate of 28.2 percent; in 2011, the winner brought that down to 25.8 percent. But SuperVision won in 2012 with an error rate of only 16.4 percent (the second-best entry had an error rate of 26.2 percent). That emphatic victory ensured that the approach has been widely copied ever since.
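The figures quoted are top-5 error rates: an entry counts an image as correct if the true label appears anywhere among its five most confident guesses. A minimal sketch of the computation, with made-up labels for illustration:

```python
def top5_error(predictions, labels):
    """Fraction of images whose true label is missing from the five top-ranked guesses.

    predictions: one list of guessed labels per image, ordered from most to least confident.
    labels: the true label for each image.
    """
    misses = sum(1 for ranked, truth in zip(predictions, labels)
                 if truth not in ranked[:5])
    return misses / len(labels)

# Toy example with three images and invented labels
predictions = [
    ["tiger", "lion", "leopard", "jaguar", "cat"],
    ["abacus", "keyboard", "typewriter", "piano", "calculator"],
    ["zucchini", "cucumber", "squash", "pepper", "gourd"],
]
labels = ["cat", "calculator", "carrot"]
print(top5_error(predictions, labels))  # 1 miss out of 3 -> ~0.33
```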

Convolutional neural networks consist of several layers of small neuron collections, each of which looks at a small portion of an image. The results from all the collections in a layer are made to overlap to create a representation of the entire image. The next layer then repeats this process on the new image representation, allowing the system to learn about the makeup of the image at progressively higher levels of abstraction.
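To make that concrete, here is a toy two-layer convolutional network written in PyTorch. This is an illustrative sketch of my own, far smaller than SuperVision, but it shows the pattern: small filters slide over the image, and each layer works on the feature maps produced by the one before it.

```python
import torch
import torch.nn as nn

# A toy convolutional network (illustrative only; SuperVision is far larger).
net = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=5, stride=2, padding=2),   # 16 filters, each scanning 5x5 patches of the RGB image
    nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=5, stride=2, padding=2),  # the next layer scans the 16-channel feature map instead
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),  # collapse the spatial grid to one value per channel
    nn.Flatten(),
    nn.Linear(32, 1000),      # one score per ImageNet category
)

image = torch.randn(1, 3, 224, 224)  # a single 224 x 224 RGB image
scores = net(image)
print(scores.shape)  # torch.Size([1, 1000])
```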

Deep convolutional neural networks were invented in the early 1980s. But it is only in the last couple of years that computers have begun to have the horsepower necessary for high-quality image recognition.

SuperVision, for example, consists of some 650,000 neurons arranged in five convolutional layers followed by three fully connected ones. It has around 60 million parameters that must be fine-tuned during the learning process to recognize objects in particular categories. It is this huge parameter space that allows the recognition of so many different types of object.
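A back-of-the-envelope count, using the layer sizes reported in the SuperVision paper (better known today as AlexNet), shows where those 60 million parameters live. The halved input channels on conv2, conv4 and conv5 reflect the network's two-GPU split, and biases are ignored:

```python
# Weight counts per layer, from the sizes reported in the AlexNet paper.
layers = {
    "conv1": 11 * 11 * 3 * 96,     # 96 filters over the 3-channel input image
    "conv2": 5 * 5 * 48 * 256,     # each GPU sees only half the channels
    "conv3": 3 * 3 * 256 * 384,
    "conv4": 3 * 3 * 192 * 384,
    "conv5": 3 * 3 * 192 * 256,
    "fc6":   6 * 6 * 256 * 4096,   # the fully connected layers hold most of the weights
    "fc7":   4096 * 4096,
    "fc8":   4096 * 1000,          # one output per category
}
for name, count in layers.items():
    print(f"{name}: {count:,}")
print(f"total: {sum(layers.values()):,}")  # roughly 61 million
```

The striking point is that the three fully connected layers account for the great majority of the total; the convolutional layers extract features comparatively cheaply.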

Since 2012, several groups have significantly improved on SuperVision’s result. This year, an algorithm called GoogLeNet, created by a team of Google engineers, achieved an error rate of only 6.7 percent.

One of the big challenges in running this kind of competition is creating a high-quality dataset in the first place, say Russakovsky and co. Every image in the database has to be annotated to a gold standard against which the algorithms can be measured. On top of the training images, there is a further set of about 150,000 validation and test images that also have to be annotated.

That is no easy task with such a large number of images. Russakovsky and co have done this by crowdsourcing the work on platforms such as Amazon's Mechanical Turk, where they ask human users to categorize the images. That requires a significant amount of planning, crosschecking, and rerunning when things go wrong. But the result is a database of images annotated to a high degree of accuracy, they say.

An interesting question is how the top algorithms compare with humans when it comes to object recognition. Russakovsky and co have compared humans against machines and their conclusion seems inevitable. “Our results indicate that a trained human annotator is capable of outperforming the best model (GoogLeNet) by approximately 1.7%,” they say.

In other words, given how quickly error rates have been falling year on year, it is not going to be long before machines significantly outperform humans in image recognition tasks.

The best machine vision algorithms still struggle with objects that are small or thin, such as a small ant on the stem of a flower or a person holding a quill in their hand. They also have trouble with images that have been distorted with filters, an increasingly common phenomenon with modern digital cameras.

By contrast, these kinds of images rarely trouble humans, who tend to stumble over other issues instead. For example, they are not good at classifying objects into fine-grained categories, such as the particular species of dog or bird, whereas machine vision algorithms handle this with ease.

But the trend is clear. “It is clear that humans will soon outperform state-of-the-art image classification models only by use of significant effort, expertise, and time,” say Russakovsky and co.

Or put another way, it is only a matter of time before your smartphone is better at recognizing the content of your pictures than you are.

Ref: http://arxiv.org/abs/1409.0575: ImageNet Large Scale Visual Recognition Challenge
