MIT Technology Review

New work from students at the University of Hong Kong describes a novel use of convolutional neural networks, a kind of neural network (a collection of artificial neurons, or nodes, that can be trained to accomplish a wide variety of tasks) previously applied mainly to image recognition. The students used a convolutional network to "learn" features, such as tempo and harmony, from a database of songs spanning 10 genres. The result was a set of trained neural networks that could identify a song's genre, a task considered very hard in computer science, with better than 87 percent accuracy. In March the group won a best-paper award at the International Multiconference of Engineers and Computer Scientists.

What made this feat possible was the depth of the students' convolutional neural network. Conventional learning methods such as "kernel machines" are, as Yoshua Bengio of the University of Montreal has put it, shallow: they have too few layers of processing nodes (analogous to the layers of neurons in your cerebral cortex) to extract useful amounts of information from complex natural patterns.

In their experiments, the students, led by professor Tom Li, found that the optimal number of layers for musical genre recognition was three convolutional (or "thinking") layers: the first takes in the raw input data, and the third outputs the genre.
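The layered design can be illustrated with a minimal sketch in pure Python. It is not the students' model: the per-frame features, the filter counts, the 4-frame kernel width, the ReLU nonlinearity, the global average pooling and the random (untrained) weights are all assumptions chosen for illustration, and `conv1d_relu` and `genre_probabilities` are hypothetical helper names. It only shows how three stacked 1-D convolutions can turn a sequence of audio frames into one score per genre.

```python
import math
import random

NUM_GENRES = 10  # the article's song database spans 10 genres


def conv1d_relu(frames, kernels):
    """Valid 1-D convolution along time, followed by ReLU (an assumed nonlinearity).

    frames:  T frames, each a list of C input channels
    kernels: K filters, each a list of W taps, each tap a list of C weights
    returns: T - W + 1 frames, each a list of K channels
    """
    width = len(kernels[0])
    out = []
    for t in range(len(frames) - width + 1):
        row = []
        for kernel in kernels:
            s = sum(
                w * x
                for tap, frame in zip(kernel, frames[t:t + width])
                for w, x in zip(tap, frame)
            )
            row.append(max(0.0, s))
        out.append(row)
    return out


def random_kernels(count, width, channels, rng):
    """Untrained, randomly initialised filters; a real model would learn these."""
    return [
        [[rng.gauss(0.0, 0.1) for _ in range(channels)] for _ in range(width)]
        for _ in range(count)
    ]


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def genre_probabilities(frames, layers):
    """Run the stacked conv layers, average the last one over time, normalise."""
    h = frames
    for kernels in layers:
        h = conv1d_relu(h, kernels)
    # Global average pooling (assumed): one score per output channel, i.e. per genre.
    scores = [sum(frame[g] for frame in h) / len(h) for g in range(len(h[0]))]
    return softmax(scores)


rng = random.Random(0)
channels, width = 13, 4  # hypothetical: 13 features per frame, 4-frame kernels
layers = [
    random_kernels(8, width, channels, rng),    # layer 1 reads the input frames
    random_kernels(8, width, 8, rng),           # layer 2
    random_kernels(NUM_GENRES, width, 8, rng),  # layer 3 emits one channel per genre
]
song = [[rng.gauss(0.0, 1.0) for _ in range(channels)] for _ in range(100)]
probs = genre_probabilities(song, layers)
print(len(probs), round(sum(probs), 6))
```

With random weights the probabilities carry no meaning; the point is the shape of the computation, ending in one probability per genre that sums to 1, which is what a trained classifier of this kind would output.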

In each layer (pictured above), a single node, or neuron, "hears" only a tiny portion of the song, about 23 milliseconds. Each node's window overlaps 50 percent with those of its neighbors, however, so together the nodes in the network hear a little more than two seconds of the song.
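The overlap arithmetic can be checked with a short calculation. It reads the description as a row of 23 ms windows laid out in time, each starting halfway through the previous one; that layout is an interpretation, not a detail the article confirms. With an 11.5 ms hop, N windows span 23 + (N - 1) × 11.5 ms. The helper names below are hypothetical.

```python
import math

WINDOW_MS = 23.0  # each node "hears" about 23 ms of audio
OVERLAP = 0.5     # 50 percent overlap with the neighbouring node


def coverage_ms(num_nodes, window_ms=WINDOW_MS, overlap=OVERLAP):
    """Total span of audio covered by num_nodes consecutive overlapping windows."""
    hop_ms = window_ms * (1.0 - overlap)  # 11.5 ms between window start times
    return window_ms + (num_nodes - 1) * hop_ms


def nodes_needed(target_ms, window_ms=WINDOW_MS, overlap=OVERLAP):
    """Smallest number of overlapping windows whose span reaches target_ms."""
    hop_ms = window_ms * (1.0 - overlap)
    return math.ceil((target_ms - window_ms) / hop_ms) + 1


n = nodes_needed(2000.0)
print(n, coverage_ms(n))  # 173 2001.0
```

Under this reading, about 173 overlapping windows are enough to cover just over two seconds of audio, consistent with the figure quoted above.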

While a human might be hard-pressed to identify the genre of a track in so short a time, this algorithm does so easily when applied to songs from the standard library used to test automated genre recognition. However, it fell flat in subsequent tests in which the students exposed it to music from outside the library on which it was trained.

They attribute the algorithm's failure "in the wild" to a training library that was too small. Because the algorithm could chew through 240 songs in just two hours, the Hong Kong students say it has the potential to scale to a much larger training library.
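The throughput figure translates into a simple per-song rate. The sketch below assumes, purely hypothetically, that processing time grows linearly with the number of songs; the article does not say whether the two hours covered training, classification, or both, and `hours_for` is an illustrative helper rather than anything from the paper.

```python
SONGS = 240   # songs processed, per the article
HOURS = 2.0   # time taken, per the article

seconds_per_song = HOURS * 3600 / SONGS  # 30.0 seconds per song


def hours_for(library_size):
    """Hypothetical estimate assuming time scales linearly with library size."""
    return library_size * seconds_per_song / 3600


print(seconds_per_song)           # 30.0
print(round(hours_for(1000), 2))  # 8.33
```

At 30 seconds per song, a hypothetical library of 1,000 songs would take roughly eight and a third hours under that linear assumption; real training costs could grow faster or slower.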

Intriguingly, the convolutional neural network on which this work is based was originally inspired by studies of the cat visual cortex, and cats, being mammals, have visual cortices not unlike our own. Experiments in another mammal, the ferret, have shown the inverse of what this paper did: where the students applied a vision-derived network to a hearing problem, researchers have rewired a mammalian brain so that it sees with its auditory cortex.

If convolutional neural networks are as flexible as the mammalian perceptual systems on which they are based, why aren't they being applied to all sorts of other perception problems in AI?
