Earlier this month, a Chinese tech giant quietly dethroned Microsoft and Google in an ongoing competition in AI. The company was Baidu, China’s closest equivalent to Google, and the competition was the General Language Understanding Evaluation, otherwise known as GLUE.
GLUE is a widely accepted benchmark for how well an AI system understands human language. It consists of nine different tests for things like picking out the names of people and organizations in a sentence and figuring out what a pronoun like “it” refers to when there are multiple potential antecedents. A language model that scores highly on GLUE, therefore, can handle diverse reading comprehension tasks. Out of a full score of 100, the average person scores around 87 points. Baidu is now the first team to surpass 90 with its model, ERNIE.
The public leaderboard for GLUE is constantly changing, and another team will likely top Baidu soon. But what’s notable about Baidu’s achievement is that it illustrates how AI research benefits from a diversity of contributors. Baidu’s researchers had to develop a technique specifically for the Chinese language to build ERNIE (which stands for “Enhanced Representation through kNowledge IntEgration”). It just so happens, however, that the same technique makes it better at understanding English as well.
Before BERT (“Bidirectional Encoder Representations from Transformers”) was created in late 2018, natural-language models weren’t that great. They were good at predicting the next word in a sentence—thus well suited for applications like Autocomplete—but they couldn’t sustain a single train of thought over even a small passage. This was because they didn’t comprehend meaning, such as what the word “it” might refer to.
But BERT changed that. Previous models learned to predict and interpret the meaning of a word by considering only the context that appeared before or after it—never both at the same time. They were, in other words, unidirectional.
BERT, by contrast, considers the context before and after a word all at once, making it bidirectional. It does this using a technique known as “masking.” In a given passage of text, BERT randomly hides 15% of the words and then tries to predict them from the remaining ones. This allows it to make more accurate predictions because it has twice as many cues to work from. In the sentence “The man went to the ___ to buy milk,” for example, both the beginning and the end of the sentence give hints at the missing word. The ___ is a place you can go and a place you can buy milk.
The use of masking is one of the core innovations behind dramatic improvements in natural-language tasks and is part of the reason why models like OpenAI’s infamous GPT-2 can write extremely convincing prose without deviating from a central thesis.
From English to Chinese and back again
When Baidu researchers began developing their own language model, they wanted to build on the masking technique. But they realized they needed to tweak it to accommodate the Chinese language.
In English, the word serves as the semantic unit—meaning a word pulled completely out of context still contains meaning. The same cannot be said for characters in Chinese. While certain characters do have inherent meaning, like fire (火, huŏ), water (水, shuĭ), or wood (木, mù), most do not until they are strung together with others. The character 灵 (líng), for example, can either mean clever (机灵, jīlíng) or soul (灵魂, línghún), depending on its match. And the characters in a proper noun like Boston (波士顿, bōshìdùn) or the US (美国, měiguó) do not mean the same thing once split apart.
So the researchers trained ERNIE on a new version of masking that hides strings of characters rather than single ones. They also trained it to distinguish between meaningful and random strings so it could mask the right character combinations accordingly. As a result, ERNIE has a greater grasp of how words encode information in Chinese and is much more accurate at predicting the missing pieces. This proves useful for applications like translation and information retrieval from a text document.
The researchers very quickly discovered that this approach actually works better for English, too. Though not as often as Chinese, English similarly has strings of words that express a meaning different from the sum of their parts. Proper nouns like “Harry Potter” and expressions like “chip off the old block” cannot be meaningfully parsed by separating them into individual words.
So for the sentence:
Harry Potter is a series of fantasy novels written by J. K. Rowling.
BERT might mask it the following way:
[mask] Potter is a series [mask] fantasy novels [mask] by J. [mask] Rowling.
But ERNIE would instead mask it like this:
Harry Potter is [mask] [mask] [mask] fantasy novels by [mask] [mask] [mask].
ERNIE thus learns more robust predictions based on meaning rather than statistical word usage patterns.
A diversity of ideas
The latest version of ERNIE uses several other training techniques as well. It considers the ordering of sentences and the distances between them, for example, to understand the logical progression of a paragraph. Most important, however, it uses a method called continuous training that allows it to train on new data and new tasks without it forgetting those it learned before. This allows it to get better and better at performing a broad range of tasks over time with minimal human interference.
Baidu actively uses ERNIE to give users more applicable search results, remove duplicate stories in its news feed, and improve its AI assistant Xiao Du’s ability to accurately respond to requests. It has also described ERNIE’s latest architecture in a paper that will be presented at the Association for the Advancement of Artificial Intelligence conference next year. The same way their team built on Google’s work with BERT, the researchers hope others will also benefit from their work with ERNIE.
“When we first started this work, we were thinking specifically about certain characteristics of the Chinese language,” says Hao Tian, the chief architect of Baidu Research. “But we quickly discovered that it was applicable beyond that.”
To have more stories like this delivered directly to your inbox, sign up for our Webby-nominated AI newsletter The Algorithm. It's free.
Yann LeCun has a bold new vision for the future of AI
One of the godfathers of deep learning pulls together old ideas to sketch out a fresh path for AI, but raises as many questions as he answers.
Inside a radical new project to democratize AI
A group of over 1,000 AI researchers has created a multilingual large language model bigger than GPT-3—and they’re giving it out for free.
Sony’s racing AI destroyed its human competitors by being nice (and fast)
What Gran Turismo Sophy learned on the racetrack could help shape the future of machines that can work alongside humans, or join us on the roads.
DeepMind has predicted the structure of almost every protein known to science
And it’s giving the data away for free, which could spur new scientific discoveries.
Get the latest updates from
MIT Technology Review
Discover special offers, top stories, upcoming events, and more.