King – Man + Woman = Queen: The Marvelous Mathematics of Computational Linguistics

The ability to number-crunch vast amounts of words is creating a new science of linguistics.

Emerging Technology from the arXivarchive page

September 17, 2015

Computational linguistics has dramatically changed the way researchers study and understand language. The ability to number-crunch huge amounts of words for the first time has led to entirely new ways of thinking about words and their relationship to one another.

This number-crunching shows exactly how often a word appears close to other words, an important factor in how they are used. So the word Olympics might appear close to words like running, jumping, and throwing but less often next to words like electron or stegosaurus. This set of relationships can be thought of as a multidimensional vector that describes how the word Olympics is used within a language, which itself can be thought of as a vector space.

And therein lies this massive change. This new approach allows languages to be treated like vector spaces with precise mathematical properties. Now the study of language is becoming a problem of vector space mathematics.

Today, Timothy Baldwin at the University of Melbourne in Australia and a few pals explore one of the curious mathematical properties of this vector space: that adding and subtracting vectors produces another vector in the same space.

The question they address is this: what do these composite vectors mean? And in exploring this question they find that the difference between vectors is a powerful tool for studying language and the relationship between words.

First some background. The easiest way to think about words and how they can be added and subtracted like vectors is with an example. The most famous is the following: king – man + woman = queen. In other words, adding the vectors associated with the words king and woman while subtracting man is equal to the vector associated with queen. This describes a gender relationship.

Another example is: paris – france + poland = warsaw. In this case, the vector difference between paris and france captures the concept of capital city.

Baldwin and co ask how reliable this approach can be and how far it can be taken. To do this, they compare how vector relationships change according to the corpus of words studied. For example, do the same vector relationships work in the corpus of words from Wikipedia as in the corpus of words from Google News or Reuters English newswire?

To find out, they look at the vectors associated with a number of well-known relationships between classes of words. These include the relationship between an entity and its parts, for example airplane and cockpit; an action and the object it involves, such as hunt and deer; a noun and its collective noun such ant and army. They also include a range of grammatical links—a noun and its plural, such as dog and dogs, a verb and its past tense, such as know and knew; and a verb and its third person plural such as accept and accepts.

The results make for interesting reading. Baldwin and co say that the vectors sums captured in these relationships generally form tight clusters in the vector spaces associated with each corpus.

However, there are some interesting outliers where words have more than one meaning and so have ambiguous representations in these vectors spaces. Examples in the third person plural cluster include study and studies, run and runs, increase and increases, all words that can be nouns and verbs, which distorts their vectors in these spaces.

That’s interesting work that forges along this new path in the study of words and their relationships to each other. “This paper is the first to test the generalizability of the vector difference approach across a broad range of lexical relations,” they say.

An important question that Baldwin and co neglect is how this better understanding might be used in the real world. One obvious answer is to help machines understand human language. Another is to help with language translation.

It’s worth noting that one of the pioneers and driving forces in this field is Google and its machine translation team. These guys have found that a vector relationship that appear in English generally also works in Spanish, German, Vietnamese, and indeed all languages.

That’s how Google does its machine translation. Essentially, it considers a sentence equivalent in two languages if its position in the vector spaces of each is the same. By this approach its traditional meaning is almost irrelevant.

But because of the idiosyncratic nature of language, there are numerous exceptions. It is these that cause the problems for machine translation algorithms.

So finding ways of spotting the ambiguities may provide a useful way to correct these problems.

Ref: arxiv.org/abs/1509.01692 : Take and Took, Gaggle and Goose, Book and Read: Evaluating the Utility of Vector Differences for Lexical Relation Learning

Deep Dive

Artificial intelligence

Large language models can do jaw-dropping things. But nobody knows exactly why.

And that's a problem. Figuring it out is one of the biggest scientific puzzles of our time and a crucial step towards controlling more powerful future models.