Machine-Learning Algorithm Ranks the World’s Most Notable Authors

Deciding which books to digitize when they enter the public domain is tricky, unless you have an independent ranking of the most notable authors.

Public Domain Day, January 1, is the day on which previously copyrighted works become freely available to print, digitize, modify, or reuse in more or less any way. In most countries, this happens 50 or 70 years after the death of the author.

There is even a website that celebrates this event, announcing all the most notable authors whose works become freely available on that day. This allows organizations such as Project Gutenberg to prepare digital editions and LibriVox to create audio versions, and so on.

But here’s an interesting question. While the works of thousands of authors enter the public domain each year, only a small percentage of these end up being widely available. So how to choose the ones to focus on?

Today, Allen Riddell at Dartmouth College in New Hampshire says he has the answer. Riddell has developed an algorithm that automatically generates an independent ranking of notable authors for a given year. It is then a simple task to pick the works to focus on or to spot notable omissions from the past.

Riddell’s approach is to look at what kind of public domain content the world has focused on in the past and then use this as a guide to find content that people are likely to focus on in the future. For this he uses a machine-learning algorithm to mine two databases. The first is a list of over a million online books in the public domain maintained by the University of Pennsylvania. The second is Wikipedia.

Riddell begins with the Wikipedia entries of all authors in the English-language edition, more than a million of them. His algorithm extracts information such as the article length, article age, estimated views per day, time elapsed since last revision, and so on.

The algorithm then takes the list of all authors on the online book database and looks for a correlation between the biographical details on Wikipedia and the existence of a digital edition in the public domain.

That produces a “public domain ranking” of all the authors that appear on Wikipedia. For example, the author Virginia Woolf has a ranking of 1,081 out of 1,011,304 while the Italian painter Giuseppe Amisani, who died in the same year as Woolf, has a ranking of 580,363. So Riddell’s new ranking clearly suggests that organizations like Project Gutenberg should focus more on digitizing Woolf’s work than Amisani’s.
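To make the idea concrete, here is a minimal sketch of how such a ranking could be produced. It is not Riddell's code: the article does not say which model he uses, so a plain logistic regression stands in for it here, and the author names, feature values, and labels are all made up for illustration.

```python
import math

# Hypothetical feature rows, NOT real Wikipedia data:
# (article_length_kb, article_age_years, views_per_day, days_since_last_revision)
AUTHORS = {
    "author_a": (120.0, 12.0, 900.0, 3.0),
    "author_b": (95.0, 11.0, 400.0, 10.0),
    "author_c": (8.0, 4.0, 15.0, 300.0),
    "author_d": (5.0, 2.0, 5.0, 700.0),
    "author_e": (60.0, 9.0, 150.0, 30.0),
    "author_f": (12.0, 5.0, 20.0, 200.0),
}
# 1 if the author already has a digital edition in the book catalog, else 0.
HAS_EDITION = {"author_a": 1, "author_b": 1, "author_c": 0,
               "author_d": 0, "author_e": 1, "author_f": 0}


def standardize(rows):
    """Scale each feature column to zero mean and unit variance."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic(xs, ys, lr=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    w = [0.0] * len(xs[0])
    b = 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


names = sorted(AUTHORS)
xs = standardize([AUTHORS[n] for n in names])
ys = [HAS_EDITION[n] for n in names]
weights, bias = fit_logistic(xs, ys)

# Score each author by the predicted probability of having a digital edition.
scores = {n: sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)
          for n, x in zip(names, xs)}

# Rank 1 goes to the author with the highest score.
ranking = {n: i for i, n in
           enumerate(sorted(names, key=scores.get, reverse=True), start=1)}
```

The point of the sketch is the shape of the pipeline: Wikipedia-derived features go in, the existence of a digital edition is the training signal, and the fitted scores are sorted into a ranking that also covers authors nobody has digitized yet.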

The beauty of this approach is that it does not depend on the judgment of any one panel. That’s in stark contrast to the committees that are often set up to rank works subjectively.

Of the individuals who died in 1965 and whose work will enter the public domain next January in many parts of the world, the new algorithm picks out TS Eliot as the most highly ranked individual. Others highly ranked include Somerset Maugham, Winston Churchill, and Malcolm X.

As well as by year of death, it’s possible to rank authors according to categories of interest. For example, the top-ranked Mexican poet is Homero Aridjis, the top-ranked French philosopher is Jean-Paul Sartre, and the top-ranked female American writer is Terri Windling.
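Slicing a ranking by year of death or by category is simple once every individual carries a rank. The sketch below assumes a hypothetical `Person` record with a death year, a set of category labels, and a precomputed rank; the names and numbers are placeholders, not values from Riddell's data.

```python
from typing import NamedTuple, Optional


class Person(NamedTuple):
    name: str
    death_year: Optional[int]  # None for living people
    categories: frozenset      # category labels (hypothetical)
    pd_rank: int               # 1 is the most notable


# Made-up records for illustration only.
PEOPLE = [
    Person("person_a", 1965, frozenset({"poets"}), 210),
    Person("person_b", 1965, frozenset({"politicians"}), 350),
    Person("person_c", 1941, frozenset({"novelists"}), 90),
    Person("person_d", None, frozenset({"poets"}), 40000),
    Person("person_e", 1965, frozenset({"painters"}), 580000),
]


def top_by_death_year(people, year, k=3):
    """Names of the k best-ranked people who died in the given year."""
    cohort = [p for p in people if p.death_year == year]
    return [p.name for p in sorted(cohort, key=lambda p: p.pd_rank)[:k]]


def top_in_category(people, category):
    """Name of the best-ranked person in a category, or None if it is empty."""
    members = [p for p in people if category in p.categories]
    return min(members, key=lambda p: p.pd_rank).name if members else None


died_1965 = top_by_death_year(PEOPLE, 1965, k=2)
top_poet = top_in_category(PEOPLE, "poets")
```

With these placeholder records, `died_1965` holds the two best-ranked members of the 1965 cohort and `top_poet` is the best-ranked person labeled as a poet.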

Riddell says his ranking system compares well with existing rankings compiled by human experts, such as one compiled by the editorial board of the Modern Library. “The Public Domain Rank of the authors selected by the Modern Library editorial board is consistently high,” he says.

It is not perfect, however. Riddell acknowledges that his new Public Domain Ranking is likely to reflect the biases inherent in Wikipedia, which is well known for having few female editors, for example.

But with that in mind, the ranking is still likely to be useful. It should be handy for finding notable authors in the public domain whose works are not yet available electronically because they have somehow been overlooked. “Flannery O’Connor and Sylvia Plath stand out as significant examples of authors whose works might be made available today on Project Gutenberg Canada,” says Riddell. (Canada follows the 50-year rule rather than 70.)
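Finding overlooked authors of this kind amounts to a filter: keep those who rank highly, are already in the public domain in the jurisdiction of interest, and have no digital edition yet. The sketch below assumes the public-domain set has been worked out separately for that jurisdiction (50 or 70 years after death, for instance); the function name, the rank cutoff, and the data are all hypothetical.

```python
def overlooked(ranks, public_domain, digitized, max_rank):
    """Authors ranked at or better than max_rank, in the public domain,
    and not yet digitized, ordered from best rank to worst."""
    candidates = [
        name for name, r in ranks.items()
        if r <= max_rank and name in public_domain and name not in digitized
    ]
    return sorted(candidates, key=ranks.get)


# Made-up inputs for illustration only.
RANKS = {"author_a": 12, "author_b": 340, "author_c": 95,
         "author_d": 580000, "author_e": 40}
PUBLIC_DOMAIN = {"author_a", "author_c", "author_d", "author_e"}
DIGITIZED = {"author_a"}

todo = overlooked(RANKS, PUBLIC_DOMAIN, DIGITIZED, max_rank=1000)
print(todo)  # ['author_e', 'author_c']
```

Here `author_a` is skipped because an edition already exists, `author_b` because the work is still in copyright, and `author_d` because the rank falls below the cutoff.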

It may even change the nature of Public Domain Day. “Public Domain Rank promises to facilitate—and even automate—Public Domain Day,” says Riddell.

Handy!

Ref: arxiv.org/abs/1411.2180 : Public Domain Rank: Identifying Notable Individuals with the Wisdom of the Crowd
