A View from Emerging Technology from the arXiv
Machine-Learning Algorithm Ranks the World's Most Notable Authors
Deciding which books to digitize when they enter the public domain is tricky, unless you have an independent ranking of the most notable authors.
Public Domain Day, January 1, is the day on which previously copyrighted works become freely available to print, digitize, modify, or reuse in more or less any way. In most countries, this happens 50 or 70 years after the death of the author.
There is even a website that celebrates this event, announcing all the most notable authors whose works become freely available on that day. This allows organizations such as Project Gutenberg to prepare digital editions and LibriVox to create audio versions, and so on.
But here’s an interesting question. While the works of thousands of authors enter the public domain each year, only a small percentage of these end up being widely available. So how to choose the ones to focus on?
Today, Allen Riddell at Dartmouth College in New Hampshire says he has the answer. Riddell has developed an algorithm that automatically generates an independent ranking of notable authors for a given year. It is then a simple task to pick the works to focus on or to spot notable omissions from the past.
Riddell’s approach is to look at what kind of public domain content the world has focused on in the past and then use this as a guide to find content that people are likely to focus on in the future. For this he uses a machine-learning algorithm to mine two databases. The first is a list of over a million online books in the public domain maintained by the University of Pennsylvania. The second is Wikipedia.
Riddell begins with the Wikipedia entries of all authors in the English-language edition, more than a million of them. His algorithm extracts information such as the article length, article age, estimated views per day, time elapsed since last revision, and so on.
The algorithm then takes the list of all authors on the online book database and looks for a correlation between the biographical details on Wikipedia and the existence of a digital edition in the public domain.
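To make the idea concrete, here is a minimal sketch in Python of how such a correlation might be learned. It is not Riddell's code. The article does not say which model he used, so the choice of logistic regression is an assumption, as are the function names, the log scaling, and every number in the toy training set. Only the four feature types come from the description above.

```python
import math

# Hypothetical rows: (article_length_kb, article_age_years, views_per_day,
# days_since_last_revision) -> 1 if a public domain digital edition exists.
# Every value here is invented for illustration.
TRAINING = [
    ((120.0, 13.0, 4000.0, 2.0), 1),
    ((95.0, 12.0, 2500.0, 5.0), 1),
    ((60.0, 11.0, 900.0, 10.0), 1),
    ((8.0, 4.0, 15.0, 400.0), 0),
    ((5.0, 3.0, 6.0, 700.0), 0),
    ((12.0, 6.0, 30.0, 250.0), 0),
]

def transform(features):
    """Log-scale each feature so heavy-tailed counts become comparable."""
    return [math.log1p(x) for x in features]

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit(rows, steps=3000, lr=0.05):
    """Fit logistic regression weights by batch gradient descent."""
    n_features = len(rows[0][0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(steps):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for features, label in rows:
            x = transform(features)
            z = bias + sum(w * xi for w, xi in zip(weights, x))
            error = sigmoid(z) - label
            for i, xi in enumerate(x):
                grad_w[i] += error * xi
            grad_b += error
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(rows)
    return weights, bias

def score(features, weights, bias):
    """Estimated probability that an author has a digital edition."""
    x = transform(features)
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))

weights, bias = fit(TRAINING)
well_covered = score((100.0, 12.0, 3000.0, 3.0), weights, bias)
obscure = score((6.0, 3.0, 10.0, 500.0), weights, bias)
print(well_covered > obscure)
```

In a sketch like this, the fitted score plays the role of a notability estimate: an author whose Wikipedia article looks like those of already-digitized authors scores higher than one whose article is short, new, and rarely read.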
That produces a “public domain ranking” of all the authors that appear on Wikipedia. For example, the author Virginia Woolf has a ranking of 1,081 out of 1,011,304 while the Italian painter Giuseppe Amisani, who died in the same year as Woolf, has a ranking of 580,363. So Riddell’s new ranking clearly suggests that organizations like Project Gutenberg should focus more on digitizing Woolf’s work than Amisani’s.
The beauty of this approach is that it is entirely independent. That’s in stark contrast to the committees that are often set up to rank works subjectively.
Of the individuals who died in 1965 and whose work will enter the public domain next January in many parts of the world, the new algorithm picks out TS Eliot as the most highly ranked individual. Others highly ranked include Somerset Maugham, Winston Churchill, and Malcolm X.
As well as by year of death, it’s possible to rank authors according to categories of interest. For example, the top-ranked Mexican poet is Homero Aridjis, the top-ranked French philosopher, Jean-Paul Sartre, and the top-ranked female American writer, Terri Windling.
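Ranking within a year of death or a category amounts to filtering the full list and then sorting by score. The sketch below shows one way to do that in Python. The `Person` record, the `top_ranked` helper, the placeholder names, and all the scores are hypothetical and exist only to illustrate the idea; none of them come from Riddell's paper.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional

@dataclass(frozen=True)
class Person:
    name: str
    death_year: Optional[int]   # None if still living or unknown
    categories: FrozenSet[str]  # e.g. Wikipedia-style category labels
    score: float                # higher means more notable (invented values)

# Placeholder records; names and numbers are made up.
PEOPLE = [
    Person("Author A", 1965, frozenset({"poets"}), 0.97),
    Person("Author B", 1965, frozenset({"novelists"}), 0.91),
    Person("Author C", 1941, frozenset({"novelists"}), 0.99),
    Person("Author D", None, frozenset({"poets"}), 0.40),
    Person("Author E", 1965, frozenset({"poets"}), 0.12),
]

def top_ranked(people, *, death_year=None, category=None, limit=3):
    """Return the highest-scoring names that match the given filters."""
    selected = [
        p for p in people
        if (death_year is None or p.death_year == death_year)
        and (category is None or category in p.categories)
    ]
    selected.sort(key=lambda p: p.score, reverse=True)
    return [p.name for p in selected[:limit]]

died_1965 = top_ranked(PEOPLE, death_year=1965)
top_poet = top_ranked(PEOPLE, category="poets", limit=1)
print(died_1965)
print(top_poet)
```

The two filters can also be combined, for example to list the top poets who died in a given year.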
Riddell says his ranking system compares well with existing rankings compiled by human experts, such as one compiled by the editorial board of the Modern Library. “The Public Domain Rank of the authors selected by the Modern Library editorial board is consistently high,” he says.
It is not perfect, however. Riddell acknowledges that his new Public Domain Ranking is likely to reflect the biases inherent in Wikipedia, which is well known for having few female editors, for example.
But with that in mind, the ranking is still likely to be useful. It should be handy for finding notable authors in the public domain whose works are not yet available electronically because they have somehow been overlooked. “Flannery O’Connor and Sylvia Plath stand out as significant examples of authors whose works might be made available today on Project Gutenberg Canada,” says Riddell. (Canada follows the 50-year rule rather than 70.)
It may even change the nature of Public Domain Day. “Public Domain Rank promises to facilitate—and even automate—Public Domain Day,” says Riddell.
Ref: arxiv.org/abs/1411.2180 : Public Domain Rank: Identifying Notable Individuals with the Wisdom of the Crowd