
Computers Learn New ABCs

Efforts to encode the world’s written languages will enable a truly global Internet.
September 1, 2003

For tens of millions of people around the world, from West Africa to Southeast Asia to the Middle East, the Internet’s not such a friendly place. That’s because many of the world’s writing systems still aren’t encoded in software, which means millions of people can’t write e-mail, build Web sites, or search databases in their native scripts. A group of linguists at the University of California, Berkeley, is trying to change that by making sure that nearly 100 additional scripts have a place in a crucial international standard that lets computers render, process, and send text data.

The university’s initiative “is an effort to rectify an oft-overlooked aspect of the digital divide: many scripts used by languages of under five million speakers in the world today are not represented in the international standard,” says Deborah Anderson, a linguist at Berkeley who leads the effort. That standard is Unicode, which assigns a unique ID number to every character, symbol, and punctuation mark in a written language. The ID numbers mean that characters won’t get misinterpreted as data move between software programs or across the Internet, a problem that sometimes shows up as a string of question marks on your screen and can cripple the ability of whole populations to communicate online. For example, Unicode is enabling radical economic transformations in Vietnam. Before this year, computer and software manufacturers had come up with 43 different ways to encode Vietnamese text, which meant computers couldn’t reliably swap data. Then, early this year, the Vietnamese government adopted Unicode as its national standard.
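To see what a unique ID number per character looks like in practice, here is a minimal Python sketch. It is an illustration only, not anything taken from the Berkeley project: the sample word and the helper names `code_points` and `to_legacy_ascii` are assumptions made for the example, and plain ASCII stands in for any older encoding that lacks the character in question.

```python
# Illustrative sketch: the helper names and the sample word are hypothetical.

def code_points(text):
    """Return the Unicode ID number (code point) of each character, as U+XXXX."""
    return [f"U+{ord(ch):04X}" for ch in text]


def to_legacy_ascii(text):
    """Force text into ASCII, standing in for an encoding that lacks the script.

    Characters the encoding cannot represent are replaced with '?'.
    """
    return text.encode("ascii", errors="replace").decode("ascii")


# "Việt", written with an escape so the accented letter is a single code point.
word = "Vi\u1EC7t"

print(code_points(word))      # one ID number per character
print(to_legacy_ascii(word))  # the accented letter degrades to a question mark

# A Unicode encoding such as UTF-8 carries the same text through unchanged.
assert word.encode("utf-8").decode("utf-8") == word
```

Run as written, the first line printed lists four code points and the second shows the accented letter replaced by a question mark, which is the failure readers see when software on one end does not share an encoding with software on the other.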

The problem is that the more obscure writing systems are not yet encoded in the Unicode standard. Adding another 100 scripts is a big task; only 52 are encoded today. To do the job, Berkeley is recruiting and funding linguists, as well as users of scripts like N’Ko (used in West Africa), Balinese (used in Indonesia), and Tifinagh (used in parts of Northern Africa), to determine how many characters each script contains, design fonts, and guide proposals through a bureaucratic maze of government agencies and computer standards bodies. The benefit will be visible to Internet users like Mamady Doumbouya, a Philadelphia publisher who would be able to offer an online version of his newspaper in N’Ko for the first time. “Without Unicode, it takes so much to set up your computer to read a newspaper in N’Ko,” Doumbouya says.

Such changes won’t happen overnight. Anderson estimates that the project, launched last year, will take 10 years to complete. Until recently, computer companies sustained the encoding effort, but their interest is dwindling because users of unencoded alphabets represent too small a market. The Berkeley project is part of a larger effort to make the Internet more globally available; already the World Wide Web Consortium has made it possible to register domain names in these new scripts, meaning, among other things, that the URLs of Web sites can reflect the writing systems of the people who own them.

U.S. national security experts are interested, too. Everette Jordan, head of the National Virtual Translation Center, a newly formed U.S. government office that provides foreign-language resources for the intelligence community, points out that “technologically, we’re deaf, dumb, and blind if we can’t read this stuff.” Soon, though, U.S. security agencies and African newspaper publishers alike could rally to a new standard.
