Innovation News

Computers Learn New ABCs

  • September 2003
  • By Michael Erard

Efforts to encode the world's written languages will enable a truly global Internet.

   

For tens of millions of people around the world, from West Africa to Southeast Asia to the Middle East, the Internet's not such a friendly place. That's because many of the world's writing systems still aren't encoded in software, which means millions of people can't write e-mail, build Web sites, or search databases in their native scripts. A group of linguists at the University of California, Berkeley, is trying to change that by making sure that nearly 100 additional scripts have a place in a crucial international standard that lets computers render, process, and send text data.

The university's initiative "is an effort to rectify an oft-overlooked aspect of the digital divide: many scripts used by languages of under five million speakers in the world today are not represented in the international standard," says Deborah Anderson, a linguist at Berkeley who leads the effort. That standard, called Unicode, assigns a unique ID number to every written character, symbol, and punctuation mark in a written language. The ID numbers mean that characters won't get misinterpreted as data move between software programs or across the Internet, a problem that sometimes shows up as a string of question marks on your screen and can cripple the ability of whole populations to communicate online. Vietnam shows what is at stake. Computer and software manufacturers had come up with 43 different ways to encode Vietnamese text, which meant computers couldn't reliably swap data. Then, early this year, the Vietnamese government adopted Unicode as its national standard, a change that is enabling radical economic transformations there.
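To make the "unique ID number" idea concrete, here is a minimal Python sketch. It is an illustration only, not anything taken from the Berkeley project: the sample word, the function names, and the choice of ASCII and Windows-1252 as the "wrong" encodings are all assumptions made for the example. It prints the ID number (code point) of each character in a Vietnamese word, then shows how the same text degrades into question marks or stray symbols when software does not agree on the encoding.

```python
import unicodedata

# Hypothetical sample word. The escape \u1ec7 is the single precomposed
# Vietnamese letter "e with circumflex and dot below".
SAMPLE = "Vi\u1ec7t"


def code_points(text):
    """Return each character's Unicode ID number in the usual U+XXXX form."""
    return [f"U+{ord(ch):04X}" for ch in text]


def describe(text):
    """Pair each code point with its official character name, if it has one."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]


def to_ascii_lossy(text):
    """Force text into ASCII; characters ASCII lacks become question marks."""
    return text.encode("ascii", errors="replace")


def misread(text, written_as="utf-8", read_as="cp1252"):
    """Write text in one encoding and read it back as another.

    When the two encodings differ, the bytes are misinterpreted and the
    reader sees the wrong characters.
    """
    return text.encode(written_as).decode(read_as, errors="replace")


if __name__ == "__main__":
    for code, name in describe(SAMPLE):
        print(code, name)
    print(to_ascii_lossy(SAMPLE))   # one character is lost to "?"
    print(misread(SAMPLE))          # garbled: sender and reader disagree
    print(misread(SAMPLE, "utf-8", "utf-8"))  # intact: both sides agree
```

When both sides use the same Unicode encoding, the last call returns the word unchanged. That agreement is what a single national standard provides in place of dozens of incompatible encodings.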

The problem is that the more obscure writing systems are not yet encoded in the Unicode standard. Adding another 100 scripts is a big task; only 52 are encoded today. To do the job, Berkeley is recruiting and funding linguists, as well as users of scripts like N'Ko (used in West Africa), Balinese (used in Indonesia), and Tifinagh (used in parts of Northern Africa), to determine how many characters each script contains, design fonts, and guide proposals through a bureaucratic maze of government agencies and computer standards bodies. The benefit will be visible to Internet users like Mamady Doumbouya, a Philadelphia publisher who would be able to offer an online version of his newspaper in N'Ko for the first time. "Without Unicode, it takes so much to set up your computer to read a newspaper in N'Ko," Doumbouya says.

Such changes won't happen overnight. Anderson estimates that the project, launched last year, will take 10 years to complete. Until recently, computer companies sustained the encoding effort, but their interest is dwindling because users of unencoded alphabets represent too small a market. The Berkeley project is part of a larger effort to make the Internet more globally available; already the World Wide Web Consortium has made it possible to register domain names in these new scripts, meaning, among other things, that the URLs of Web sites can reflect the writing systems of the people who own them.

 
