A “Bug Fix” That Could Unlock the Web for Millions Around the World

Too many domain names with non-Latin letters are still shut out of the global Internet economy.

Mike Orcutt

May 9, 2017


Companies that do business online are missing out on billions in annual sales thanks to a bug that is keeping their systems incompatible with Internet domain names made of non-Latin characters. Fixing it could also bring another 17 million people who speak Russian, Chinese, Arabic, Vietnamese, and Indian languages online.

Those are the conclusions of a new study by an industry-led group sponsored by the Internet Corporation for Assigned Names and Numbers (ICANN), the organization responsible for maintaining the list of valid Internet domain names. The group, called the Universal Acceptance Steering Group, includes representatives from a number of Internet companies, among them Microsoft and GoDaddy. Its objective is to encourage software developers and service providers to update how their systems validate the string of characters to the right of the dot in a domain name or e-mail address, also called the top-level domain.

The bug wasn’t an obvious problem until 2011, when ICANN decided to dramatically expand the range of what can appear to the right of the dot (see “ICANN’s Boondoggle”). Between 2012 and 2016, the number of top-level domains ballooned from 12 to over 1,200. That includes 100 “internationalized” domains that feature a non-Latin script or Latin-alphabet characters with diacritics, like an umlaut (¨), or ligatures, like the German Eszett (ß). Some 2.6 million internationalized domain names have been registered under the new top-level domains, largely concentrated in the Russian and Chinese languages, according to the new study.

Many Web applications and e-mail clients recognize top-level domains as valid only if they are composed of characters that can be encoded using the American Standard Code for Information Interchange, or ASCII. The problem is most pronounced with e-mail addresses, which are used not only for sending messages but also as required credentials for accessing online bank accounts and social media pages. In 2016, the group tested e-mail addresses with non-Latin characters to the right of the dot and found acceptance rates of less than 20 percent.

The bug fix is relatively straightforward, says Ram Mohan, the steering group’s chair. It entails changing the fundamental rules that validate domains so that they accept Unicode, a different standard for encoding text that works for many more languages. The new research suggests that the potential economic benefits of making the fix outweigh the costs. Too many businesses, including e-commerce firms, e-mail services, and banks, simply aren’t yet aware that their systems don’t accept these new domains, says Mohan.
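To make the difference concrete, here is a minimal Python sketch that contrasts an ASCII-only check with a looser, Unicode-aware one. Both rules are illustrative assumptions, not taken from the study or from any real system: the function names, the 2-to-6-letter legacy pattern, and the sample domains are all hypothetical.

```python
import re
import unicodedata

# Hypothetical legacy rule: the TLD must be 2-6 ASCII letters.
LEGACY_TLD = re.compile(r"[A-Za-z]{2,6}")


def tld_of(domain: str) -> str:
    """Return the label to the right of the last dot."""
    return domain.rsplit(".", 1)[-1]


def legacy_tld_ok(domain: str) -> bool:
    """ASCII-only check of the kind the article describes as the bug."""
    return LEGACY_TLD.fullmatch(tld_of(domain)) is not None


def unicode_tld_ok(domain: str) -> bool:
    """Looser check (an assumption): letters and marks from any script."""
    tld = tld_of(domain)
    if len(tld) < 2:
        return False
    if tld.lower().startswith("xn--"):
        # ASCII-compatible form of an internationalized label.
        return all(c.isascii() and (c.isalnum() or c == "-") for c in tld)
    # Unicode general categories beginning with L (letter) or M (mark).
    return all(unicodedata.category(c)[0] in "LM" for c in tld)


for name in ["example.com", "пример.рф", "例子.中国", "उदाहरण.भारत"]:
    print(name, legacy_tld_ok(name), unicode_tld_ok(name))
```

A syntax check like this cannot tell whether a top-level domain actually exists. A production system would more plausibly compare the label against the published list of delegated top-level domains rather than rely on a pattern.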

Things are improving, though. In 2014, Google updated Gmail to accept and display internationalized domain names without having to rely on an inconvenient workaround that translated the characters into ASCII. Microsoft is in the process of updating its e-mail systems, which include Outlook clients and its cloud-based service, to accept internationalized domain names and e-mail addresses.
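The ASCII workaround mentioned above is commonly known as Punycode: each internationalized label is rewritten as an ASCII string that begins with `xn--`. The sketch below shows that round trip using Python's built-in `idna` codec, which follows the older IDNA 2003 rules. The helper names and the sample domain are assumptions for illustration, and nothing here reflects how Gmail or Outlook handle it internally.

```python
def to_ascii(domain: str) -> str:
    """Convert a Unicode domain name to its ASCII-compatible form."""
    return domain.encode("idna").decode("ascii")


def to_unicode(domain: str) -> str:
    """Convert an ASCII-compatible domain name back to Unicode."""
    return domain.encode("ascii").decode("idna")


original = "пример.рф"  # hypothetical sample domain
ascii_form = to_ascii(original)

print(ascii_form)              # every label starts with "xn--"
print(to_unicode(ascii_form))  # the original Unicode form again
```

Software that only stores or displays the `xn--` form technically works, but users then see a string that bears no resemblance to the name they typed, which is why native Unicode handling is the goal.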

It’s not just about the bottom line, says Mark Svancarek, a program manager for customer and partner experience at Microsoft, and a vice chair of the Universal Acceptance Steering Group. To let millions of people be held back from the Internet because “the character set is gibberish to them” is antithetical to his company’s mission, he says.

Acceptance of non-ASCII domains is likely to spur Internet adoption, since a large portion of the next billion people projected to connect to the Internet predominantly speak and write only in their local languages, says Mohan. Providing accessibility to these people will depend in many ways on the basic assumptions governing the core functions of the Internet, he says. “The problem here is that in some ways this is lazy programming, and because it’s lazy programming, it’s easy to replace it with better programming.”
