In a Washington, DC, conference room soundproofed to thwart eavesdropping, five linguists working for the government, speaking on condition that their names not be published, describe the monumental task they face: analyzing foreign-language intercepts in the age of terror.
Around the table are experts in Arabic, Russian, Chinese, and Italian, and a woman who is one of the government’s few speakers of Dari, a language used in Afghanistan. They are young; three are in their 20s, the others in their 30s and 40s. But they are increasingly vital to U.S. national security: they are the front-line translators analyzing language that is messy, complicated, and fragmented but may give clues to an impending terrorist attack.
“Analyze” is the operative word here, not just “translate.” Poring over documents and audio clips, the five, along with thousands of other government or contract linguists who do similar work, struggle to pull out single words, isolate fragments of information, weave intelligence out of the fragments, and generally perform linguistic triage on the flood of raw material collected daily by the CIA, FBI, Department of Defense, and other sources. One linguist tells of having to dig through a filthy box of documents, reeking of gasoline, that had just come off a plane from Afghanistan. Another describes decoding a handwritten note whose signature, a key clue to its intelligence value, was half ripped off. Another recounts listening to an intercepted cell-phone conversation, in Russian, between two men in a noisy outdoor marketplace. One man was stuttering; that could have indicated he was nervous, which might or might not reflect the importance of the conversation. “It’s like looking at the pieces of a jigsaw puzzle,” without the box top that shows what the picture is supposed to look like, says the Chinese linguist. “And maybe the pieces don’t fit together. You have to brush off the dust and say, what do I do next?”
Beyond these physical and contextual stumbling blocks, analysts face challenges from the languages themselves. Al-Qaeda members tend to speak an Arabic saturated with cultural and historical allusions; that makes it tough to distinguish religious dialogue from attack plans. And some of the terror group’s members aren’t native speakers of the language, which means they make unusual word choices, pronounce words differently, and commit many grammatical errors. “We have a lot of practice dealing with the Soviet model or the European model of conversation,” but not as much with cultures in which direct, plain language is rare, says Everette Jordan, a former National Security Agency linguist who arranged for the five linguists to meet with Technology Review. “It’s not the where, what, how, and when. It’s the why, and the why not. That’s what we’re encountering a lot.”
The costs of failing to clarify what adversaries mean in a timely manner are high. That was made clear during Congressional investigations into the intelligence lapses that led up to the September 11 attacks. In perhaps the most glaring example, on Sept. 10, 2001, according to June 2002 news reports, the NSA intercepted two Arabic-language messages, one that said “Tomorrow is zero hour” and another that said “The match is about to begin.” The sentences weren’t translated until Sept. 12, 2001. The revelation underscored the fact that the U.S. government faces a serious crisis in its ability to store, analyze, search, and translate data in dozens of foreign languages.
It’s a crisis that’s getting worse, literally by the hour. The backlog of unexamined material is so large that it’s measured not in mere pages but in cubic meters. Consider that every three hours, NSA satellites sweep up enough information to fill the Library of Congress. And the NSA is only one intelligence agency. Somewhere in that massive haystack might be a needle about two kilobytes in size, the amount of data in a single typewritten page, in which terrorists let slip their plans.
And although there’s a well-reported shortage of qualified translators to help search for that needle, there’s a systemic problem, too. This deluge of intelligence is absorbed by a federal intelligence-gathering bureaucracy that is sprawling and balkanized. Four branches of the military, 13 intelligence agencies, and the State Department’s diplomatic corps all have their own creaky systems built up over decades. Each agency houses (some say hoards) its own set of translators, analysts, and databases. Indeed, well before September 11, experts knew that the government’s translation infrastructure wasn’t only overwhelmed; it was obsolete. But the attacks provided the motivation to rethink, from the ground up, how translation gets done. “We’re going through a cultural change right now,” the Chinese linguist says. “We have to find the tools for the job.”
A New Thrust
The locus of this cultural change, and of the shift in technology strategy, is several floors of an inconspicuous office building in downtown Washington, not far from the FBI’s headquarters. This is the seat of the National Virtual Translation Center, a new federal office created by the USA Patriot Act in 2001 but only funded in 2003. Its budget is secret, and last fall, most of its brand-new cubicles stood empty. In one room, boxes that once held Dell computer monitors were stacked against a wall; in another, Russian, Arabic, and Swahili dictionaries were still shrink-wrapped.
But the humdrum setting belies the center’s pivotal role in transforming the U.S. government’s approach to translation and analysis. It will act as the hub of a translation web serving all federal intelligence agencies. This year it is hiring perhaps 300 in-house linguists; more significantly, over the next three to five years it will link tens of thousands of government linguists and private contractors via secure network connections of a type already used by the FBI and the CIA.
Its basic operating idea, to break down bureaucratic walls and keep human translators at the center of the enterprise, stands in contrast to the government’s traditional approach to the translation problem. Since the late 1940s, the U.S. government and its research agencies have spent huge sums trying to build the ultimate spy computer, which would automatically translate any sentence in any language, whether spoken or written, into graceful English. These efforts have provided some limited tools but haven’t delivered on the larger vision. Computers simply aren’t very effective at decoding language complexities that humans easily interpret.
The new translation center represents a major shift in the kinds of translation technologies that the government seeks to develop. In essence, the dream of universal machine translation is being pushed aside in favor of a fresh thrust: one in which a variety of tools are developed not to replace humans but to assist them. “It’s a model for how the government will deal with foreign language in the 21st century,” says William Rivers, the assistant research director of the Center for the Advanced Study of Language at the University of Maryland.
For example, say Rivers and others, forget about software that can translate printed Arabic: analysts would benefit enormously if software could simply make Arabic writing easier to read, so that recovered documents could be more efficiently processed. “The government is appallingly behind on computer-assisted translation, because they’ve invested all this money on machine translation,” says Kevin Hendzel, a former Russian linguist for the White House and now chief operating officer at Aset International Services, a translation agency in Arlington, VA.
For some old language hands, the new center is the realization of a long-held vision. “Ever since the 1970s, when the first PCs became available online, we’ve thought about how to link inexperienced translators to online dictionaries and expert assistance,” says Glenn Nordin, a language intelligence official at the Department of Defense. “The [National Virtual Translation Center] is a dream of 20 years come true.”
If it works, the center will tie together emerging translation-assistance technologies and deploy them efficiently on a massive scale. Administered from the DC offices, the translation web will also leverage the skills of people spread all over the U.S.: professors, contract translators, and government linguists, greenhorn 20-somethings and retirees alike. “I can still reach the government [linguist] who has retired to Pocatello, Idaho, so we don’t lose those skills out the door,” says Jordan, who is the center’s new director. “Right now, that’s our only option.”
A Translation Web for National Security
The new National Virtual Translation Center aims to rapidly process foreign-language intelligence by forming a secure communications network and providing assistance technologies to human translators. In this hypothetical example, documents captured in Afghanistan are processed with the help of technologies from scanning software to shared databases of translated phrases.
1. U.S. operatives capture a box of waterlogged documents in Kabul, Afghanistan, and use next-generation scanning software to digitize them. The digitized documents are sent to the Defense Intelligence Agency at the Pentagon.
2. Faced with an overload, the Defense Intelligence Agency farms the job to the National Virtual Translation Center in Washington, DC. There, analysts use advanced software to flag important names and terms.
3. Most of the documents are free of ominous language, but one contains the word “fermenter.” The center forwards this document to a retired Arabic translator with bioweapons expertise living in Idaho.
4. The Idaho translator does a partial translation using databases called translation memories that store common phrases. He determines that the document discusses pharmaceuticals and does not indicate a threat.
A suite of technologies (tools that can digitize, parse, and digest raw material for that translator in Pocatello) is at the heart of the center’s efforts. From scanning software that recognizes more languages, to better databases that facilitate searches for translated phrases, to Web-based collaboration methods, the technological focus is on helping the linguist on the front lines.
The first step is simply to give the translator his or her work in digital form. “It’s not only making the text machine-readable, but machine-tractable,” says Jordan. “If you want to highlight something in the text and send it off to an offline dictionary or glossary, the machine really needs to be able to grab hold of the text and move it around.” That requires digitizing printed documents so they can be handled by word-processing software.
Yet the enabling technology, optical character recognition software that recognizes printed characters and renders them in a standard digital font, leaves a lot to be desired. For one thing, it has been developed mainly for the Roman alphabet and the major European languages; for another, it follows language-specific rules and looks for distinct symbols (the letter R, for instance) whose boundaries and shape it has been programmed to recognize. While such technology exists for Arabic (and for languages like Dari and Pashto, which are written in variations of the Arabic alphabet), it’s reliable only on clearly printed documents, not smudged, blurry, or handwritten ones.
This rule-bound approach slows the addition of new languages to scanning software, since a human expert must write out each new set of rules. On the ground, it makes translators’ work difficult from the start, often forcing them to fall back on pen and paper. “If the technology isn’t up to date, it could be you and your dictionary,” says the Italian translator.
Improving the situation requires a new approach. A research group at BBN, a Verizon subsidiary based in Cambridge, MA, has developed a more flexible, trainable version of optical character recognition. For example, instead of just looking for an R, the system looks for a range of shapes that might be Rs and hunts for possible matches in a list of models created during its training. This approach is more effective with blurry text and can be adapted to a wider range of languages, says Prem Natarajan, the technical lead on the project. Indeed, he says, the system has a high rate of accuracy with seven writing systems, among them Chinese, Japanese, Arabic, and Thai. The project already gets federal funding; its end product may be deployed by the new translation center.
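The trainable idea can be pictured with a toy sketch. This is illustrative only, not BBN’s actual system: each character keeps several learned shape models, and an unknown glyph is assigned to whichever character’s model it most resembles, so one letter can match a range of shapes rather than a single hard-coded template.

```python
# Illustrative sketch of trainable character recognition (not BBN's
# actual system): score a glyph against shape models learned from
# examples, and pick the best-matching character.

from collections import defaultdict

Glyph = list[str]  # a character bitmap: rows of '#' (ink) and '.' (blank)

def pixels(glyph: Glyph) -> set[tuple[int, int]]:
    """Coordinates of inked pixels in a glyph bitmap."""
    return {(r, c) for r, row in enumerate(glyph)
            for c, ch in enumerate(row) if ch == "#"}

def similarity(a: Glyph, b: Glyph) -> float:
    """Jaccard overlap of inked pixels: tolerant of small smudges."""
    pa, pb = pixels(a), pixels(b)
    return len(pa & pb) / len(pa | pb) if pa | pb else 1.0

class TrainableOCR:
    """Keeps several trained models per character, so a letter matches
    a range of shapes instead of one hard-coded rule."""

    def __init__(self) -> None:
        self.models: dict[str, list[Glyph]] = defaultdict(list)

    def train(self, char: str, example: Glyph) -> None:
        self.models[char].append(example)

    def classify(self, glyph: Glyph) -> str:
        # The best score over every model of every character wins.
        return max(
            ((char, similarity(glyph, m))
             for char, ms in self.models.items() for m in ms),
            key=lambda cm: cm[1],
        )[0]
```

Trained on a few examples each of “R” and “P”, such a recognizer can still classify a glyph with a missing or smudged pixel, because it ranks candidates by overall resemblance rather than demanding an exact boundary match.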
With the BBN technology, documents in more languages could be scanned on the spot, say in a cave in Afghanistan, and beamed directly to a translator. But getting the text in digital form is only the start. Another step in the virtual-translation center’s technology assistance plan is to find and deploy software that can look for key words or phrases and flag suspicious documents for a closer look. These technologies can be applied to both digital text files and transcripts of voice intercepts.
It’s not a new idea. But again, the existing software works mostly with European languages, for which private business markets have long provided the motivation for commercial development. The problem is that not all languages have structures and “words” that can be searched for in the same manner as those of, say, English or French. Arabic words, for instance, are generally written without vowels, and they include grammatical elements, such as markers for plurals, that make this rough translation process, often called “gisting,” more difficult. But one new gisting tool will help do rough terminology searches on materials in Arabic. Called the Rosette Arabic Language Analyzer and developed by Basis Technology, a multilingual-software company based in Cambridge, MA, it improves searches by regularizing the spelling of Arabic words and removing confusing grammatical additions. The tool does the brute, repetitive work and lets humans spend more time actually analyzing the information.
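The normalization idea behind such tools can be suggested with a heavily simplified sketch. Everything here is illustrative, not Rosette’s actual algorithm: the words are transliterated into Latin letters for readability, and the prefix and vowel lists are toy stand-ins for real Arabic morphology. The point is that stripping a few common prefixes and short vowels lets variant spellings of one word match a single watch-list term.

```python
# Toy sketch of "gisting" normalization (illustrative only): strip a
# few common Arabic prefixes and short vowels, transliterated here,
# so variant spellings of a word collapse to one searchable form.

# Hypothetical, simplified data for illustration.
PREFIXES = ("wal", "wa", "bil", "al")   # and-the, and, with-the, the
DIACRITICS = "aiu"                      # short vowels, often unwritten

def normalize(token: str) -> str:
    """Reduce a token to a rough, vowel-free stem."""
    token = token.lower()
    for p in PREFIXES:
        # Only strip if a plausible stem remains afterward.
        if token.startswith(p) and len(token) > len(p) + 2:
            token = token[len(p):]
            break
    # Drop short vowels so "mukhtabar" and "mkhtbr" match each other.
    return "".join(ch for ch in token if ch not in DIACRITICS)

def flag(text: str, watchlist: set[str]) -> list[str]:
    """Return the words in a text that hit the watch list after
    both sides are normalized."""
    normalized_watch = {normalize(w) for w in watchlist}
    return [t for t in text.split() if normalize(t) in normalized_watch]
```

Under this sketch, a watch-list entry such as “mukhtabar” (laboratory) would also catch “walmukhtabar” (“and the laboratory”), the kind of prefix-glued variant that defeats naive string search.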
After a document has been digitized and gisted, a linguist can grapple with it using another emerging assistance technology: translation memory. A translation memory works sort of like a spell-check application; it selects a chunk of text (a word fragment, several words, or whole sentences) and matches that chunk against previously translated material, saving time and improving accuracy by providing at least a partial translation. It’s already a key tool in the medical and legal industries, where the same jargon frequently crops up in different languages.
Now, translation memory technology is being applied to intelligence work, which involves reading documents that are anything but formatted. Trados of Sunnyvale, CA, the biggest manufacturer of translation memory software, hopes to provide analysts with customized translation memories that work with summarized texts, not just full translations, says Mandy Pet, a Trados vice president. Last year Trados announced it was providing its translation memory software to the FBI’s Language Division, which will be working closely with the new center.
Tying all these tools together will be an Internet-based system that will allow the new center to quickly dispatch projects to analysts in the field and will help those far-flung analysts more rapidly and accurately collaborate on the same projects. The precise, highly secure Web architecture that will allow this kind of collaboration is still under construction. Kathleen Egan, the translation center’s technical director, says part of the challenge is to ensure that individual federal intelligence agencies keep their secrets, not only from hackers and terrorist infiltrators but also from other federal agencies. This will require modifications of existing Internet collaboration software to allow sharing of some databases while protecting proprietary information.
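One plausible shape for that kind of selective sharing, sketched here with invented agency names and a made-up releasability flag (nothing in the article specifies the actual mechanism), is to tag every record in a shared database with its owning agency and a marker saying whether the owner has cleared it for cross-agency view, then filter queries accordingly.

```python
# Hypothetical sketch of selective inter-agency sharing: each record
# carries an owning agency and a releasability marking, and a query
# from another agency sees only what the owner has released.
# Agency names and markings are invented for illustration.

from dataclasses import dataclass

@dataclass
class Record:
    owner: str         # agency that contributed the record
    releasable: bool   # owner has cleared it for cross-agency sharing
    text: str

def visible_to(records: list[Record], agency: str) -> list[str]:
    """An agency sees its own records plus others' released records."""
    return [r.text for r in records if r.owner == agency or r.releasable]
```

The design choice this illustrates is the one Egan describes: sharing happens record by record at the owner’s discretion, rather than by merging whole agency databases.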
The most important question about the new translation center remains: will it have a noticeable impact on the nation’s translation crisis? Some observers are skeptical. Robert David Steele, a former CIA officer who is now an intelligence community gadfly, puts it bluntly: “The FBI will fail because they lack the mindset to understand networks, translators without security clearances, and ad hoc contracting.” He predicts the center will join other grand federal efforts that proved to have dubious value (see “DC’s Digital Dysfunction,” sidebar). Meanwhile, other experts still see the future as lying in machine translation, and are working hard, often with government funding, to realize that vision. The new translation center “is the only way to go in the short run. But they may have to revisit that decision when technology overtakes it,” says Jaime Carbonell, a computer scientist at Carnegie Mellon University and the chief scientific advisor for Meaningful Machines, a New York City startup developing machine translation products based on advanced statistical methods.
Whether or not automated computing tools overtake the more garden-variety helpers, though, the new center’s role should be key. And its efforts could have a number of payoffs. The translation assistance technologies shepherded by the center could improve the U.S. government’s ability to deal with information that’s written in non-Roman alphabets, which will speed visa applications, passport checks at borders, and even tax returns. They could also make translation by corporations cheaper and faster. For example, better optical character recognition means cost savings for companies that read forms, such as bank checks or standardized tests. Large organizations like the World Bank and the United Nations have huge stockpiles of multilingual documents they want to digitize and put online.