A suite of technologies, tools that can digitize, parse, and digest raw material for that translator in Pocatello, is at the heart of the center’s efforts. From scanning software that recognizes more languages, to better databases that facilitate searches for translated phrases, to Web-based collaboration methods, the technological focus is on helping the linguist on the front lines.
The first step is simply to give the translator his or her work in digital form. “It’s not only making the text machine-readable, but machine-tractable,” says Jordan. “If you want to highlight something in the text and send it off to an offline dictionary or glossary, the machine really needs to be able to grab hold of the text and move it around.” That requires digitizing printed documents so they can be handled by word-processing software.
Yet the enabling technology, optical character recognition software that recognizes printed characters and renders them in a standard digital font, leaves a lot to be desired. For one thing, it has been developed mainly for the Roman alphabet and the major European languages; for another, it follows language-specific rules and looks for distinct symbols (the letter R, for instance) whose boundaries and shape it has been programmed to recognize. While such technology exists for Arabic (and languages like Dari and Pashto, which are written in variations of the Arabic alphabet), it’s reliable only on clearly printed documents, not smudged, blurry, or handwritten ones.
This technological void slows the process of adding new languages to the scanning software, requiring a human expert to write out new rules. On the ground, it makes translators’ work difficult from the start, often forcing them to do their work with pen and paper. “If the technology isn’t up to date, it could be you and your dictionary,” says the Italian translator.
Improving the situation requires a new approach. A research group at BBN, a Verizon subsidiary based in Cambridge, MA, has developed a more flexible, trainable version of optical character recognition. For example, instead of just looking for an R, the system looks for a range of shapes that might be Rs and hunts for possible matches in a list of models created during its training. This approach is more effective with blurry text and can be adapted to a wider range of languages, says Prem Natarajan, the technical lead on the project. Indeed, he says, the system has a high rate of accuracy with seven writing systems, Chinese, Japanese, Arabic, and Thai among them. The project already gets federal funding; its end product may be deployed by the new translation center.
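The contrast between rule-based and model-based recognition can be sketched in a toy example. BBN’s actual system is far more sophisticated (and proprietary); the tiny bitmaps and names below are invented purely to illustrate the idea of scoring a glyph against several trained shape templates per character and keeping the best match.

```python
# Toy model-based character recognition: each character keeps several
# trained shape templates (here, 3x3 binary bitmaps), and a glyph is
# assigned to whichever template it overlaps best. All templates are
# invented for illustration.

# Two variant templates for "R" and one for "P".
MODELS = {
    "R": [
        (1, 1, 0,
         1, 1, 0,
         1, 0, 1),
        (1, 1, 1,
         1, 1, 0,
         1, 0, 1),
    ],
    "P": [
        (1, 1, 1,
         1, 1, 1,
         1, 0, 0),
    ],
}

def similarity(a, b):
    """Fraction of pixels on which two bitmaps agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def recognize(glyph):
    """Return the character whose best template matches the glyph most closely."""
    best_char, best_score = None, -1.0
    for char, templates in MODELS.items():
        score = max(similarity(glyph, t) for t in templates)
        if score > best_score:
            best_char, best_score = char, score
    return best_char, best_score

# A smudged glyph that matches no template exactly, but "R" most closely.
blurry = (1, 1, 0,
          1, 1, 0,
          1, 1, 1)
char, score = recognize(blurry)
print(char, round(score, 2))  # R 0.89
```

Because the decision is a best-of-many score rather than a pass/fail rule, a degraded glyph still lands on a plausible reading, which is roughly why the trainable approach tolerates blur better than rule-based recognizers.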
With the BBN technology, documents in more languages could be scanned on the spot (say, in a cave in Afghanistan) and beamed directly to a translator. But getting the text in digital form is only the start. Another step in the virtual-translation center’s technology assistance plan is to find and deploy software that can look for key words or phrases and flag suspicious documents for a closer look. These technologies can be applied to both digital text files and transcripts of voice intercepts.
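In its simplest form, that kind of triage is just matching documents against a watch list and surfacing the hits for a human translator. The phrases, document IDs, and texts below are invented for illustration:

```python
# A minimal sketch of keyword triage: scan digitized documents for
# watch-list phrases and flag those that warrant a closer look.
WATCH_LIST = {"weapons cache", "border crossing", "safe house"}

def flag_suspicious(documents):
    """Return (doc_id, matched phrases) for every document that hits the list."""
    flagged = []
    for doc_id, text in documents.items():
        hits = {phrase for phrase in WATCH_LIST if phrase in text.lower()}
        if hits:
            flagged.append((doc_id, sorted(hits)))
    return flagged

docs = {
    "doc-001": "Routine supply manifest for the northern district.",
    "doc-002": "Move the shipment to the Safe House near the border crossing.",
}
print(flag_suspicious(docs))  # only doc-002 is flagged, on two phrases
```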
It’s not a new idea. But again, the existing software works mostly with European languages, for which private business markets have long provided the motivation for commercial development. The problem is that not all languages have structures and “words” that can be searched for in the same manner as those of, say, English or French. Arabic words, for instance, are generally written without vowels, and they include grammatical elements, such as markers for plurals, that make this rough translation process, often called “gisting,” more difficult. But one new gisting tool will help do rough terminology searches on materials in Arabic. The Rosette Arabic Language Analyzer, developed by Basis Technology, a multilingual-software company based in Cambridge, MA, improves searches by regularizing the spelling of Arabic words and removing confusing grammatical additions. The tool does the brute, repetitive work and lets humans spend more time actually analyzing the information.
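A crude sketch shows the kind of regularization involved. This is not Basis Technology’s actual algorithm, only an illustration of the general technique: strip the short-vowel marks, unify common letter variants, and drop the definite-article prefix so that different spellings of the same Arabic word reduce to one searchable key.

```python
import re

# Illustrative Arabic regularization for search (not Rosette's algorithm):
# remove diacritics, unify letter variants, and strip the "al-" prefix.

DIACRITICS = re.compile(r"[\u064B-\u0652]")  # fatha, damma, kasra, shadda, etc.

def normalize_arabic(word):
    word = DIACRITICS.sub("", word)                          # drop short-vowel marks
    word = re.sub(r"[\u0623\u0625\u0622]", "\u0627", word)   # alef variants -> bare alef
    word = word.replace("\u0629", "\u0647")                  # ta marbuta -> ha
    if word.startswith("\u0627\u0644") and len(word) > 3:
        word = word[2:]                                      # drop definite article "al-"
    return word

# "الكتاب" (al-kitab, "the book") and a fully vocalized "كِتَاب"
# both reduce to the same key, so a search matches either spelling.
print(normalize_arabic("الكتاب") == normalize_arabic("كِتَاب"))  # True
```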
After a document has been digitized and gisted, a linguist can grapple with it using another emerging assistance technology: translation memory. A translation memory works somewhat like a spell-check application: it selects a chunk of text (a word fragment, several words, or whole sentences) and matches that chunk against previously translated material, saving time and improving accuracy by providing at least a partial translation. It’s already a key tool in the medical and legal industries, where the same jargon frequently crops up in different languages.
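The lookup step can be sketched with the Python standard library’s fuzzy string matcher. Commercial products segment text and score matches far more carefully; the two-entry memory and its French renderings below are invented for illustration.

```python
import difflib

# A minimal translation-memory lookup: find the stored source segment
# closest to the new one and return its translation as a starting point.
MEMORY = {
    "the meeting is scheduled for tuesday": "la reunion est prevue pour mardi",
    "deliver the documents by courier": "livrez les documents par coursier",
}

def lookup(segment, threshold=0.75):
    """Return (matched source, translation, score) for the closest stored
    segment above the threshold, or None if nothing is close enough."""
    best = None
    for source, target in MEMORY.items():
        score = difflib.SequenceMatcher(None, segment.lower(), source).ratio()
        if score >= threshold and (best is None or score > best[2]):
            best = (source, target, score)
    return best

hit = lookup("The meeting is scheduled for Thursday")
print(hit)  # a close but partial match the linguist can finish by hand
```

The near-miss is the point: the tool hands the linguist a mostly correct draft (here, a sentence that still says “mardi”) rather than a blank page, which is where the time savings come from.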
Now translation memory technology is being applied to intelligence work, which involves reading documents that are anything but formulaic. Trados of Sunnyvale, CA, the biggest maker of translation memory software, hopes to provide analysts with customized translation memories that work with summarized texts, not just full translations, says Mandy Pet, a Trados vice president. Last year Trados announced it was providing its translation memory software to the FBI’s Language Division, which will be working closely with the new center.
Tying all these tools together will be an Internet-based system that will allow the new center to quickly dispatch projects to analysts in the field and will help those far-flung analysts collaborate more rapidly and accurately on the same projects. The highly secure Web architecture that will allow this kind of collaboration is still under construction. Kathleen Egan, the translation center’s technical director, says part of the challenge is to ensure that individual federal intelligence agencies keep their secrets, not only from hackers and terrorist infiltrators, but also from other federal agencies. This will require modifying existing Internet collaboration software to allow sharing of some databases while protecting proprietary information.