Former Enron executive Vincent Kaminski is a modest, semi-retired business school professor from Houston who recently wrote a 960-page book explaining the fundamentals of energy markets. His most lasting legacy, however, may involve thousands of e-mails he wrote more than a decade ago at the energy-services company.
Kaminski, a former managing director for research who warned repeatedly about concerning practices he saw at Enron, is among more than 150 senior executives whose e-mail boxes were dumped onto the Internet by the Federal Energy Regulatory Commission (FERC) on March 26, 2003. In the name of serving the public’s interest during its investigation of Enron, the federal agency made the controversial decision to post online more than 1.6 million e-mails that Enron executives sent and received from 2000 through 2002. After receiving complaints, FERC eventually culled the trove to remove the most sensitive and personal data. Even so, the “Enron e-mail corpus,” as the cleaned-up version is now known, remains by far the largest public domain database of real e-mails in the world.
This corpus is valuable to computer scientists and social-network theorists in ways that the e-mails’ authors and recipients never could have intended. Because it is a rich example of how real people in a real organization use e-mail—full of mundane lunch plans, boring meeting notes, embarrassing flirtations that revealed at least one extramarital affair, and the damning missives that spelled out corruption—it has become the foundation of hundreds of research studies in fields as diverse as machine learning and workplace gender studies.
This research has had widespread applications: computer scientists have used the corpus to train systems that automatically prioritize certain messages in an in-box and alert users that they may have forgotten about an important message. Other researchers use the Enron corpus to develop systems that automatically organize or summarize messages. Much of today’s software for fraud detection, counterterrorism operations, and mining workplace behavioral patterns over e-mail has been somehow touched by the data set.
“It’s like we are studying yeast,” says William Cohen, a Carnegie Mellon University computer scientist who helped put the corpus in a database that could be mined by researchers. “It’s studied and experimented on because it is a very well understood model organism. [The e-mail generated by] Enron is similar. People are going to keep using it for a long time.”
The Enron e-mails were given their extended life by scientists at MIT, Carnegie Mellon University, and the nonprofit research institute SRI International. Ten years ago, researchers at these institutions were collaborating on the DARPA-funded CALO project, whose name stands for “Cognitive Assistant that Learns and Organizes” and whose biggest claim to fame is giving rise to Apple’s Siri software. For CALO, the researchers had been cobbling together much smaller e-mail data sets to analyze.
When the Enron e-mails were posted in 2003, the researchers realized that they could be extremely useful for testing algorithms that could process written language and form the basis of intelligent workplace tools. Because FERC had posted the e-mails in an unusable format, MIT’s Leslie Kaelbling purchased the raw files from a government contractor for $10,000, and others spent time cleaning up the data: weeding out duplicates, organizing folders, taking out the remaining private attachments and e-mails, and mapping the senders and recipients to Enron’s organizational structure. The corpus, at first more than 500,000 e-mails, was whittled down to about 200,000 by 2004.
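The duplicate-weeding step described above can be illustrated with a minimal Python sketch. It is not the procedure the researchers actually used; it assumes only that the same message may sit in several folders of a mailbox and that copies can be matched on sender, recipients, date, subject, and body. The function names `fingerprint` and `deduplicate` are hypothetical, chosen for this illustration.

```python
import hashlib
from email import message_from_string


def fingerprint(raw_message: str) -> str:
    """Hash the fields that identify a message, ignoring folder-specific headers.

    Assumption for this sketch: two copies are "the same e-mail" when sender,
    recipients, date, subject, and whitespace-normalized body all match.
    """
    msg = message_from_string(raw_message)
    body = msg.get_payload()
    if not isinstance(body, str):
        # Multipart message: fall back to hashing the raw text.
        body = raw_message
    parts = [
        (msg.get("From") or "").strip().lower(),
        (msg.get("To") or "").strip().lower(),
        (msg.get("Date") or "").strip(),
        (msg.get("Subject") or "").strip(),
        " ".join(body.split()),  # collapse runs of whitespace
    ]
    return hashlib.sha256("\x00".join(parts).encode("utf-8")).hexdigest()


def deduplicate(raw_messages):
    """Return the messages in their original order, keeping the first copy of each."""
    seen = set()
    unique = []
    for raw in raw_messages:
        key = fingerprint(raw)
        if key not in seen:
            seen.add(key)
            unique.append(raw)
    return unique
```

Calling `deduplicate` on a list of raw message strings returns the first copy of each distinct message. A real cleanup would also need rules for forwarded copies, quoted replies, and attachments, which this sketch ignores.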
A research ecosystem still blooms around the corpus because there is nothing else like it in the public domain. If it didn’t exist, research into business e-mails could be done only by people with access to big corporate or government servers. That probably would exclude social science, organizational, and linguistics researchers—many of whom have used the corpus to glean valuable insights into corporate culture, says Owen Rambow, a Columbia University professor involved in a research project that used the Enron corpus and received a $510,000 grant from the National Science Foundation.
Since 2010, about 30 papers a year have cited the original paper that presented the Enron corpus, Carnegie Mellon’s Cohen estimates. This year, for instance, researchers at HP Labs turned to the corpus to demonstrate an artificial intelligence program for automatically identifying the commitments people make over e-mail. Jafar Adibi, who worked on an early map of the Enron social network, says he still gets a handful of inquiries every month, increasingly from researchers outside the United States. There is still an active listserv devoted to discussing the corpus.
Researchers who have worked with the corpus know there won’t be another Enron. FERC released the e-mails back when the world still had a lot to learn about online privacy. The harms to the people mentioned in them, most of whom were innocent of any wrongdoing at Enron, were quickly apparent. Social Security numbers and even bank records were in there. Though much private data has been removed, browsing hundreds of e-mails in Kaminski’s “sent” folder, I found a home phone number, his wife’s name, and an unflattering opinion he held of a former colleague. I also got the sense that he had been long, long overdue for the promotion he received in 2000. At the time the e-mails were first released, Kaminski, the manager of about 50 employees at Enron, said he was most disturbed to see his back-and-forth communications about HR complaints and job candidate evaluations become public. A job candidate he once interviewed got upset after the e-mails’ release.
Today, many people who work in highly regulated industries like finance avoid putting sensitive information in their e-mails. Kaminski, who later served as a managing director at Citigroup, notes that the acronym “LDL” became popular e-mail lingo in the years following Enron. It stands for “Let’s discuss live.”