Data Mining Reveals Historical Events in Government Archive Records

The course of history is often hidden in government archives. Now statisticians have worked out how to extract the most significant events using data-mining techniques.

The study of political history is undergoing a revolution. The driving force behind this change is the availability of electronic records of government activity and the ability to number-crunch this data in ways that have never before been possible.

Historians have immediate access to contemporary news reports of important world events. But these reports need to be calibrated against government records of the time, which may differ in important and unknown ways, and those records are not usually made public until some years later. Even then, the sheer number of government documents makes the process difficult, since it is all too easy for researchers to overlook something significant.

So an automated way of searching these government records for significant events would be hugely useful.

Data mining reveals the spike in government communications during the fall of Saigon (top) but little of significance related to Finland (bottom).

Today, Yuanjun Gao at Columbia University in New York and a few pals say they have developed a set of statistical tools that can do this job. They have tested the technique on U.S. government electronic records released by the National Archives and compared the results with contemporary news records.

The work is made possible by the fact that the U.S. State Department has been storing classified communications electronically since 1973. Many of the records dating from 1973 to 1977 are now publicly available and consist of 1.4 million declassified cables and the metadata associated with 0.4 million documents delivered by diplomatic pouch.

These electronic records are all assigned tags related to their topic. So communications related to South Vietnam are tagged VS, those related to the United Nations General Assembly are tagged UNGA, those related to Finland are tagged FI, and so on.

That allows researchers to track communications related to specific topics without knowing the details of the messages themselves.

Plotting the number of messages over time reveals some interesting patterns. For example, it reveals an increase in communications relating to South Vietnam in April 1975 at the time of the fall of Saigon.
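The counting step behind such a plot can be sketched in a few lines of Python. This is only an illustration, not the code used by Gao and colleagues: the record layout (a date plus a set of tags) and the function name `monthly_counts` are assumptions, and the sample records are invented for the example.

```python
from collections import Counter
from datetime import date


def monthly_counts(records, tag):
    """Count the records carrying `tag`, grouped by (year, month)."""
    counts = Counter()
    for sent_on, tags in records:
        if tag in tags:
            counts[(sent_on.year, sent_on.month)] += 1
    # Return the months in chronological order.
    return dict(sorted(counts.items()))


# Hypothetical sample records: (date sent, set of topic tags).
records = [
    (date(1975, 3, 14), {"VS"}),
    (date(1975, 4, 2), {"VS", "UNGA"}),
    (date(1975, 4, 21), {"VS"}),
    (date(1975, 4, 29), {"VS"}),
    (date(1975, 4, 30), {"FI"}),
]

print(monthly_counts(records, "VS"))
```

Only the tags and dates are needed here, which is why the approach works on metadata alone, without reading the messages themselves.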

Communications related to the UN General Assembly spike at regular intervals corresponding to the dates of the assemblies, which occur every year. This pattern includes an additional spike in April-May 1974, which corresponds to a special UN session called for by Algeria to discuss its demands for support for a “New International Economic Order.”

By contrast, the tags associated with Finland between 1973 and 1977 show no spikes or special pattern, reflecting the stability of the nation at that time.

Major spikes are relatively easy to see visually, but Gao and co have developed tools to spot them automatically and rank their significance by comparing them to the background level of activity at the time.
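To make the idea of comparing a spike to the background concrete, here is a rough Python sketch. It is not the statistical method from the paper, which the article does not describe in detail. It swaps in a simple stand-in: each period's count is scored by how many standard deviations it sits above the mean of the preceding periods. The function names, the `window` parameter and the sample counts are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def spike_scores(counts, window=6):
    """Score each period against the `window` periods that precede it."""
    scores = []
    for i in range(window, len(counts)):
        background = counts[i - window:i]
        mu = mean(background)
        # Fall back to 1.0 so a perfectly flat background cannot divide by zero.
        sigma = pstdev(background) or 1.0
        scores.append((i, (counts[i] - mu) / sigma))
    return scores


def rank_spikes(series_by_tag, window=6, top=3):
    """Pool the scores from every tag and return the highest-scoring periods."""
    ranked = []
    for tag, counts in series_by_tag.items():
        for i, score in spike_scores(counts, window):
            ranked.append((score, tag, i))
    ranked.sort(reverse=True)
    return ranked[:top]


# Invented monthly counts: one tag with a sudden burst, one that stays flat.
series = {
    "VS": [100, 110, 95, 105, 100, 90, 400, 120],
    "FI": [20, 22, 19, 21, 20, 18, 21, 20],
}

for score, tag, period in rank_spikes(series, top=2):
    print(tag, period, round(score, 1))
```

With these made-up numbers the burst in the VS series ranks first, while the flat FI series produces only a small score. A real analysis would need a more careful background model, for example one that accounts for seasonal patterns such as the annual UN General Assembly sessions.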

That allows the team to produce a ranking of the top 30 most significant “events” during this period.

Not all of these spikes correspond to important world events. For example, the top two most significant spikes relate to administrative issues such as transport and changes to the visa records system.

But the ranking also picks out a wide range of important events, including the Carter administration’s prioritization of human rights, Egyptian president Anwar Sadat’s surprise visit to Israel in 1977, the Southeast Asian “Boat People” crisis of 1975-76, the 1973 Yom Kippur War, and Portugal’s withdrawal from Angola in 1975-76.

That’s interesting work that should change the way historians and social scientists approach their research. And this is just the beginning. These kinds of automated methods can also show how humans used the tagging process at the time, how often mistakes were made, and whether certain types of messages may have gone missing.

And as more detailed electronic records are released, researchers will be able to develop more advanced ways of mining them. Political history will never be the same.

Ref: arxiv.org/abs/1712.07319: Mining Events with Declassified Diplomatic Documents
