E-Records Forum 2013: Automating classification of records using High Performance Computers

This is the third post of a multi-part recap of the 2013 NAGARA E-Records Forum.

This year, two speakers from the Texas Advanced Computing Center (TACC) talked about high-performance computers (also called supercomputers) and how they might be useful in records management.


On the first day of the conference, Dr. Bill Barth, Director of High Performance Computing, spoke about Stampede, TACC’s newest supercomputer. At nearly 10 petaflops (10 thousand trillion floating-point operations per second), Stampede is one of the fastest supercomputers in the world. It currently ranks seventh, but those measurements were taken before Stampede was fully online, so Dr. Barth expects it to rank fourth or fifth when the next rankings are announced.

And when TACC says supercomputer, they mean supercomputer: Stampede takes up two huge buildings at the J.J. Pickle Research Campus in Austin. One building houses the computer itself, while the other is dedicated solely to powering and cooling the massive system. One of the most amazing things about Stampede is that 10% of the system is dedicated to open research. That means anyone with a research need can submit an application and, if chosen, use Stampede for their research project. So what are people using Stampede for, and what does it have to do with records management?

[Image: Just one small part of the Stampede supercomputer]

Data-mining for NARA

On the second day of the conference, Dr. Weijia Xu, Manager, Data-Mining and Statistics Group, spoke about how TACC is using its high-performance computers to help NARA perform auto-classification on a collection of 1970s U.S. Department of State cables. Data-mining is a process in which computers analyze large amounts of data to find patterns and identify relationships. Those patterns and relationships are then used to identify historical trends and make predictions about the future.

The NARA project involved studying 450,000 declassified State Department cables (diplomatic messages between the State Department and its consulates and embassies) from 1973-1976. The goal of the project was to help the archivists at NARA better understand the collection and thereby provide better access to researchers. Because of their age, all of the cables have been declassified, but the project used a small subset of them to teach a computer to determine whether a particular cable was originally classified.
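For readers curious what "teaching a computer" with a small labeled subset can look like, here is a minimal sketch. It is not TACC's method, which the talk did not describe at this level of detail. It assumes a simple word-counting approach (naive Bayes), and the four "cables" in TOY_CABLES, the labels, and the function names train and predict are all made up for illustration.

```python
import math
from collections import Counter

# Hypothetical, made-up training examples; not real cables.
TOY_CABLES = [
    ("troop movements near border", "classified"),
    ("secret negotiation with minister", "classified"),
    ("visa schedule for delegation", "unclassified"),
    ("press release on trade fair", "unclassified"),
]


def train(examples):
    """Count how often each word appears under each label."""
    word_counts = {}
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    return word_counts, label_counts


def predict(model, text):
    """Return the label whose word counts best explain the text."""
    word_counts, label_counts = model
    vocabulary = set()
    for counts in word_counts.values():
        vocabulary.update(counts)
    total_examples = sum(label_counts.values())

    best_label, best_score = None, float("-inf")
    for label, n_examples in label_counts.items():
        counts = word_counts[label]
        # Add-one smoothing so unseen words do not zero out the score.
        denominator = sum(counts.values()) + len(vocabulary)
        score = math.log(n_examples / total_examples)
        for word in text.lower().split():
            score += math.log((counts[word] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(TOY_CABLES)
print(predict(model, "secret troop negotiation"))
print(predict(model, "visa schedule for press"))
```

A model like this only knows the words it saw in its training subset, so its guesses are only as good as the match between that vocabulary and the cables it is later asked to label.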

Overall, the computer did a good job of identifying classified cables, but the researchers found one large problem: the diplomatic world can change so fast that words that signaled a classified cable one year might no longer do so the next. They found that training on a subset of cables that was just one year off from the cables being tested could drop accuracy rates from above 90% to below 50%. I asked Dr. Xu if he hoped that this method could someday be used to separate unclassified from classified cables so that the unclassified cables could be released to the public sooner. He said that would be a long-term goal of the project, but unfortunately TACC is not allowed access to currently classified cables, so they won’t be able to work on the project if it gets that far.

It seems like every year the E-Records Forum has at least one presentation that goes over the heads of most or all of the attendees, and TACC’s two presentations certainly fit the bill this year. I, for one, am glad that there are groups like TACC out there working on high-level things that will eventually help people like us do our jobs better. And I’ll definitely update the blog if my application to use Stampede for ten minutes to play Super Mario Brothers is approved.

One thought on “E-Records Forum 2013: Automating classification of records using High Performance Computers”

  1. I enjoyed this post! (as I did the previous posts). Stampede will probably suffer migraine headaches continually in its quest to properly classify/declassify cables. One has to go back to the source and intent of each individual cable. Was the original cable based on one collection source, two, or all-source? If the sole source of the information was human, was it factual information, spin, or totally false information? The original cable may consist of totally false information transmitted using outdated crypto-systems and given a false classification by the source. Stampede has its hands full. It’d be fun to conduct research using Stampede in an effort to create a hacker-proof crypto/comm system as a business continuity contingency back-up (maybe a nationwide LAN so to speak).
