E-Records Forum Takeaways: Making Sense of Big Data

This article is the third in a series of our takeaways from the 2012 NAGARA E-Records Forum. Presentations from the E-Records Forum are available on the NAGARA website.

By Angela Ossar, Government Information Analyst

The Big Data sessions at E-Records were — and I don’t use this word lightly — breathtaking. Heart-pounding.  Exhilarating. Of course, I was only able to understand why these research presentations were so impressive after the NARA employee sitting next to me explained it to me, but now I hope to be able to convey what I did learn in a way that will do it justice.

What do I mean by “big data”? Put simply, it just means extremely large and complex data sets. (Honestly, Wikipedia’s pretty helpful here: “big data consists of data sets that grow so large and complex that they become awkward to work with using on-hand database management tools.” Source)

The National Science Foundation has sponsored, and NARA has supplemented funding for, two grant projects to analyze big data and better understand the specific challenges of digital preservation and access. Computer scientists in the Image and Spatial Data Analysis Group at the National Center for Supercomputing Applications (NCSA), a world-class research institution, are studying two issues critical to preserving digital information: 1) what tools archivists need to preserve e-records; and 2) how to provide searchable access to handwritten information, specifically the 125 terabytes of data that make up the digitized 1940 Census.

Project #1: Tools for Preserving Electronic Records

NCSA is helping NARA understand specific e-records preservation challenges. Kenton McHenry, NCSA research scientist, discussed NCSA’s development of tools and services that will help archivists preserve digital information. Two huge challenges with digital preservation are the diversity of formats (particularly with 3D data) and software obsolescence. In other words, the records created by older systems still exist, but the software needed to read them doesn’t. So the data must be migrated to formats that allow the files to be accessed. And the migration needs to happen in as few steps as possible, because each time the data is migrated into a new format, some loss may occur.

The tools that NCSA has developed tell archivists and records managers:

  • What types of files are out there;
  • What software created each type of file;
  • What formats the files can be converted to;
  • What software you need to do that conversion;
  • What level of loss will occur during the migration; and
  • What type of loss will occur from one format to another.

The specific tools created for this analysis are the Conversion Software Registry, Polyglot (which actually does the conversion), and Versus.
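To make the "as few steps as possible, with as little loss as possible" idea concrete, here is a minimal sketch of how one might search a table of known conversions for the least lossy route between two formats. This is not how the Conversion Software Registry or Polyglot actually work; the format names, software names, and loss scores are all invented for illustration, and the sketch assumes that loss simply adds up from step to step.

```python
import heapq

# Hypothetical conversion table: source format -> (target format, software, loss score).
# Every format, tool name, and loss score here is made up for illustration.
CONVERSIONS = {
    "legacy3d": [("obj", "ToolA", 20), ("step", "ToolB", 5)],
    "step": [("obj", "ToolC", 10), ("x3d", "ToolD", 10)],
    "obj": [("x3d", "ToolE", 2)],
}


def best_path(graph, source, target):
    """Return (total_loss, steps) for the lowest-loss conversion chain, or None."""
    # Queue entries: (accumulated loss, number of steps, current format, steps so far).
    # Ordering by loss first and step count second prefers shorter chains on ties.
    queue = [(0, 0, source, [])]
    visited = set()
    while queue:
        loss, hops, fmt, steps = heapq.heappop(queue)
        if fmt == target:
            return loss, steps
        if fmt in visited:
            continue
        visited.add(fmt)
        for nxt, software, edge_loss in graph.get(fmt, []):
            if nxt not in visited:
                step = (fmt, nxt, software)
                heapq.heappush(queue, (loss + edge_loss, hops + 1, nxt, steps + [step]))
    return None  # no chain of conversions reaches the target format


result = best_path(CONVERSIONS, "legacy3d", "x3d")
print(result)
```

With this made-up table, the search picks the two-step route through "step" over the longer three-step route, because the extra conversion adds more loss than it saves.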

NCSA has already conducted research into identifying an optimal file format for long-term preservation, one that “maximize[s] accessibility while simultaneously minimizing information loss” (Project Description). Now, they have developed tools to tell us not only what level of loss occurs when migrating from format to format, but also what type of loss. And they have developed tools to actually convert information to these optimal file formats in a “massively scalable manner,” because these are huge data sets that require migration on a massive scale.

Project #2: Searchable Access to Handwritten Records

Did you know that the 1940 Census was recently released?  If you haven’t checked out NARA’s 1940 Census website yet, you absolutely should, but I warn you: it will absorb you completely. I admit to spending almost an hour last night just looking for my grandmother’s name…

…and that’s the problem. It’s handwritten data; there is no search box. Certainly this is not a new problem for genealogists; in 2003, Wendy Duff and Catherine Johnson published an article in volume 66 of American Archivist entitled “Where Is the List with All the Names? Information-Seeking Behavior of Genealogists” (SAA members may click here to access the journal’s archive). Drawing on in-depth interviews with genealogists, they found that half of the researchers wanted “lists of names, or names indexes, or search engines that retrieved by name” (p. 85) to help them find the names of their family members.

Now that it’s been digitized, researchers are incredibly fortunate to have the entire 1940 Census available online:  all 3.8 million images.  Almost four million images that look like this:

An indexing project is currently underway; any volunteer can download indexing software and manually transcribe the information on each page, then submit that information back to the website. For-profit companies are also paying employees to do this transcription, and the project is expected to be complete in 6-9 months.  It’ll be a boon to researchers to be able to search that information…but, unlike the images themselves, which are freely available through the National Archives, this searchable information will only be available through the companies — for a fee. And 100% accuracy is still not guaranteed.

What if there were a way for computers to just “read” the census and index it for us?  Kenton McHenry gave a second presentation about NCSA’s research into developing tools for “low cost searchable access to digital archives of handwritten forms” (source – which will tell you more about how these tools work; I cannot hope to explain them to you!)

The cynics at our table wondered: How is this different from what the post office does? So I asked Kenton that question after the session. He let out a bemused sigh: “We get that all the time.” He explained that post office sorting is much simpler because of the limited data it searches. The post office’s machine relies on numbers: first it reads the ZIP code, which narrows the search to the possible locations within that ZIP code. Then it moves on to the street number, looking for all possible street names associated with that number. For messier handwriting, they rely on humans to transcribe the handwritten information. This technology just wouldn’t work for the Census, where one field does not narrow down another in the same way: the machine couldn’t, for example, produce the name “Rose” based on the relationship “Wife.”
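For the curious, the narrowing-down trick Kenton described can be sketched in a few lines. This is only an illustration of the general idea, not the post office’s actual system: the lookup table is sample data I made up for the example, the function name is hypothetical, and the fuzzy matching comes from Python’s standard difflib module.

```python
import difflib

# Illustrative sample data only: a tiny lookup of known street names per ZIP code.
STREETS_BY_ZIP = {
    "78701": ["Congress Avenue", "Lavaca Street", "Brazos Street"],
}


def resolve_street(zip_code, noisy_reading, cutoff=0.6):
    """Return the known street closest to a noisy machine reading, or None."""
    # Step 1: the ZIP code shrinks the search to a short list of candidates.
    candidates = STREETS_BY_ZIP.get(zip_code, [])
    # Step 2: pick the candidate most similar to the garbled reading, if any is close enough.
    matches = difflib.get_close_matches(noisy_reading, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None


print(resolve_street("78701", "Lavoca Streel"))  # a misread street name in a known ZIP
print(resolve_street("00000", "Rose"))  # nothing to narrow the search, so no match
```

The second call is the Census situation in miniature: with no short list of valid answers to check against, a garbled reading has nothing to snap to.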

It’s hard to believe these tools now exist, but it’s thrilling that they do. Kudos to the National Archives for enlisting NCSA’s computer scientists so that archivists and records managers may begin to understand the challenges inherent in digital preservation and access.