This is the final recap from the ARMA Houston 2013 Conference & Expo. (presentations available here)
If We Can Land a Man on the Moon: Meeting the Archivist’s “Go Digital” Recordkeeping Challenge by the End of This Decade
Jason Baron, Director of Litigation at the Office of General Counsel at NARA, gave the keynote address for the 2013 ARMA Houston conference. Baron spoke about the same NARA directive for digital recordkeeping that we had heard about during the E-Records forum here in Austin. However, in this talk, we learned more about the technical side than the policy/training side. He opened by explaining the title of his presentation. He likes to make the title relevant to the city he is visiting, so for Houston, he chose the space race. Interestingly, C. Preston Huff made the same comparison in his talk at the NAGARA E-Records forum. What both men mean is that this goal—that each federal agency will develop and implement plans to manage all permanent records in electronic format by 2019—is as important, daunting, and yet feasible as the goal of landing a man on the moon by the end of the 1960s.
More Data, More Problems
Baron continued by comparing the world of electronic records to Prague Castle, the largest castle complex in the world at nearly 230,000 square feet. The castle is so enormous and complex that it is difficult to know exactly what kinds of buildings, or how many rooms, it contains. Similarly, institutions might not know exactly what they have: legacy platforms may be isolated or hidden, and it is difficult for large organizations to know just how much data they actually hold.
Then it was time for the numbers. An IDC report states that there will be 1,800 new exabytes of data produced this year. It’s difficult to conceptualize just how much information that is, but 1 exabyte is equivalent to roughly 50,000 years of continuous movies. (Baron and a colleague posted a video on YouTube that throws out more of these outrageous numbers. Baron quipped, “It’s set to trance music… don’t ask.”) Basically, it is more data than we can even begin to comprehend, and what we can comprehend is only the tip of the iceberg. A vast amount of information is hidden underneath the surface of the water. For example, you can’t just go looking for federal email; it’s under the surface. But we can’t leave all that data alone – it still needs to be managed.
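The 50,000-year figure holds up to a back-of-the-envelope check. Assuming DVD-quality video at about 4.7 GB per two-hour movie (my assumption, not a figure from the talk):

```python
# Back-of-the-envelope check: how many years of movies fit in one exabyte?
EXABYTE = 10**18                  # bytes
BYTES_PER_MOVIE = 4.7 * 10**9     # one single-layer DVD (assumed)
HOURS_PER_MOVIE = 2               # assumed running time

movies = EXABYTE / BYTES_PER_MOVIE
years = movies * HOURS_PER_MOVIE / 24 / 365.25
print(f"{years:,.0f} years of continuous viewing")
```

That works out to roughly 48,500 years, in the same ballpark as the figure Baron quoted.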
The meat of the talk focused on email, which Baron characterized as the “800 lb gorilla of ediscovery.” As we all know, email is the dominant form of professional communication. Thus, when we talk about managing electronic records, we often focus on managing email. This is where we encounter difficulties, which Baron named “Process Optimization Problems.”
Take Users Out of the Equation
The first problem is the transactional toll of a user-based recordkeeping scheme, or “as-is” RM. This is a fancy way of saying, “Employees don’t want to take the time and effort to manage records; they would rather focus on the jobs they were hired to do.” To explain this roadblock, Baron displayed a quote from John Mancini, President of AIIM:
“If by traditional records management you mean manual systems—even if they are computerized—then I would say traditional records management is dead. The idea that we could get busy people to care about our complicated retention schedules, and drag and drop documents into folders, and manually apply metadata document by document according to an elaborate taxonomy will soon seem as ridiculous as asking a blacksmith to work on a Ferrari.”
In other words, it is unrealistic to expect that end-users will have much interest in doing their part to ensure RM compliance. This is where the “capstone approach” to email will come into play, which we learned about during the E-Records forum. By only saving the email of select top-level offices, end-user involvement will be reduced.
Furthermore, NARA is committed to learning more about automating these processes. Baron emphasized that NARA, the Federal CIO Council, and the Federal Records Council will work with private industry and other stakeholders to produce economically viable automated records management tools, and they will share what they learn with the RM community at large. (For example, a report called Creating Effective Cloud Computing Contracts for the Federal Government is available on the CIO’s website.) This is why we care about what’s going on at the federal level – federal agencies have the resources to research RM techniques and tools that will trickle down to the rest of the RM community.
Sorting through the data
Baron’s second Process Optimization Problem is The Coming Age of Dark Data. By 2017, the Obama administration will have generated roughly one billion emails, but most of them will be dark: the public won’t be able to go to a library and read them. That’s a big problem if we want open and transparent government to be a reality. That volume of email is also problematic when it comes to eDiscovery. Finding relevant emails for litigation is, as we learned during the ARMA Austin Spring Education Seminar, an extremely long and laborious process. Baron explained that “we’re all seduced by keywords,” but keyword search is becoming increasingly ineffective against the rising volume of email. As an illustration, he showed a Boolean search string used in a tobacco lawsuit; it was twelve lines long.
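A Boolean keyword filter of the kind Baron described can be sketched in a few lines. The emails and search terms below are invented for illustration; note how the second, plainly relevant message slips past the filter simply because it uses different vocabulary:

```python
import re

def boolean_match(text, all_of=(), any_of=()):
    """Naive Boolean filter: every term in all_of must appear,
    plus at least one term in any_of (if any_of is given)."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return (all(t in words for t in all_of)
            and (not any_of or any(t in words for t in any_of)))

emails = [
    "Please shred the smoking study memo before the audit.",
    "Destroy the nicotine research files tonight.",  # relevant, but missed
]

hits = [e for e in emails if
        boolean_match(e, all_of=("memo",), any_of=("smoking", "tobacco"))]
print(hits)  # only the first email matches
```

In real litigation the query grows longer and longer (hence the twelve-line search string) precisely to chase down all the synonyms and phrasings a naive term list misses.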
Waiting until you need information and hoping that a Boolean search will find it for you is clearly not a viable strategy when faced with a continuing avalanche of data. One emerging strategy is predictive analytics, or predictive coding. It sounds scary and complicated, but the concept behind it is rather simple: humans teach computers how to recognize patterns (in this case, about records), and then humans feed more data to the computer, which identifies those patterns on its own. The technique is already used in many industries – for example, the ads you see on Facebook or in Gmail that respond to your browsing activity – but is only just starting to be used in eDiscovery. The idea is that, when faced with a million emails (perhaps literally, depending on the case!), you combine the power of the human mind with the power of computing to find relevant documents. First, you give fifty emails to one human reviewer, fifty to another, and so on. Each person decides whether each of those fifty emails is relevant or not; this is known as coding. Based on that coding, the computer infers the relevance of the rest of the million emails. This technique was endorsed by United States Magistrate Judge Andrew J. Peck in Da Silva Moore v. Publicis Groupe (S.D.N.Y. Feb. 24, 2012).
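The human-coding-plus-machine-learning loop can be sketched in miniature. Everything below is invented for illustration – real predictive-coding tools use far more sophisticated statistical models than this bag-of-words overlap score – but the workflow is the same: a reviewer codes a small sample, and the machine extends those judgments to the rest:

```python
import re
from collections import Counter

def tokens(text):
    return re.findall(r"[a-z]+", text.lower())

# Step 1: a human reviewer "codes" a small sample (hypothetical emails).
coded = [
    ("meeting about tobacco marketing strategy", "relevant"),
    ("internal memo on nicotine levels",         "relevant"),
    ("lunch order for friday",                   "not_relevant"),
    ("office holiday party planning",            "not_relevant"),
]

# Step 2: build per-label word profiles from the coded sample.
profiles = {"relevant": Counter(), "not_relevant": Counter()}
for text, label in coded:
    profiles[label].update(tokens(text))

def predict(text):
    """Step 3: score an uncoded email against each label's profile;
    the label whose coded vocabulary overlaps most wins."""
    words = tokens(text)
    scores = {label: sum(counts[w] for w in words)
              for label, counts in profiles.items()}
    return max(scores, key=scores.get)

print(predict("draft memo on tobacco advertising"))      # relevant
print(predict("who is organizing the party on friday"))  # not_relevant
```

The point of the exercise: the humans only read four emails, yet the machine can now take a pass at the other 999,996.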
Although Baron spent quite some time talking about email and eDiscovery issues, these techniques apply to records management as a whole. Some government agencies are starting to use these analytics methods for managing email: they might send emails to the records manager, who categorizes them, and the machine then starts learning from the records manager. Baron emphasized that this kind of change will certainly be a part of our lives in the coming years. We can resist it and bemoan the fact that we are becoming more dependent on machines, or we can simply go with it and embrace these amazing advances in technology. With the vast amounts of data we are facing, we simply cannot do it all ourselves – we have to rely on software to speed the process along. I, for one, don’t mind the extra help!