E-Records Forum Takeaways: Email Auto-categorization

This article is the fourth (and final) in a series of our takeaways from the 2012 NAGARA E-Records Forum. Presentations from the E-Records Forum are available on the NAGARA website.

By Angela Ossar, Government Information Analyst

By now, most of us know that there’s no single retention period for email: it has to be kept according to your records retention schedule, and that means classifying each message according to its content. One way to meet this challenge is to create file plans for email based on the categories of records that are created and saved in an office. (And to give state and local governments a place to start with that, we developed some basic training on establishing file plans and manually classifying email into a few general categories.)

I remember hearing, in Archivist of the U.S. David Ferriero’s address to the DC 2010 conference, that NARA was developing a tool to actually analyze the content of email, weed out transitory information, and identify the most important messages for long-term preservation (emails documenting things like policy development). I distinctly remember gasping when I heard that — my computer is going to organize my email for me?

Well, we’re not quite there yet.  But Dr. William Underwood, Principal Research Scientist at the Georgia Tech Research Institute, did present some interesting findings on the auto-categorization of email. His research involved — I’m drastically simplifying here — “teaching” his computer how to classify his email, then setting the computer loose to see how it performed.

He began by discussing an existing tool that Microsoft Outlook users might recognize: the Rules Wizard. Most email systems have some sort of labeling or sorting function that users can set up to auto-file messages with certain subject lines, senders, recipients, etc.
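
To make the contrast with the machine-learning approach concrete, here is a toy sketch of what that kind of rule-based auto-filing does: match on a sender or subject keyword and drop the message into a folder. The rules and the sample message below are invented for illustration, not taken from the presentation.

```python
# Toy rule-based auto-filer, roughly what a Rules Wizard rule does.
# The rules and the sample message are invented placeholders.
RULES = [
    {"field": "sender", "contains": "listserv@", "folder": "Mailing Lists"},
    {"field": "subject", "contains": "maintenance", "folder": "IT Notices"},
]

def file_message(message):
    """Return the folder for the first rule that matches, else the inbox."""
    for rule in RULES:
        if rule["contains"].lower() in message[rule["field"]].lower():
            return rule["folder"]
    return "Inbox"

msg = {"sender": "it-listserv@example.gov", "subject": "Server maintenance window"}
print(file_message(msg))  # "Mailing Lists"; the first matching rule wins
```

Rules like these only look at surface features (who sent it, what the subject says), which is exactly why content-based categorization is the harder and more interesting problem.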

The experiment began with Underwood selecting a sample of 577 emails representing the common types of email he receives. Those emails fell into 6 categories on the University System of Georgia (USG) Records Retention Schedule. Georgia’s categories are actually similar to a few of the records series from the Texas State Records Retention Schedule and Local Schedule GR, including Administrative Correspondence, General Correspondence, and Transitory Correspondence. Georgia’s schedule further breaks out emails documenting advisory board activities, special events, and computer system maintenance. These were Underwood’s 6 categories.

After selecting this sample, he converted the messages to text files and preprocessed them to get them ready for the SVM (support vector machine) tool: removing punctuation, digits, and “stop words” like pronouns. He then ran part of the sample through the SVM to construct 6 classifiers based on the words in the messages — in other words, he taught the machine how he wanted his email to be classified. Finally, he used the 6 classifiers to categorize the rest of the emails in that original 577-message sample.
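
For readers who want a feel for the train-then-classify pipeline he describes, here is a minimal sketch. Nothing in it comes from the presentation itself: the messages, the category labels, and the choice of the scikit-learn library are all placeholders I’ve made up to illustrate the technique.

```python
# Minimal sketch of training an SVM to classify email by retention category.
# The training messages, labels, and new message are invented examples.
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def preprocess(text):
    """Strip punctuation and digits; the vectorizer removes stop words."""
    return re.sub(r"[^a-zA-Z\s]", " ", text).lower()

# Hypothetical labeled sample: message text paired with a retention category.
training_messages = [
    "Minutes and agenda from the advisory board meeting on budget policy",
    "Reminder: the mail server will be down Saturday for scheduled maintenance",
    "Lunch today? The cafeteria has tacos",
]
training_labels = [
    "Advisory Board Activities",
    "Computer System Maintenance",
    "Transitory Correspondence",
]

# A linear SVM learns which words signal which retention category
# (one classifier per category under the hood, one-vs-rest).
classifier = make_pipeline(
    TfidfVectorizer(preprocessor=preprocess, stop_words="english"),
    LinearSVC(),
)
classifier.fit(training_messages, training_labels)

# Categorize a new, unlabeled message.
new_message = "The board's agenda for next month's advisory meeting is attached"
print(classifier.predict([new_message])[0])  # e.g. "Advisory Board Activities"
```

In practice you would train on hundreds of labeled messages per category, not three, but the shape of the workflow is the same: preprocess, train on a labeled portion of the sample, then let the classifiers loose on the rest.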

The results were startlingly accurate. Only two out of 198 emails were incorrectly classified.

Granted, this experiment is not something we’re going to be able to take back to our offices and implement; I, for one, do not have access to statistical software. But Underwood’s research shows that support vector machines can be trained to be highly accurate email classifiers. And maybe that means software of this variety is just around the corner?

The accuracy of auto-categorization tools can also be improved with users’ assistance. Underwood offered the following ideas:

  • At the time of creation, tag copies of intra-organizational email with their filing category.
  • Limit the use of classifiers to the email categories specific to an office.
  • Associate specific filing categories with generic retention categories.
  • If a person routinely creates records in a particular filing category, include the category ID in a template or in a pull-down menu.
  • Use subject line tags to facilitate categorization.
  • Categorize responses to already-categorized messages in the same category as the original. (A rough sketch of the last two ideas follows this list.)
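
The last two ideas are simple enough to sketch in a few lines. The “[CAT:...]” tag format, the functions, and the example subjects below are my own inventions for illustration; the underlying ideas (read a category out of the subject line, and let replies inherit the original message’s category) are Underwood’s.

```python
# Rough sketch of subject-line tags plus category inheritance for replies.
# The "[CAT:...]" tag format and the example subjects are made up.
import re

TAG_PATTERN = re.compile(r"\[CAT:([^\]]+)\]")

def category_from_subject(subject):
    """Return the filing category embedded in a tagged subject line, if any."""
    match = TAG_PATTERN.search(subject)
    return match.group(1) if match else None

def categorize(subject, original_category=None):
    """Prefer an explicit subject-line tag; otherwise a reply inherits the
    category of the message it responds to."""
    tagged = category_from_subject(subject)
    if tagged:
        return tagged
    if subject.lower().startswith("re:") and original_category:
        return original_category
    return None  # leave for the user (or a trained classifier) to decide

print(categorize("[CAT:Special Events] Ribbon-cutting logistics"))
print(categorize("Re: Ribbon-cutting logistics", original_category="Special Events"))
```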

He acknowledged that steps like tagging emails with categories from a drop-down menu are something people grumble about, and it’s true that some users will ignore or circumvent that step. However, as Underwood pointed out, we do have an obligation to retain our email according to our retention schedules. NARA even requires that federal agencies’ email archiving software “provide for the grouping of related records into classifications according to the nature of the business purposes the records serve” (see Bulletin 2011-03).

Even if these tools don’t become available — or affordable — to our agencies anytime soon, Underwood’s ideas are worth considering. Many of the ideas just improve productivity!

And in case our email subscribers are wondering how to categorize this message, here’s one free tip for you: delete it!  It’s reference material!  (Don’t worry, I’ll microfilm it for permanent preservation…)