De-Duplicating Software: An Introduction

One of the most useful tools in managing electronic records is de-duplication software. Is it right for your government?

In short, de-duplication software analyzes electronic records to determine whether a drive or folder contains duplicates. There are countless versions of this class of software available online. Some are paid products, while others are available as freeware.

The features of de-duplication software will vary, but they can include:

  • Ability to search for duplicates by filename
  • Ability to analyze files byte by byte
  • Ability to analyze files pixel by pixel (for pictures)
  • Option to review files before any action is taken
  • Ability to delete files within the application
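To make the "byte by byte" feature concrete, here is a minimal Python sketch of how a tool might find exact duplicates. It is an illustration only, not how any particular product works: the function names `file_digest` and `find_duplicates` are hypothetical, and the approach (hash every file, then confirm matches with a byte-by-byte comparison) is one common technique among several. It only reports duplicates and never deletes anything.

```python
import filecmp
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root):
    """Group files under `root` whose contents are identical.

    Returns a list of groups; each group is a list of two or more Paths.
    Nothing is deleted or modified.
    """
    by_digest = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            by_digest[file_digest(path)].append(path)

    groups = []
    for paths in by_digest.values():
        if len(paths) < 2:
            continue
        first = paths[0]
        # Confirm each hash match with a true byte-by-byte comparison.
        same = [first] + [
            p for p in paths[1:] if filecmp.cmp(first, p, shallow=False)
        ]
        if len(same) > 1:
            groups.append(same)
    return groups
```

Note that this compares file contents, not filenames, so a copy saved under a different name in a different folder is still found.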

Why might de-duplication software be useful for your records management program?

De-duplication software will allow you to identify the locations of convenience copies. Drive mapping is another very useful tool for records management, but it will not tell you whether there are duplicate files.

Say you’ve got a file structure that beautifully outlines where your government’s Health and Wellness Committee records are stored. You even give the folder containing the records a retention-conscious name like “Committee Records – GR1000-54 +2 years.” Your government does regular disposition and your records management operation is running like a well-oiled machine. Then you run a de-duplication tool and discover that not only are the active Health and Wellness Committee records being stored outside your file structure, but so are committee records going back several years – records that you thought you had deleted! As we all know, if you have a record that is responsive to an open records request, you must produce it even if it has met retention.

This is Beebe and Donna. They may be roughly 80% similar, but certainly not duplicates!

This class of software not only locates duplicates but will also give you the option to delete unwanted copies. It’s a powerful tool in this respect, which is why you should only use a version that allows you to review duplicate files before anything is deleted. For instance, the software may flag two files that are 95% similar as duplicates. After reviewing the findings, you discover that one is the working paper GR1000-41a (5) for an Annual Report GR1000-41a (1) that must be submitted to a state agency. From TSLAC’s perspective, these are two separate records that each have their own retention requirements – even though the software treats them as the same.
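The difference between "identical" and "95% similar" can be sketched in a few lines of Python. This is an illustrative example only: real products use their own matching methods, and the `classify` function, its labels, and the 0.95 threshold are assumptions made here for demonstration. It uses the standard library's `difflib.SequenceMatcher` to score how similar two pieces of text are.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a similarity score between 0.0 and 1.0 for two strings."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def classify(a: str, b: str, threshold: float = 0.95) -> str:
    """Label a pair of texts; only exact matches count as duplicates."""
    if a == b:
        return "exact duplicate"
    if similarity(a, b) >= threshold:
        # Similar is not the same: a person must decide what each file is.
        return "near match: review by hand"
    return "different"


# Hypothetical contents of a working paper and the final report.
draft = "Annual Report 2019: total requests 120"
final = "Annual Report 2019: total requests 125"
print(classify(draft, final))
```

A working paper and its final report would land in the "review by hand" category here, which is exactly the case where an automatic delete would destroy a separate record.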

How a local government or state agency can get the most out of de-duplication software:

  • Identify where convenience copies are being stored on your shared drive. Consider adding shortcuts to the folder containing the record copy in these secondary – but logical – locations.
  • In consultation with your RMO, liaisons, IT, and other stakeholders, make sure that final disposition is truly final disposition by eliminating or transferring all copies of a given record when it has reached the end of its life cycle.

Before implementing any new software, you’ll want to consult heavily with your IT department. De-duplication software is available in countless iterations at various price points (including free). Some types I experimented with – on my home computer – were bloatware or malware. Not good, so be careful out there.

Also, the software’s ease of use can mask how serious the consequences for your files can be. Within a few minutes you can identify and delete dozens if not hundreds of duplicate files. Have safeguards in place and a review process set up to avoid over-deletion.
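One possible safeguard is to have the tool produce a report for human review instead of deleting anything. The Python sketch below shows the idea under stated assumptions: the `write_review_report` function, the column names, and the blank "decision" column are all hypothetical, and the input is assumed to be a list of duplicate groups, where each group is a list of file paths.

```python
import csv
from pathlib import Path


def write_review_report(groups, report_path):
    """Write one CSV row per file in each duplicate group.

    The "decision" column is left blank for a reviewer to fill in.
    No files are deleted or changed. Returns the number of rows written.
    """
    rows = 0
    with open(report_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["group", "path", "size_bytes", "decision"])
        for group_id, paths in enumerate(groups, start=1):
            for path in paths:
                p = Path(path)
                writer.writerow([group_id, str(p), p.stat().st_size, ""])
                rows += 1
    return rows
```

The resulting spreadsheet can be circulated to your RMO and other stakeholders, so that every deletion is a recorded decision rather than a single click.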

For more information on de-duplication software, see TrustRadius’s overview. TrustRadius includes a comprehensive listing of what is available in 2019.

Have any of you used de-duplication software for your records? If so, please share your experience in the comments.


