Archives Month: Web Archive (WARC) File Format

by Erica Rice

It’s October, which means that it is once again time to celebrate American Archives Month! Throughout this month, the records management assistance unit will be taking a closer look into multidisciplinary issues that require input from both records managers and archivists—collaboration between these two professions is the key to solving many RIM-related obstacles.

First on the docket: web archiving.

When it comes to local government and state agency websites, retention periods will vary depending on the function and content of the information uploaded to the web. If you have a website that needs to be maintained long-term or permanently as a state publication, then you will need to think about how to preserve the availability, readability, and integrity of the electronic record. Keep in mind, a website is a collection of several different kinds of electronic records, and not all websites will be state publications. A lot of the content you upload to the web can be preserved offline in alternate formats, such as PDF reports, videos, and pictures; the copy uploaded to your website may just be a convenience copy of the official record copy.

Challenges

The challenge of website preservation is that a website is dynamic: the public can interact with it, web pages often contain several media types (text, photos, video, animations, etc.), the information is constantly being updated, and even the look and feel of a website can vary across web browsers. Simply screenshotting a website is not sufficient. Web archiving requires a preservation format that maintains all hyperlinked content and metadata in its original context. You also want a preservation method that lets users see how the layout and content of a website changed over time; in other words, they can see how the website looked and functioned on a specific date in the past.

The Basic Steps

To archive a website so that it still functions interactively, there are three main steps that need to be performed:

1. Harvest the website(s) with a web crawler.

2. Store the harvested data in a web archive format, like WARC.

3. View and interact with the archived website using a replay mechanism.
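To make step 2 more concrete, here is a minimal sketch of what "storing harvested data in WARC" means at the file level. It uses only the Python standard library and wraps one harvested file in a single WARC `resource` record. The function name `build_warc_record` and the example URL are assumptions made for illustration; a production crawler such as Heritrix writes far richer records and usually compresses them.

```python
import uuid
from datetime import datetime, timezone


def build_warc_record(target_uri: str, payload: bytes,
                      content_type: str = "text/html") -> bytes:
    """Wrap one harvested file in a single WARC 'resource' record.

    Hypothetical helper for illustration only; not part of any crawler.
    """
    headers = [
        ("WARC-Type", "resource"),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        # Capture time in UTC, formatted as the WARC-Date field expects.
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", content_type),
        ("Content-Length", str(len(payload))),
    ]
    # A record is: version line, named header fields, a blank line,
    # the content block, then two CRLF pairs that close the record.
    head = "WARC/1.0\r\n" + "".join(f"{k}: {v}\r\n" for k, v in headers) + "\r\n"
    return head.encode("utf-8") + payload + b"\r\n\r\n"


# Example: package one (made-up) page; records can be appended to one file.
record = build_warc_record("https://example.gov/", b"<html>hi</html>")
print(record.decode("utf-8"))
```

Because each record carries its own headers, many records can be concatenated into one `.warc` file, which is how a single archive file can hold an entire crawl.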

There are third-party services that can perform some or all of these steps. The Library of Congress (LOC) uses Heritrix to crawl websites and OpenWayback to replay the stored captures; both of these tools are open source and freely available. However, these tools require advanced IT support to set up, run, and maintain. Consider the IT support available in your local government or state agency; you may need to consult with a third-party provider, such as the Archive-It service hosted by the Internet Archive. Please note, TSLAC does not endorse any specific vendor or provider.

Toolkit

The LOC recommends WARC (Web ARChive) as the “gold standard” file format for web archiving. Detailed specifications for this file format can be found on the LOC’s Digital Formats Library.

The main selling points of the WARC format are:

  • non-proprietary, open standard format (promotes transparency and reduces risk of obsolescence)
  • international standard (ISO 28500) (widely available and used by many mainstream web crawlers)
  • structured record headers (allows for automated bulk harvesting, duplicate information elimination, and indexing of websites over time)
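The last point, structured record headers, can be illustrated with a short sketch. Assuming an uncompressed WARC file whose records carry `WARC-Target-URI`, `WARC-Date`, and `WARC-Payload-Digest` fields, the code below reads only the headers to build a per-URL timeline and to flag captures whose content is unchanged. The helper names (`make_record`, `iter_records`, `index_captures`) and the sample data are hypothetical, and the hex-encoded SHA-1 digest is a simplification; real tools handle compression and many more record types.

```python
import hashlib


def make_record(uri: str, date: str, body: bytes) -> bytes:
    """Build a tiny sample record (hypothetical helper, for demo data only)."""
    digest = "sha1:" + hashlib.sha1(body).hexdigest()
    head = (f"WARC/1.0\r\nWARC-Type: resource\r\nWARC-Target-URI: {uri}\r\n"
            f"WARC-Date: {date}\r\nWARC-Payload-Digest: {digest}\r\n"
            f"Content-Length: {len(body)}\r\n\r\n")
    return head.encode("utf-8") + body + b"\r\n\r\n"


def iter_records(data: bytes):
    """Yield (headers, body) for each record in uncompressed WARC bytes."""
    pos = 0
    while pos < len(data):
        head_end = data.index(b"\r\n\r\n", pos)
        lines = data[pos:head_end].decode("utf-8").split("\r\n")
        # lines[0] is the version line; the rest are "Name: value" fields.
        headers = dict(line.split(": ", 1) for line in lines[1:])
        body_start = head_end + 4
        length = int(headers["Content-Length"])
        yield headers, data[body_start:body_start + length]
        pos = body_start + length + 4  # skip the record's closing CRLF CRLF


def index_captures(data: bytes) -> dict:
    """Map each URL to a list of (capture date, is_duplicate) pairs."""
    index, seen = {}, set()
    for headers, _body in iter_records(data):
        digest = headers["WARC-Payload-Digest"]
        duplicate = digest in seen  # same content was already captured
        seen.add(digest)
        index.setdefault(headers["WARC-Target-URI"], []).append(
            (headers["WARC-Date"], duplicate))
    return index


uri = "https://example.gov/"
archive = (make_record(uri, "2007-10-01T00:00:00Z", b"v1")
           + make_record(uri, "2008-10-01T00:00:00Z", b"v1")
           + make_record(uri, "2009-10-01T00:00:00Z", b"v2"))
for date, duplicate in index_captures(archive)[uri]:
    print(date, "unchanged" if duplicate else "new content")
```

Note that the index is built without interpreting the page content itself; the headers alone are enough to tell when a page was captured and whether it changed.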

However, there are some limitations to the WARC format:

  • cannot capture dynamic content like maps that can be zoomed in and out
  • cannot capture streaming content or anything that requires the user to download; WARC is a packaging format that aggregates individual files into one unit, so any actions that need to be taken on an individual file cannot be done within the WARC
  • WARC will not capture a back-end database when the website serves as a graphical front-end (search page)
  • if you are using a public archiving service to crawl and store websites, there is no guarantee that the archived content will be available and accessible perpetually

Weigh the pros against the cons to decide if automated web crawling and archiving is right for your governmental entity. The advantage of using WARC is that you will be able to find many sources of support (sometimes free resources) because it is a ubiquitous preservation format. The disadvantage is that you will have to consider how to preserve dynamic content or any media that is not captured by WARC; you will need a plan to ensure that non-capturable media is preserved offline and/or in an alternative format.

Did You Know?

Many state agency websites in Texas are automatically crawled by TRAIL (Texas Records and Information Locator) for publications, which are deposited periodically. The TRAIL system utilizes Archive-It to maintain WARCs. Not only is this a great way to preserve permanent state records, but it can also be quite a lot of fun to see how state government websites have evolved over time. Check out the RMA unit’s website from 2007—a true blast from the past!

[Image: Timeline of SLRM website from 2007 to present day]

Members of the public can also archive and replay any website using the Internet Archive's Wayback Machine. Do you remember what Google’s search page looked like in 2007? The Wayback Machine remembers.
