In Depth - File Fixity & Data Integrity

Summary/Best Practices

Verifying a file's fixity is typically accomplished by generating, recording, and monitoring a checksum. Running a checksum is a process that maps data to a bit string of a fixed size, usually by means of a cryptographic hash function. This hash is stored and compared during file movements and can be used to verify the integrity of files in a storage environment. Checksums are important in that they establish a chain of trust between the object and the filesystem.





Step by Step


Level 0 to Level 1

  • Check file fixity on ingest if it has been provided with the content
  • Create fixity info if it wasn’t provided with the content

For level 1, the recommendation is to verify fixity upon ingest into your repository or movement into your system. This can only be done if fixity information is provided for the content. Verification essentially means taking the existing hash value and comparing it to a second hash value generated either after a file is moved or after a period of time has passed. If the two values match, the file is unchanged.

Verifying MD5 checksums
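The comparison described above can be sketched in a few lines of Python. This is a minimal illustration rather than a prescribed tool: the function names `file_checksum` and `verify_fixity` are hypothetical, and MD5 is used only because it is the algorithm most often supplied with incoming content.

```python
import hashlib


def file_checksum(path, algorithm="md5", chunk_size=1024 * 1024):
    """Return the hex digest of the file at `path`, reading it in chunks
    so that large files do not have to fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_fixity(path, expected_hex, algorithm="md5"):
    """Return True if the file's current checksum matches the recorded one.
    The recorded value is normalized (whitespace stripped, lower-cased)."""
    return file_checksum(path, algorithm) == expected_hex.strip().lower()
```

A `False` result means the file no longer matches the checksum that was provided with it and should be investigated before ingest continues.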

If fixity information does not exist for a file, it is important to establish that trust level from this point forward. Creating fixity information for a file can be accomplished in a number of ways.
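One of those ways is a short script that walks a folder and records a checksum for every file. The sketch below assumes SHA-256 and a simple "checksum, two spaces, relative path" line format; both are illustrative choices, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root, POSIX style) to its checksum."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def write_manifest(manifest, out_path):
    """Write one '<checksum>  <relative path>' line per file."""
    with open(out_path, "w", encoding="utf-8") as out:
        for rel_path, checksum in sorted(manifest.items()):
            out.write(f"{checksum}  {rel_path}\n")
```

Writing the log outside the folder being described keeps the log itself from appearing in later runs.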

BagIt is a widely used packaging format developed as part of the National Digital Information Infrastructure and Preservation Program to accurately transfer digital content across networks and file systems.

The BagIt specification is organized around the notion of a "bag". A bag is a named file system directory that minimally contains:

  • a "data" directory that includes the digital content being preserved. Files can also be placed in subdirectories, but empty directories are not supported
  • at least one manifest file that itemizes the filenames present in the "data" directory, as well as their checksums. The particular checksum algorithm is included as part of the manifest filename. For instance, a manifest file with SHA-256 checksums is named "manifest-sha256.txt"
  • a "bagit.txt" file that identifies the directory as a bag, the version of the BagIt specification that it adheres to, and the character encoding used for tag files

In addition to the manifest, the specification defines several optional tag files, including:

  • a "bag-info.txt" file which details metadata for the bag, using colon-separated key/value pairs
  • a tag manifest file which lists tag files and their associated checksums (e.g. "tagmanifest-md5.txt")

On receipt of a bag, a BagIt tool can examine the manifest file to make sure that the files are present and that their checksums are correct. This allows accidentally removed or corrupted files to be identified. Below is an example of a bag that encloses a single payload file.

Example of a bag
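As a rough illustration of the minimal structure described above, the following Python sketch builds a bag around one payload file and then checks the payload against the manifest. It is not a full implementation of the BagIt specification (for example, it does not look for files that are missing from the manifest), and the function names are hypothetical. Dedicated tools such as Bagger perform complete validation.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def make_minimal_bag(bag_dir, payload_name, payload_bytes):
    """Create data/, bagit.txt and manifest-sha256.txt for one payload file."""
    bag = Path(bag_dir)
    (bag / "data").mkdir(parents=True, exist_ok=True)
    payload = bag / "data" / payload_name
    payload.write_bytes(payload_bytes)
    # The bag declaration names the spec version and tag file encoding.
    (bag / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    # Each manifest line is a checksum followed by a path relative to the bag.
    (bag / "manifest-sha256.txt").write_text(
        f"{sha256_of(payload)}  data/{payload_name}\n", encoding="utf-8"
    )
    return bag


def check_bag(bag_dir):
    """Return a list of problems; an empty list means every file listed
    in the manifest is present and matches its recorded checksum."""
    bag = Path(bag_dir)
    problems = []
    manifest = (bag / "manifest-sha256.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        if not line.strip():
            continue
        checksum, rel_path = line.split(maxsplit=1)
        target = bag / rel_path
        if not target.is_file():
            problems.append(f"missing: {rel_path}")
        elif sha256_of(target) != checksum.lower():
            problems.append(f"checksum mismatch: {rel_path}")
    return problems
```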

The Library of Congress produced a short video about BagIt in 2009: https://www.youtube.com/watch?v=l3p3ao_JSfo

Bagger Interface


Level 1 to Level 2

  • Check fixity on all ingests
  • Use write-blockers when working with original media
  • Virus-check high risk content

Moving from level 1 to level 2 in the framework means that, in addition to actively creating and verifying fixity for all ingested content, you are selectively using write protection and virus checks for certain content.

A write-blocker, or forensic bridge device, is a computer hard disk controller made for the purpose of gaining read-only access to computer hard drives/media without the risk of damaging the contents or inadvertently altering the metadata. Changes can be caused at the time of connection even if you do not issue an explicit command or action within the operating system. Most manufacturers of forensic bridges produce both portable and workstation drive-bay versions. Most devices provide a wide variety of host connections including USB 3.0 & FireWire 800 and support standard hard disk drives and solid-state drives (SSD). There are also write-blockers made specifically for memory cards and USB drives. Starting costs for write-blockers range from $150 to $400 depending on the media type you are working with. Popular write-blocker manufacturers include Tableau and WiebeTech.

Write blocker attached to hard drive

When accessioning born-digital materials, or any digital file for that matter, it's important to safeguard against computer viruses no matter where the files originate. This is typically done just after the transfer process so that the source files are not altered. ClamAV is the most popular open source antivirus software being used in libraries and archival organizations. The software is part of the "stack" in the Hydra Project, is an integral part of BitCurator, and is a core microservice within the popular digital preservation system Archivematica. ClamAV does not include a graphical user interface and is instead run from the command line, but a number of third-party developers have written GUIs for the various platforms.

Virus scan log file

macOS:

ClamXav is the most popular ClamAV graphical front-end for macOS. ClamXav can be set up as passive or active: it can scan only the files you tell it to or your entire hard drive. You can also choose to activate Sentry to monitor your hard drive and scan new files as they arrive. Two licenses cost $21 with an educational discount.

https://www.clamxav.com

Windows:

ClamWin is a popular Windows-based ClamAV graphical front-end.

http://www.clamwin.com/content/view/18/46/

Immunet is another popular Windows-based ClamAV graphical front-end.

http://www.immunet.com

Ubuntu:

ClamTk is a ClamAV graphical front-end for Linux-based systems.

https://dave-theunsub.github.io/clamtk/


Level 2 to Level 3

  • Check fixity of content at fixed intervals
  • Maintain logs of fixity info; supply audit on demand
  • Ability to detect corrupt data
  • Virus-check all content

In addition to checking fixity before and after transfer, collections of digital files and objects should be checked on a regular basis. There is a range of systems and approaches focused on checking object fixity at regular intervals, whether monthly, quarterly, or yearly. The more often you check, the sooner you are likely to detect errors and the better your chance of repairing them. Most fixity checking software creates file-level or folder-level log files that can be used for directory or project audits. These are simply plain text documents with a list of files and their checksums.

Once you have generated baseline fixity information for files or objects, comparing that information with future fixity check information will tell you if a file has changed or been corrupted. Many so-called "diff tools" can compare two log files of the same directory and highlight the differences. This fixity information can be used to support the repair of data that has been damaged.

Comparison of folder-level log files using Meld
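The same comparison can be scripted when a graphical diff tool is not convenient. The sketch below assumes each log line has the form "checksum, whitespace, path", which is a common layout but not a universal one; the function names are hypothetical and the parsing would need adjusting for other log formats.

```python
def parse_log(text):
    """Parse '<checksum>  <path>' lines into a {path: checksum} dict."""
    entries = {}
    for line in text.splitlines():
        if line.strip():
            checksum, path = line.split(maxsplit=1)
            entries[path.strip()] = checksum.lower()
    return entries


def compare_logs(baseline_text, current_text):
    """Report files whose checksum changed, files that disappeared,
    and files that are new since the baseline log was made."""
    baseline = parse_log(baseline_text)
    current = parse_log(current_text)
    shared = baseline.keys() & current.keys()
    return {
        "changed": sorted(p for p in shared if baseline[p] != current[p]),
        "missing": sorted(baseline.keys() - current.keys()),
        "new": sorted(current.keys() - baseline.keys()),
    }
```

Files reported as "changed" or "missing" are the candidates for repair from another copy.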

Continue running fixity checks, and retain the log files they produce. If you're using BagIt, these logs are called manifest files and are part of what's generated when you create a bag. Other methods of generating checksums can often produce either file-level or folder-level log files.

Folder-level MD5 checksum log file

Virus checking should be a required component of all digital object accessioning at this level going forward. An appropriate plan should also be developed to deal with infected or suspect files. Most antivirus software will either move them into a quarantine folder for later review or simply delete them.
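A scan with quarantine can be scripted around ClamAV's `clamscan` command. The sketch below is one possible arrangement, assuming ClamAV is installed and on the system path; the function names and the directory arguments are placeholders.

```python
import subprocess


def build_clamscan_command(target_dir, quarantine_dir, log_file):
    """Assemble a recursive clamscan call that moves infected files aside."""
    return [
        "clamscan",
        "--recursive",                # descend into subfolders
        "--infected",                 # only report infected files
        f"--move={quarantine_dir}",   # quarantine rather than delete
        f"--log={log_file}",          # keep a scan log for later review
        str(target_dir),
    ]


def scan(target_dir, quarantine_dir, log_file):
    """Run the scan and translate clamscan's exit status:
    0 means no virus found, 1 means a virus was found, 2 means an error."""
    command = build_clamscan_command(target_dir, quarantine_dir, log_file)
    result = subprocess.run(command)
    return {0: "clean", 1: "infected"}.get(result.returncode, "error")
```

Moving suspect files to a quarantine folder instead of deleting them leaves room for later review, which matches the plan recommended above.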


Level 3 to Level 4

  • Check fixity of all content in response to specific events or activities
  • Ability to replace/repair corrupted data
  • Ensure no one person has write access to all copies

In addition to checking fixity during ingests/transfers and on a regular schedule, it is also recommended to check fixity during the production or digitization processes. Files may move through a number of workstations depending on the level of quality control and additional processes in place. File movement fixity checks ensure that files are transferred intact and unchanged. It is also important to check the fixity of content in response to specific events such as hardware failure or security breaches.

Quality control stage checkSum+ fixity verification
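A file movement fixity check of this kind can be sketched as a copy followed by a comparison of checksums taken before and after. The function names are hypothetical and SHA-256 is an assumed choice of algorithm.

```python
import hashlib
import shutil


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(source, destination):
    """Copy a file, then confirm the copy's checksum matches the source.
    Returns the verified checksum, or raises OSError on a mismatch."""
    before = sha256_of(source)
    shutil.copy2(source, destination)  # copy2 also preserves timestamps
    after = sha256_of(destination)
    if before != after:
        raise OSError(f"Fixity mismatch copying {source} to {destination}")
    return after
```

The returned checksum can be recorded in the fixity log for the new location.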

Some file and storage systems such as ZFS and Ceph have built-in fixity checks so that data is checked on a regular basis. If file damage does occur, these systems have storage redundancy and self-healing features that can discern which copy is correct and then overwrite the damaged file(s). If system-level auto checking is not an option, you can compare fixity logs using differencing tools such as Meld, DiffMerge, or Kaleidoscope, which highlight the differences in checksum values. You can then work to repair or replace the corrupted file.

ZFS system stats via macOS terminal
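Where system-level self-healing is not available, the manual repair step can be approximated with a script that finds an intact copy and uses it to overwrite damaged ones. This is a simplified sketch: it assumes you hold a trusted baseline checksum and that all copies are reachable as file paths, and the function names are hypothetical.

```python
import hashlib
import shutil


def sha256_or_none(path):
    """Return the file's SHA-256 hex digest, or None if it cannot be read."""
    try:
        with open(path, "rb") as handle:
            return hashlib.sha256(handle.read()).hexdigest()
    except OSError:
        return None


def repair_copies(copies, expected_checksum):
    """Overwrite every copy that fails its fixity check (or is missing)
    with a copy that passes. Returns the list of repaired paths."""
    good = [p for p in copies if sha256_or_none(p) == expected_checksum]
    if not good:
        raise RuntimeError("No intact copy found; restore from another source")
    repaired = []
    for path in copies:
        if path not in good:
            shutil.copy2(good[0], path)
            repaired.append(path)
    return repaired
```

If no copy matches the baseline, the script stops rather than guessing which damaged copy is closest to correct.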

Formally establishing administrative roles, responsibilities, and authorizations for your systems is a way to ensure that no one person has write access to all copies of a file. This would safeguard against accidental editing, overwriting, and/or deletion. The three major operating systems, Windows, macOS, and Ubuntu, all have sophisticated file/folder permissions settings.

Windows 7 security settings
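Once roles are documented, the "no one person" rule can be checked mechanically. The sketch below assumes you maintain a simple record of which accounts can write to each copy; the function name, copy names, and account names are all hypothetical, and the record would in practice be compiled from your own permission settings.

```python
def users_with_write_access_to_all(write_access):
    """Given {copy name: accounts that can write to it}, return the set of
    accounts that can write to every copy. An empty set is the goal."""
    if not write_access:
        return set()
    return set.intersection(*(set(users) for users in write_access.values()))


# Hypothetical record of write permissions for three copies.
example_access = {
    "primary storage": {"alice", "bob"},
    "local backup": {"bob", "carol"},
    "offsite copy": {"bob", "dave"},
}
```

With the example record above, the function returns a set containing only "bob", indicating that this account's access to at least one copy should be removed.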