Monitoring for NAS data corruption on ext4 with cshatag

Two open-source programs, cshatag and runner, make it easy to monitor an ext4 filesystem for data corruption.

Setting up monitoring for silent data corruption on my NAS had been on my todo list for ages. It turned out to be surprisingly easy, thanks to two programs: cshatag and my own runner!

Before we dive in, some background information:

  • My NAS (aka "home server") runs Ubuntu 22.04 LTS.
  • Bulk data is stored on an mdraid array of 6 disks. The array uses the ext4 filesystem and is mounted at /mnt/storage.
  • cshatag calculates each file's hash and stores it, along with the file's modification time, in the file's user extended attributes (xattrs). On subsequent runs, it returns an error if a file's hash has changed but its modification time hasn't.
  • runner is a tool I wrote that makes it really convenient to run tools like cshatag via cron and get emails (and/or Ntfy notifications) only when failures occur, while logging the complete output on the filesystem for later examination if needed.
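
To make cshatag's detection rule concrete, here is a minimal bash sketch of the decision described above. It is not cshatag's actual code: the function name classify_file and the status labels are made up for illustration, and the real tool reads the stored values from xattrs rather than taking them as arguments.

```shell
#!/usr/bin/env bash
# Illustrative only: classify a file from its stored and current hash/mtime.
# Usage: classify_file STORED_HASH STORED_MTIME CURRENT_HASH CURRENT_MTIME
classify_file() {
  local stored_hash="$1" stored_mtime="$2" current_hash="$3" current_mtime="$4"

  if [ -z "$stored_hash" ]; then
    echo "new"        # nothing stored yet: record hash and mtime
  elif [ "$stored_hash" = "$current_hash" ]; then
    echo "ok"         # content matches the stored hash
  elif [ "$stored_mtime" = "$current_mtime" ]; then
    echo "corrupt"    # content changed but mtime did not: report an error
  else
    echo "outdated"   # file was modified normally: refresh the stored values
  fi
}
```

For example, `classify_file abc 1700000000 def 1700000000` prints "corrupt", because the hash differs while the modification time is unchanged.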

Both cshatag and runner provide installation directions in their READMEs, so I won't duplicate them here.

On modern systems, including mine, xattrs are enabled by default on ext4 filesystems, so no changes to /etc/fstab were needed. And the system is already set up with Postfix and Mailgun to email me output from cron jobs, so no additional configuration was needed there. (runner also supports sending email directly, in case you can't or don't want to set up email systemwide.)
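
If you want to confirm that user xattrs actually work on your filesystem before scheduling anything, a quick probe like the following can help. This is a rough sketch, not something cshatag or runner provides: the function name check_user_xattrs and the attribute name user.xattr_probe are hypothetical, and it assumes the setfattr tool (from the attr package on Ubuntu) is installed.

```shell
#!/usr/bin/env bash
# Illustrative only: report whether user xattrs can be written in a directory.
# Usage: check_user_xattrs DIRECTORY
check_user_xattrs() {
  local dir="$1" probe result

  # setfattr is assumed to come from the "attr" package.
  if ! command -v setfattr >/dev/null 2>&1; then
    echo "missing-tools"
    return 2
  fi

  # Create a throwaway file inside the directory being checked.
  probe="$(mktemp "$dir/.xattr-probe.XXXXXX")" || return 2

  if setfattr -n user.xattr_probe -v 1 "$probe" 2>/dev/null; then
    result="supported"
  else
    result="unsupported"
  fi

  rm -f "$probe"
  echo "$result"
  [ "$result" = "supported" ]
}
```

Running `check_user_xattrs /mnt/storage` as a user who can write to that directory should print "supported" on a filesystem where user xattrs are enabled.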

All that's left is putting together a file in /etc/cron.d with the right combination of arguments! Here's what the file /etc/cron.d/cshatag looks like on my NAS:

SHELL=/bin/bash
RUNNER_LOG_DIR=/var/log/cshatag

#
# cshatag: check for data corruption
# https://github.com/rfjakob/cshatag
#

# Monday: general
05 11 * * 1  root  runner -job-name cshatag-general  -work-dir /mnt/storage -healthy-exit 0 -healthy-exit 2 -healthy-exit 3 -- ionice -c2 -n7 nice -n19 /usr/local/bin/cshatag -qq -recursive ./general

# Wednesday: plex
05 11 * * 3  root  runner -job-name cshatag-plex     -work-dir /mnt/storage -healthy-exit 0 -healthy-exit 2 -healthy-exit 3 -- ionice -c2 -n7 nice -n19 /usr/local/bin/cshatag -qq -recursive ./plex

# Cleanup logs in RUNNER_LOG_DIR older than 30 days:
00 00 * * *  root  find "$RUNNER_LOG_DIR" -mtime +30 -name "*.log" -delete  >/dev/null

Let's break down everything on the line for the "Monday: general" job:

  • runner
    • -job-name cshatag-general: specifies the job name used in logs, emails, and notifications.
    • -work-dir /mnt/storage: work in the /mnt/storage directory.
    • -healthy-exit 0 -healthy-exit 2 -healthy-exit 3: consider the exit codes 0, 2, and 3 to be "healthy" exit codes (not errors). cshatag returns 0 if no errors happened; 2 if "one or more files could not be opened"; or 3 if "one or more files is not a regular file." I don't want to be notified via email when any of these conditions are encountered.
  • --: separates runner's arguments from the name & arguments for the program it'll execute.
  • ionice -c2 -n7: run the program with the lowest priority in the "best effort" IO scheduling class (see man ionice).
  • nice -n19: run the program with the lowest CPU scheduling priority (niceness 19).
  • /usr/local/bin/cshatag: the path to the cshatag binary.
  • -qq: "quiet2 mode - only report corrupt files and errors."
  • -recursive: recursively process the contents of directories.
  • ./general: the path (under /mnt/storage) for cshatag to process.
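
The idea behind the -healthy-exit flags above is a simple membership check on the command's exit code. The following bash sketch shows that idea only; it is not how runner is implemented, and the function names is_healthy_exit and run_and_check are hypothetical.

```shell
#!/usr/bin/env bash
# Illustrative only: decide whether an exit code counts as "healthy".
# Usage: is_healthy_exit CODE HEALTHY_CODE...
is_healthy_exit() {
  local code="$1" healthy
  shift
  for healthy in "$@"; do
    [ "$code" -eq "$healthy" ] && return 0
  done
  return 1
}

# Hypothetical wrapper: run a command, treating exit codes 0, 2, and 3
# as healthy (the same set used in the cron file) and anything else as failure.
run_and_check() {
  local status=0
  "$@" || status=$?
  if is_healthy_exit "$status" 0 2 3; then
    echo "healthy (exit $status)"
  else
    echo "FAILED (exit $status)" >&2
    return "$status"
  fi
}
```

For example, `run_and_check sh -c 'exit 2'` prints "healthy (exit 2)", while a command exiting with any code outside the healthy set prints a failure message to stderr and passes the code through.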

The end result is that, every Monday at 11:05am, cshatag runs recursively on /mnt/storage/general with low IO and CPU priority. If cshatag returns an error (indicating data corruption or a failure writing xattrs), I'll be notified via email.