Monitoring for NAS data corruption on ext4 with cshatag
One item on my todo list for ages was to put in place some sort of monitoring for silent data corruption on my NAS. This turned out to be surprisingly easy to do, thanks to two programs: cshatag and my own runner!
Before we dive in, some background information:
- My NAS (aka “home server”) runs Ubuntu 22.04 LTS.
- Bulk data is stored on an mdraid array of 6 disks. The array uses the ext4 filesystem and is mounted at /mnt/storage.
- cshatag calculates file hashes and stores them, along with the file’s modification time, in a file’s user xattrs. On subsequent runs, it returns an error if a file’s hash has changed but its modification time hasn’t.
- runner is a tool I wrote that makes it really convenient to run tools like cshatag via cron and get emails (and/or Ntfy notifications) only when failures occur, while logging the complete output on the filesystem for later examination if needed.
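The rule cshatag applies (a changed hash with an unchanged modification time means corruption) can be sketched in a few lines of shell. This only illustrates the decision and is not cshatag's implementation: the check_file function and the .shatag sidecar file are hypothetical stand-ins for the user xattrs, and the sketch assumes GNU sha256sum and stat.

```shell
# Illustration of the detection rule only; this is NOT how cshatag is implemented.
# Assumption: the hash and mtime are kept in a hypothetical sidecar file
# "<file>.shatag" instead of user xattrs, so the sketch runs on any filesystem.
check_file() {
    local file="$1" record="$1.shatag"
    local hash mtime old_hash old_mtime

    hash=$(sha256sum "$file" | cut -d' ' -f1)
    mtime=$(stat -c %Y "$file")   # GNU stat: mtime in whole seconds

    if [ ! -f "$record" ]; then
        # First run: nothing to compare against, so just store the record.
        printf '%s %s\n' "$hash" "$mtime" > "$record"
        echo "new"
        return 0
    fi

    read -r old_hash old_mtime < "$record"

    if [ "$hash" = "$old_hash" ]; then
        echo "ok"
    elif [ "$mtime" = "$old_mtime" ]; then
        # Content changed but the mtime did not: nothing wrote to the file
        # through the normal path, so treat it as silent corruption.
        echo "corrupt"
        return 1
    else
        # Content and mtime both changed: a legitimate edit. Refresh the record.
        printf '%s %s\n' "$hash" "$mtime" > "$record"
        echo "outdated"
    fi
}
```

Calling check_file on the same path twice prints "new" and then "ok". It prints "corrupt" only when the content differs while the recorded mtime still matches.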
Both cshatag and runner provide installation directions in their READMEs, so I won’t duplicate them here.
On modern systems, including mine, xattrs are enabled by default on ext4 filesystems, so no changes to /etc/fstab were needed. And the system is already set up with Postfix and Mailgun to email me output from cron jobs, so no additional configuration was needed there. (runner also supports sending email directly, in case you can’t or don’t want to set up email systemwide.)
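If you are unsure whether user xattrs work on a given mount, a quick probe along these lines can confirm it before the first cshatag run. This is a sketch that assumes the setfattr and getfattr tools (from the attr package on Ubuntu) are installed; the xattrs_supported function and the user.xattr_probe attribute name are made up for this example.

```shell
# Probe whether user xattrs can be written and read back in a directory.
# Returns 0 if they work, 1 if they do not, 2 if the attr tools are missing.
# (xattrs_supported and user.xattr_probe are hypothetical names for this sketch.)
xattrs_supported() {
    local dir="$1" probe value

    if ! command -v setfattr >/dev/null 2>&1 || ! command -v getfattr >/dev/null 2>&1; then
        echo "setfattr/getfattr not found" >&2
        return 2
    fi

    # Create a throwaway file inside the directory being tested.
    probe=$(mktemp "$dir/.xattr-probe.XXXXXX") || return 1

    # Write a user attribute, then read its value back.
    if setfattr -n user.xattr_probe -v ok "$probe" 2>/dev/null &&
       value=$(getfattr -n user.xattr_probe --only-values "$probe" 2>/dev/null) &&
       [ "$value" = "ok" ]; then
        rm -f "$probe"
        return 0
    fi

    rm -f "$probe"
    return 1
}

# Example: xattrs_supported /mnt/storage && echo "user xattrs work"
```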
All that’s left is putting together a file in /etc/cron.d with the right combination of arguments! Here’s what the file /etc/cron.d/cshatag looks like on my NAS:
SHELL=/bin/bash
RUNNER_LOG_DIR=/var/log/cshatag
#
# cshatag: check for data corruption
# https://github.com/rfjakob/cshatag
#
# Monday: general
05 11 * * 1 root runner -job-name cshatag-general -work-dir /mnt/storage -healthy-exit 0 -healthy-exit 2 -healthy-exit 3 -- ionice -c2 -n7 nice -n19 /usr/local/bin/cshatag -qq -recursive ./general
# Wednesday: plex
05 11 * * 3 root runner -job-name cshatag-plex -work-dir /mnt/storage -healthy-exit 0 -healthy-exit 2 -healthy-exit 3 -- ionice -c2 -n7 nice -n19 /usr/local/bin/cshatag -qq -recursive ./plex
# Cleanup logs in RUNNER_LOG_DIR older than 30 days:
00 00 * * * root find "$RUNNER_LOG_DIR" -mtime +30 -name "*.log" -delete >/dev/null
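Before trusting the cleanup entry with -delete, it can be worth previewing what it would match. Here is a minimal sketch that uses the same find tests as the cron line but prints instead of deleting. The preview_old_logs helper is a hypothetical name, and the sketch relies on the *.log naming already assumed by the cron entry.

```shell
# List (rather than delete) the log files the cleanup job would remove.
# preview_old_logs is a hypothetical helper; DIR defaults to RUNNER_LOG_DIR.
preview_old_logs() {
    local dir="${1:-$RUNNER_LOG_DIR}" days="${2:-30}"
    # -mtime +N matches files whose age, rounded down to whole days,
    # is greater than N.
    find "$dir" -mtime +"$days" -name "*.log" -print
}

# Example: preview_old_logs /var/log/cshatag 30
```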
Let’s break down everything on the line for the “Monday: general” job:
- runner
- -job-name cshatag-general: specifies the job name used in logs, emails, and notifications.
- -work-dir /mnt/storage: work in the /mnt/storage directory.
- -healthy-exit 0 -healthy-exit 2 -healthy-exit 3: consider the exit codes 0, 2, and 3 to be “healthy” exit codes (not errors). cshatag returns 0 if no errors happened; 2 if “one or more files could not be opened”; or 3 if “one or more files is not a regular file.” I don’t want to be notified via email when any of these conditions are encountered.
- --: separates runner’s arguments from the name & arguments for the program it’ll execute.
- ionice -c2 -n7: run the program with the lowest priority in the “best effort” IO scheduling class (see man ionice).
- nice -n19: run the program with low CPU priority.
- /usr/local/bin/cshatag: the path to the cshatag binary.
- -qq: “quiet2 mode - only report corrupt files and errors.”
- -recursive: recursively process the contents of directories.
- ./general: the path (under /mnt/storage) for cshatag to process.
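To make the -healthy-exit behavior concrete, here is a rough sketch of the idea in shell: run a command, stay silent when its exit code is on the healthy list, and print the captured output otherwise. This is not runner's actual implementation. The names is_healthy_exit and run_job are hypothetical, and the real tool also handles logging and notifications.

```shell
# Return 0 if CODE appears in the list of healthy exit codes that follows it.
is_healthy_exit() {
    local code="$1" healthy
    shift
    for healthy in "$@"; do
        [ "$code" -eq "$healthy" ] && return 0
    done
    return 1
}

# run_job "HEALTHY CODES" CMD [ARGS...]
# Runs CMD, capturing its output. Prints nothing when the exit code is healthy;
# otherwise prints the exit code and the captured output, and returns that code.
run_job() {
    local healthy_codes="$1" output status=0
    shift
    output=$("$@" 2>&1) || status=$?

    # Intentionally unquoted so the space-separated codes split into arguments.
    if is_healthy_exit "$status" $healthy_codes; then
        return 0
    fi
    printf 'job failed with exit code %s\n%s\n' "$status" "$output"
    return "$status"
}

# Example, mirroring the healthy codes from the cron entry:
#   run_job "0 2 3" /usr/local/bin/cshatag -qq -recursive ./general
```

Since cron only sends mail when a job produces output, staying silent on healthy exit codes is what keeps the inbox quiet in this sketch.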
The end result is that, every Monday at 11:05am, cshatag runs recursively on /mnt/storage/general with low IO and CPU priority. If cshatag returns an error (indicating data corruption or a failure writing xattrs), I’ll be notified via email.