File level parity checks
Bill Bogstad
bogstad-e+AXbWqSrlAAvxtiuMwx3w at public.gmane.org
Thu Jan 8 12:45:52 EST 2009
On Wed, Jan 7, 2009 at 12:57 PM, Doug <dougsweetser-Re5JQEeQqe8AvxtiuMwx3w at public.gmane.org> wrote:
>[concerns about backing up corrupted files...]
I've been thinking about this as well. What the tool should do
depends on what you are trying to accomplish. Do you want to detect
file corruption OR do you want to be able to correct corruption as
well? The tool you mention, par2, does both at the expense of
additional disk space. To my mind, if you already have a good backup
system in place (i.e. you have multiple copies of all your data) all
you really need is a way to detect corruption. Catching corruption
in the primary copy matters most, since errors there will eventually
filter into your backups. Any tool that does file checksums and
comparisons can potentially be used for this purpose. Fortunately,
there are already a number of tools out there which do this; they are
normally used for intrusion detection in the wake of security
problems.
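(To make that concrete, here's a minimal sketch of the detect-only
idea in Python. It is purely illustrative, not what aide does
internally; the root path and manifest filename are made up.)

    #!/usr/bin/env python3
    # Sketch: build/compare a checksum manifest to detect silent
    # file corruption. Paths and filenames are illustrative.
    import hashlib, json, os

    ROOT = os.path.expanduser("~")    # tree to monitor
    MANIFEST = "checksums.json"       # previous run's checksums

    def sha256_of(path):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def scan(root):
        sums = {}
        for dirpath, _, files in os.walk(root):
            for name in files:
                path = os.path.join(dirpath, name)
                try:
                    sums[path] = sha256_of(path)
                except OSError:
                    pass              # unreadable file: skip it
        return sums

    old = {}
    if os.path.exists(MANIFEST):
        with open(MANIFEST) as f:
            old = json.load(f)
    new = scan(ROOT)
    # Report only files present in both runs whose contents changed;
    # creations and deletions are deliberately ignored (see below).
    for path in sorted(set(old) & set(new)):
        if old[path] != new[path]:
            print("CHANGED:", path)
    with open(MANIFEST, "w") as f:
        json.dump(new, f)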
I ended up picking the aide package for this purpose. I have it set up
to run every night and email the list of files whose checksums have
changed since the previous run. This took a little tweaking of its
configuration file, but it is doable. To save your sanity, you really
want it to ignore file deletion and new file creation. As it is,
there is still too much noise as a result of browser cache directory
index files, etc. Because of this I have aide totally ignore certain
directories where files change frequently and don't matter to me. I still
get a certain amount of noise due to .gconf directories, etc.; but it
isn't bad. I could have set it up to only check certain directories,
but I felt that scanning everything and ignoring the noise was better
than having to remember to add new entries to the configuration file
every time I created a new directory. I'm only doing this on my primary
(active copy) at the moment.
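For what it's worth, the relevant pieces of my setup look roughly
like the sketch below. I'm writing the syntax from memory, so treat
it as illustrative and check aide.conf(5); the directory names are
placeholders rather than my actual paths.

    # /etc/aide.conf (sketch)
    database=file:/var/lib/aide/aide.db
    database_out=file:/var/lib/aide/aide.db.new

    # rule: permissions, inode, link count, owner, group, size,
    # plus a checksum
    ChkRule = p+i+n+u+g+s+sha1

    /home ChkRule
    # totally ignore directories that churn constantly
    !/home/someuser/.mozilla
    !/home/someuser/tmp

The nightly run is just a cron entry along these lines (again,
illustrative):

    30 3 * * * root aide --check 2>&1 | mail -s "aide report" someuser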
Currently I have over 435,000 files (about 90 Gbytes) being
monitored in my home directory, and each copy of the aide database
takes up about 70 Mbytes. I generally keep a week or two of databases
in case I notice something odd. So far this hasn't happened. Oh,
the cron job takes about two hours every night to run the
checksum/comparison. You could say I'm trading the CPU time of the
nightly comparisons against the additional disk storage that par2
would require. OTOH, par2 would not tell me my data had been
corrupted until I actually went and ran it anyway.
Anyway, I hope this gives you some ideas for one possible way to deal
with these concerns. For permanent backups, you could generate a
static aide database and then protect that with par2. Again, this
would be a detection-only setup, but still worth it in my opinion.
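Something along these lines would do it (the commands are real, but
the config path, filenames, and redundancy level are only for
illustration):

    # build a one-shot aide database for the archived tree...
    aide --init --config=/etc/aide-archive.conf   # writes aide.db.new
    # ...then add ~10% par2 recovery data so the database itself can
    # be verified and repaired later
    par2 create -r10 aide.db.par2 aide.db.new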
Bill Bogstad