[Discuss] Deduplication

Kent Borg kentborg at borg.org
Thu Sep 5 00:06:52 EDT 2024


For many years now I've been good about keeping offline backups on 
(encrypted) external disks. I have been backing up my daily computer(s) 
over several generations of said computers. Which means I manage to put 
large amounts of data on big disks the modern way: by collecting and 
storing duplicates of stuff.

I am a big fan of rsync's "--link-dest" feature, so complete backup 
trees actually share common files that didn't change. But sometimes 
copies (I stored those photos twice?) slip in, or things get moved.
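For anyone who hasn't used it: with "--link-dest", a file that is unchanged since the previous backup is stored as a hard link to the old copy, so both backup trees point at the same data on disk. Here is a minimal Python sketch of that effect, not rsync itself; the directory and file names are made up for illustration.

```python
import os
import tempfile


def same_inode(a, b):
    """Return True if both paths are hard links to the same file on disk."""
    sa, sb = os.stat(a), os.stat(b)
    return (sa.st_dev, sa.st_ino) == (sb.st_dev, sb.st_ino)


def demo():
    """Build two tiny 'backup trees' that share one unchanged file."""
    with tempfile.TemporaryDirectory() as root:
        # Hypothetical backup directories, one per backup run.
        old_tree = os.path.join(root, "backup-1")
        new_tree = os.path.join(root, "backup-2")
        os.mkdir(old_tree)
        os.mkdir(new_tree)

        old_file = os.path.join(old_tree, "photo.jpg")
        new_file = os.path.join(new_tree, "photo.jpg")
        with open(old_file, "wb") as f:
            f.write(b"x" * 1024)

        # The file did not change, so the new tree gets a hard link
        # instead of a second copy of the data.
        os.link(old_file, new_file)

        return same_inode(old_file, new_file), os.stat(old_file).st_nlink


if __name__ == "__main__":
    print(demo())
```

Both names show up in both trees, but the data is stored once. That is also why a separately stored copy of the same photos is not caught: it is a different file with the same contents, not another link.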

So today I ran "duperemove" on a couple volumes, and it scared up some 
non-trivial space. I decided to run it on a third volume.

Nope! It works by telling the kernel to make matching files share the 
same extents, but that only works on some file systems.

- XFS: yes. I have used that for a long time. It is clever enough to CoW 
any changes that are made later, so files that match now can still diverge.

- btrfs: yes. I have been using it recently, and god knows it is heavy 
in the CoW-ing world.
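The first half of that job, finding files that match, can be roughly sketched in Python. This is only an illustration under simplifying assumptions: it compares whole files by SHA-256, whereas duperemove hashes at a finer granularity and then asks the kernel to share the extents, which this sketch does not attempt. The function names are my own.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file's contents, read in chunks to limit memory use."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root):
    """Return sorted groups of paths under root with identical contents.

    Paths that are hard links to an already-seen file are skipped,
    since they share storage already.
    """
    by_size = defaultdict(list)
    seen_inodes = set()
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            st = os.stat(path)
            key = (st.st_dev, st.st_ino)
            if key in seen_inodes:
                continue
            seen_inodes.add(key)
            by_size[st.st_size].append(path)

    groups = []
    for size, paths in by_size.items():
        # Files of different sizes cannot match; empty files save nothing.
        if size == 0 or len(paths) < 2:
            continue
        by_hash = defaultdict(list)
        for p in paths:
            by_hash[file_digest(p)].append(p)
        groups.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return sorted(groups)


if __name__ == "__main__":
    import sys
    for group in find_duplicates(sys.argv[1] if len(sys.argv) > 1 else "."):
        print(group)
```

Grouping by size first avoids hashing files that cannot possibly match, which matters on a disk full of old backups.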


But it doesn't work on any of the extN filesystems. I have used XFS on 
my running volumes for a long time, but for backups I guess I stuck 
with ext4 longer, and maybe even earlier ext-s on some disks. Those 
disks aren't active, though, so that's okay.

And anyway, backups are backups, not working containers. I'm happy I can 
dedup what I can.


-kb, the Kent who has worried about bit rot in backups ever since disks 
got big enough to hold lots of idle data, and who is reassured that 
btrfs CRCs both metadata and data.


