ZFS and block deduplication
Mark Woodward
markw-FJ05HQ0HCKaWd6l5hS35sQ at public.gmane.org
Fri Apr 22 11:41:43 EDT 2011
I have been trying to convince myself that a SHA-256 hash is sufficient to
uniquely identify blocks on a file system. Is anyone familiar with this?
The theory is that you take a hash of each block on disk, and the hash,
though much smaller than the block itself, is unique enough that the
probability of two different blocks producing the same hash is less than
the probability of a hardware failure.
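To make sure we're talking about the same scheme, here is some rough Python
I typed up to frame the question. It has nothing to do with ZFS's actual
implementation; the names are mine and the 32K block size is just the figure
I use in the example below:

import hashlib

BLOCK_SIZE = 32 * 1024  # 32K blocks, matching the example below

def dedup_blocks(path):
    """Read a file block by block and keep one copy per SHA-256 digest.

    This trusts the hash alone (no byte-for-byte verification), which is
    exactly the assumption I'm questioning.
    """
    store = {}   # digest -> block data (stand-in for the on-disk block store)
    refs = []    # ordered digests, i.e. the "file" as block references
    with open(path, "rb") as f:
        while True:
            block = f.read(BLOCK_SIZE)
            if not block:
                break
            digest = hashlib.sha256(block).digest()
            if digest not in store:
                store[digest] = block   # first time we've seen this content
            refs.append(digest)         # duplicates just add another reference
    return store, refs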
Now, I know basic statistics well enough not to play the lottery, but I'm
not sure I can get my head around this one. On a purely counting level,
assume a block size of 32K and a 256-bit (32-byte) hash: a 32K block is
262,144 bits, so there are 2^262144 possible block contents but only 2^256
possible hash values. By pigeonhole, every single hash value corresponds to
an astronomical number of distinct blocks. Right? For every unique block we
store, there is a huge family of other blocks that would collide with it.
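Spelling that counting out (someone please check my arithmetic; these are
just the 32K-block, 256-bit-hash numbers from above):

# Possible block contents versus possible hash values.
block_bits = 32 * 1024 * 8          # a 32K block is 262144 bits
hash_bits = 256                     # SHA-256 digest length

possible_blocks = 2 ** block_bits   # distinct 32K block contents
possible_hashes = 2 ** hash_bits    # distinct SHA-256 values

# On average, this many distinct blocks map onto each single hash value:
blocks_per_hash = possible_blocks // possible_hashes
print(blocks_per_hash.bit_length() - 1)   # 261888, i.e. 2**261888 blocks per hash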
Also, looking at the "birthday paradox": the usual collision calculation
assumes every block value is equally likely (and in reality we know that is
not 100% true), so aren't the designers' stated probability figures weaker
than they assume?
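As far as I understand it, the bound people quote is the standard birthday
approximation, which goes roughly like this. This is my back-of-the-envelope,
it assumes the hashes behave like uniformly random 256-bit values (exactly
the part I'm unsure real data gives you), and the 2**40 block count is just
a number I picked:

from math import log2

hash_bits = 256
d = 2 ** hash_bits                  # number of possible hash values

def collision_probability(n):
    """Approximate chance that any two of n uniformly random hashes collide.

    Standard birthday approximation, p ~ n**2 / (2*d), accurate while p is
    tiny, which is the only regime that matters here.
    """
    return n * n / (2 * d)

n = 2 ** 40                              # 2**40 unique 32K blocks, about 32 PiB
print(log2(collision_probability(n)))    # about -177, i.e. p ~ 2**-177

If that arithmetic is right, even a pool that size gives a collision
probability around 2**-177, which is the sort of figure people compare
against undetected hardware error rates. Whether the uniformity assumption
actually holds for real data is the part I can't convince myself of.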
I come from the old school where "God does not play dice," especially with
storage.
Given a small enough block size and a small enough data set, I can almost
see it as safe enough for backups, but I certainly wouldn't put
mission-critical data on it. Would you? Tell me how I'm flat out wrong.
I need to hear it.