ZFS and block deduplication
Mark Woodward
markw-FJ05HQ0HCKaWd6l5hS35sQ at public.gmane.org
Sat Apr 23 15:05:06 EDT 2011
On 04/23/2011 08:31 AM, Edward Ned Harvey wrote:
>> From: discuss-bounces-mNDKBlG2WHs at public.gmane.org [mailto:discuss-bounces-mNDKBlG2WHs at public.gmane.org] On Behalf
>> Of Mark Woodward
>>
>> I have been trying to convince myself that the SHA2/256 hash is
>> sufficient to identify blocks on a file system. Is anyone familiar with
>> this?
> I am intimately familiar with this. And on planet Earth, yes it is
> sufficient. However, the cost of explaining your decision to anyone who is
> skeptical outweighs the cost of enabling verification. So as long as
> there's even a possibility that anybody might challenge your decision, you
> should enable verification. But if you're the boss and nobody will
> second-guess you, then you can safely rely on just the hash.
>
> If you'd like to read more about it, try this search... Google for
> zfs-discuss collision
> http://www.google.com/search?q=zfs-discuss+collision&ie=utf-8&oe=utf-8&aq=t&
> rls=org.mozilla:en-US:official&client=firefox-a
> Mostly in the extremely long thread, "(Fletcher+Verification) versus
> (Sha256+No Verification)"
>
> PS. "Safely" is a relative term. The probability of collision is non-zero,
> but the probability is essentially zero relative to Human Extinction Events,
> which means you can (relatively) safely rely on just the sha256 hash.
>
> If you want to calculate the actual probability of a collision, assume a 4k
> (2^12 bytes) block size (worst case) and every single block is precisely the
> same size (which isn't realistic, but is worst case) and every single block
> is unique (in which case why have you enabled dedup. So again,
> unrealistically evil-clown scenario worst case) and if your data pool size
> is the largest in the world (again worst case) say ... 2PB (2^41 bytes)...
> that would be physically impossible to hold the dedup tables using hardware
> currently available in the world, but again... Evil clown worst case for
> accidental collision. Then the number of blocks is 2^41 / 2^12 = 2^29
> unique blocks.
>
> The formula on wikipedia for the birthday problem is:
> p(n;d) ~= 1-( (d-1)/d )^( 0.5*n*(n-1) )
>
> In this case,
> n=2^29
> d=2^256
>
> Using bc to calculate the answer:
> bc -l
>
> n=2^29
> d=2^256
> scale=1024
> 1-e( ( 0.5*n*(n-1)*l((d-1)/d) ) )
> .0000000000000000000000000000000000000000000000000000000000012446030
> I manually truncated here (precision goes out 1024 places). This is
> 1.24E-60
>
> Notice: There are estimated 1E50 atoms in Earth. So in the evil clown
> worst case for sha256 collision, the probability for collision is about the
> same as randomly selecting the same atom twice consecutively from 1000
> Earths.
>
> Note: I had to repeat the calculation many times in bc, setting a larger and
> larger scale. The default scale of 20, and even 64 and 70 and 80 were not
> precise enough to produce a convergent answer around the -57th decimal
> place. So I just kept going larger, and in retrospect, anything over 100
> would have been fine. I wrote 1024 above, so who cares.
You know, I've read the same math and I've worked it out myself. I agree
it sounds so astronomical as to be unrealistic to even imagine it, but
no matter how astronomical the odds, someone usually wins the lottery.
I'm just trying to assure myself that there isn't some probability
calculation missing. I guess my gut is telling me this is too easy.
We're missing something.
Besides, personally, I'm looking at 16K blocks which increases the
probability a bit.
More information about the Discuss
mailing list