ZFS and block deduplication
Daniel Feenberg
feenberg-fCu/yNAGv6M at public.gmane.org
Mon Apr 25 09:32:50 EDT 2011
On Mon, 25 Apr 2011, Mark Woodward wrote:
> On 04/24/2011 10:52 PM, Edward Ned Harvey wrote:
>>> From: Mark Woodward [mailto:markw-FJ05HQ0HCKaWd6l5hS35sQ at public.gmane.org]
>>>
>>> You know, I've read the same math and I've worked it out myself. I agree
>>> it sounds so astronomical as to be unrealistic to even imagine it, but no
>>> matter how astronomical the odds, someone usually wins the lottery.
>>>
>>> I'm just trying to assure myself that there isn't some probability
>>> calculation missing. I guess my gut is telling me this is too easy.
>>> We're missing something.
>> See - You're overlooking my first point. The cost of enabling verification
>> is so darn near zero that you should simply enable verification for the
>> sake of not having to justify your decision to anybody (including yourself,
>> if you're not feeling comfortable).
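
Agreed on the cost point - verification only does extra work when two blocks
already hash to the same digest. A minimal sketch of the hash-then-verify
idea, in Python rather than anything from the actual ZFS code (the names
write_block and store are made up purely for illustration):

    import hashlib

    store = {}  # toy dedup table: digest -> stored block

    def write_block(block, verify=True):
        digest = hashlib.sha256(block).digest()
        existing = store.get(digest)
        if existing is None:
            store[digest] = block          # new digest: store the block
            return "stored"
        if not verify or existing == block:
            return "deduplicated"          # same bytes: safe to share one copy
        return "collision"                 # same digest, different bytes: don't dedup

The byte-for-byte compare only runs after a digest match, which is why
turning verification on costs close to nothing.
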
> Actually, I'm using ZFS as an example. I'm doing something different, but
> the theory is the same, and yes, I'm still using SHA256.
>> Actually, there are two assumptions being made:
>> (1) We're assuming sha256 is an ideally distributed hash function. Nobody
>> can prove that it's not - so we assume it is - but nobody can prove that it
>> is either. If the hash distribution turns out to be imbalanced, for example
>> if there's a higher probability of certain hashes than other hashes... Then
>> that would increase the probability of hash collision.
> True.
>> (2) We're assuming the data in question is not being maliciously formed for
>> the purposes of causing a hash collision. I think this is a safe
>> assumption, because in the event of a collision, you would have two
>> different pieces of data that are assumed to be identical and therefore one
>> of them is thrown away... And personally I can accept the consequence of
>> discarding data if someone's intentionally trying to break my filesystem
>> maliciously.
> I'm not sure this point is important. I trust that it is pretty darn hard
> to create a SHA256 collision. I would almost believe that blocks are more
> likely to collide by random chance than by malice.
>>> Besides, personally, I'm looking at 16K blocks which increases the
>>> probability a bit.
>> You seem to have that backward - First of all the default block size is (up
>> to) 128k... and the smaller the blocksize of the filesystem, the higher the
>> number of blocks and therefore the higher the probability of collision.
> This is one of those things that make my brain hurt. If I am
> representing more data with a fixed-size number, i.e. a 4K block vs a
> 16K block, that does, in fact, increase the probability of collision 4X,
Only for very small blocks. Once the block is larger than the hash, the
probability of a collision is independent of the block size.
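
To spell that out: for an ideal b-bit hash, any two different blocks collide
with probability about 2^-b, no matter how large the blocks are, once each
block is at least as large as the digest itself. The block size only changes
how many blocks a given pool breaks into, which is the birthday effect
discussed below:

    Pr[hash(x) = hash(y)]             ~ 2^-256   for any two distinct blocks x, y
    Pr[any collision among n blocks]  ~ n*(n-1)/2 * 2^-256
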
Daniel Feenberg
> however, it does decrease the total number of blocks by about 4x as well.
>
>
>> If, for example, you had 1TB of data broken up into 1MB blocks, then you
>> would have a total of 2^20 blocks. But if you broke it up into 1KB blocks,
>> then your block count would be 2^30. With a higher number of blocks being
>> hashed, you get a higher probability of hash collision.
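
For concreteness, plugging those two block counts into the birthday
approximation (assuming SHA-256 behaves like an ideal 256-bit hash; this is
just back-of-the-envelope Python, not anything from ZFS):

    def collision_probability(n_blocks, hash_bits=256):
        # birthday approximation: p ~ n*(n-1)/2 / 2**hash_bits
        return n_blocks * (n_blocks - 1) / 2 / 2 ** hash_bits

    for n in (2 ** 20, 2 ** 30):   # 1TB in 1MB blocks vs. 1TB in 1KB blocks
        print(n, collision_probability(n))
    # roughly 4.8e-66 and 5.0e-60 respectively

So the 1KB layout is about a million times more collision-prone than the 1MB
layout, but both figures are still astronomically small.
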
> It comes down to absolute trust that the hashing algorithm works as
> expected and that the data is as randomly distributed as expected.
>
> I'm sort of old school, I guess. The mindset is not about probability,
> it is about absolutes. In data storage, it has always been about
> verifiability, and we conveniently treat the probability of failure as a
> different problem and address it differently. This methodology seems to
> merge the two. Statistically speaking, I think I'm looking for 100%
> assurances, and no such assurance has ever really existed.
>
> It's cool stuff. It is a completely different way of looking at storage.
>
>
>