
* Posts by PG 1

2 posts • joined Thursday 5th November 2009 18:00 GMT

PG 1
FAIL

@AC and @Hugh McIntyre

I understand the probabilities very well. What you both don't seem to realise is that *any* collision probability greater than 0 makes hashing alone unacceptable as a method of deduplicating hard drive data.

You both seem to be arguing "it is so unlikely and so rare, it'll never happen" - but what if that block happens to match something crucial, from something in an .mp3 file to your company's core database - or what about your bank account? A dedupe collision in any of those will cause pain from "this song won't play anymore" to "what do you mean my bank account is empty?". You won't think it is acceptable then.

Again, I repeat, all hashes by definition have collisions (you can't fit 1 billion distinct inputs into 1 million slots - mathematically you have 1000 possible inputs for each hash value assuming an equispaced distribution, and some values get even more if it is not equispaced). Regardless of how likely a collision is, you can't use a hash for anything more than indicating that there MIGHT be a duplicate block and to act accordingly.
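To make the pigeonhole point concrete, here is a minimal sketch in Python. It assumes a deliberately tiny 16-bit hash (SHA-256 truncated, purely for illustration; `tiny_hash` and `find_collision` are hypothetical names, not anything a real drive uses) and feeds it one more input than there are hash values, so a collision is guaranteed.

```python
import hashlib


def tiny_hash(data: bytes, bits: int = 16) -> int:
    """Illustrative only: keep the top `bits` bits (at most 32) of SHA-256."""
    digest = hashlib.sha256(data).digest()
    return int.from_bytes(digest[:4], "big") >> (32 - bits)


def find_collision(bits: int = 16):
    """Hash 2**bits + 1 distinct inputs; two of them must share a hash value."""
    seen = {}  # hash value -> first input that produced it
    for i in range(2 ** bits + 1):
        block = i.to_bytes(8, "big")
        h = tiny_hash(block, bits)
        if h in seen:
            return seen[h], block, h
        seen[h] = block
    raise AssertionError("unreachable: more inputs than hash values")


a, b, h = find_collision()
print(a != b, tiny_hash(a) == tiny_hash(b))

# The ratio from the post: 1 billion inputs over 1 million hash values.
print(10**9 // 10**6)
```

A real hash is vastly wider than 16 bits, so collisions are far rarer, but the counting argument is the same: more possible blocks than hash values means some distinct blocks must share a hash.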

PG 1

re: "hash-based block addressing, deduping is a non-issue"

Oh dear... hash algorithms have collisions. You can't rely on hashing alone. Hashing can only tell you that there *might* be a dupe - you then need to compare both blocks bit-by-bit to ensure that they are really duplicates.
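A rough sketch of that "hash says maybe, compare says yes" approach, in Python. The `DedupeStore` class and its in-memory lists are assumptions made for illustration (a real drive would read the candidate block back from disk); the optional `hash_fn` parameter exists only so a deliberately weak hash can be plugged in to force a collision.

```python
import hashlib


class DedupeStore:
    """Hypothetical block store: the hash finds candidates, a full compare confirms."""

    def __init__(self, hash_fn=None):
        # Default to SHA-256; a weak hash can be injected to provoke collisions.
        self.hash_fn = hash_fn or (lambda b: hashlib.sha256(b).digest())
        self.blocks = []   # stands in for physical blocks on disk
        self.index = {}    # hash value -> list of block ids with that hash

    def write(self, data: bytes) -> int:
        key = self.hash_fn(data)
        # A matching hash only means there MIGHT be a duplicate.
        for block_id in self.index.get(key, []):
            if self.blocks[block_id] == data:   # full compare of the contents
                return block_id                 # genuine duplicate: reuse it
        # New data (or a hash collision with different contents): store it.
        self.blocks.append(data)
        block_id = len(self.blocks) - 1
        self.index.setdefault(key, []).append(block_id)
        return block_id


# Weak hash (first byte only) so that b"abc" and b"abd" collide.
store = DedupeStore(hash_fn=lambda b: b[:1])
first = store.write(b"abc")
second = store.write(b"abd")   # same hash, different data: must not be merged
third = store.write(b"abc")    # true duplicate: merged with the first block
print(first, second, third, len(store.blocks))
```

The cost is the extra read-and-compare on every hash match, but that only happens on the writes that look like duplicates.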

However, because you can't stick a TB of RAM in a HD (or can you?), you can't keep all those blocks in memory. You therefore need to read the candidate block back from disk and do the compare before performing the write. That is likely to be a noticeable performance drag - maybe acceptable in the rare-ish dedupe-write events.

Minimally, a 1TB drive will need approx 1GB of RAM onboard to maintain the block hashes for lookup. What happens after a reboot? Does the drive have to read itself back in full to rebuild the block-hash lookup table before allowing any writes again?
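For a back-of-envelope check of that RAM figure, here is a small Python sketch. The block size (4 KiB) and the size of each hash entry (4 bytes) are assumptions chosen for illustration; the post does not state either, and `index_ram_bytes` is a hypothetical helper.

```python
def index_ram_bytes(drive_bytes: int, block_bytes: int, entry_bytes: int) -> int:
    """RAM needed to hold one hash entry per block (ignores table overhead)."""
    blocks = drive_bytes // block_bytes
    return blocks * entry_bytes


TIB = 2 ** 40
GIB = 2 ** 30

# Assumed: 4 KiB blocks, 4-byte (32-bit) hash per block.
small = index_ram_bytes(TIB, 4096, 4)
print(small / GIB)   # size of the index in GiB

# Assumed: same blocks, but a full 32-byte (256-bit) hash per block.
large = index_ram_bytes(TIB, 4096, 32)
print(large / GIB)
```

Under those assumptions the 4-byte case comes to 1 GiB, in line with the "approx 1GB" estimate, and a wider hash scales the figure up proportionally.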

Seems a little optimistic to me.
