Backing up sparse files ... VM's and TrueCrypt ... etc

Sun Feb 21 15:26:37 EST 2010

Edward Ned Harvey wrote:
> - Never use --sparse when creating an archive that is 
> compressed.  It’s pointless, and doubles the time to create archive.
> 
> - Yes, use --sparse during extraction, if the contents contain a 
> lot of serial 0’s and you want the files restored to a sparse state.
> 
> The man page saying “using '--sparse' is not needed on extraction” is 
> misleading.  It’s technically true – you don’t need it – but it’s 
> misleading – yes you need it if you want the files to be extracted sparsely.

Have you confirmed that through code inspection or experimentation?

I haven't tested it, but as I dug deeper and saw that they had a special 
tar file header for sparse files, it made perfect sense that the 
'--sparse' option was superfluous on extraction, because tar can see 
from the header that the file is flagged as being sparse. It's logical 
that they'd hard-wire the "sparse writing" magic to be activated by that 
flag, and ignore the command-line option.

Also consider that the code to detect strings of zeros seems to be on 
the read side (based on the man page description). On extraction, it 
wouldn't make sense to expand the unused portions to strings of zeros, 
then follow that by code that detects the zeros and seeks past them to 
write a sparse file.
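
To make "detecting zeros on the read side" concrete, here is a minimal 
Python sketch. It is not tar's actual code; the function name, the 
512-byte block size, and the (offset, length) output format are all 
assumptions made for illustration. It scans a stream block by block and 
records only the runs that contain non-zero data:

```python
import io

BLOCK = 512  # assumed block size for this sketch; not taken from tar's source


def data_extents_by_scanning(stream, block=BLOCK):
    """Return ([(offset, length), ...], total_size) for runs of non-zero blocks."""
    extents = []
    offset = 0
    run_start = None  # offset where the current data run began, if any
    while True:
        chunk = stream.read(block)
        if not chunk:
            break
        if not any(chunk):
            # Block is entirely zero bytes: close any open data run.
            if run_start is not None:
                extents.append((run_start, offset - run_start))
                run_start = None
        elif run_start is None:
            # First non-zero block after a run of zeros: open a new data run.
            run_start = offset
        offset += len(chunk)
    if run_start is not None:
        extents.append((run_start, offset - run_start))
    return extents, offset


if __name__ == "__main__":
    sample = b"\0" * 1024 + b"abc" + b"\0" * 1021 + b"x"
    print(data_extents_by_scanning(io.BytesIO(sample)))
```

Note that every byte still has to be read to build this map, which is 
why a scanning approach implies an extra pass over the file.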

You can test this by creating an archive, without the '--sparse' 
option, of a file containing several blocks of zeros followed by a few 
bytes of data. Then extract it with the '--sparse' option and see if it 
gets turned into a sparse file.
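
A rough Python sketch of that experiment follows. It assumes GNU tar is 
on the PATH and that st_blocks counts 512-byte units (true on Linux). 
The file names, sizes, and helper functions are made up for the test. 
It doesn't presuppose the outcome; it just reports the numbers:

```python
import os
import subprocess

ZERO_BYTES = 64 * 4096   # "several blocks of zeros" (size chosen arbitrarily)
TAIL = b"data\n"         # "a few bytes of data"


def make_test_file(path):
    """Write real zeros (not a hole) followed by a few bytes of data."""
    with open(path, "wb") as f:
        f.write(b"\0" * ZERO_BYTES)
        f.write(TAIL)


def allocated_bytes(path):
    """Disk space actually allocated; assumes st_blocks is in 512-byte units."""
    return os.stat(path).st_blocks * 512


def run_experiment(workdir):
    """Archive without --sparse, extract with --sparse, report the result."""
    src_dir = os.path.join(workdir, "src")
    out_dir = os.path.join(workdir, "out")
    os.makedirs(src_dir)
    os.makedirs(out_dir)
    make_test_file(os.path.join(src_dir, "zeros.bin"))
    archive = os.path.join(workdir, "plain.tar")
    # Create the archive WITHOUT --sparse.
    subprocess.run(["tar", "-cf", archive, "-C", src_dir, "zeros.bin"],
                   check=True)
    # Extract WITH --sparse.
    subprocess.run(["tar", "--sparse", "-xf", archive, "-C", out_dir],
                   check=True)
    restored = os.path.join(out_dir, "zeros.bin")
    return {
        "apparent_size": os.path.getsize(restored),
        "allocated": allocated_bytes(restored),
    }


if __name__ == "__main__":
    import tempfile
    with tempfile.TemporaryDirectory() as tmp:
        print(run_experiment(os.path.join(tmp, "work")))
```

If "allocated" comes out far smaller than "apparent_size", the 
extraction produced a sparse file; if the two are roughly equal, it 
didn't. Run it on a filesystem that supports holes, or the comparison 
tells you nothing.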

> ...you may be overestimating the time to read or md5sum all the 0's
> in the hole of sparse files.

Perhaps, but...

> The hypothetical sparse_cat would improve performance, but just
> marginally.

...it would eliminate the need for a two-pass read with tar. And if 
summing zeros is fast, why is rsync so slow in your experiments?

(A literal sparse_cat, a drop-in replacement for cat, wouldn't actually 
be that useful: the process receiving the stream needs to be told the 
byte offset of each chunk of data, assuming you want to be able to 
reconstruct the sparse file later with the same holes. So practically 
speaking, this is something you'd have to integrate into tar, gzip, 
rsync, or whatever archiver you're using.

It sounds like it would be a small project to patch tar to use the 
fcntl, as it already has a data structure figured out for recording the 
holes. But you'd still need additional hacks to do incremental 
transfers. So the bigger win would be patching rsync.)
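
To show the kind of offset information that has to travel with the 
data, here is a minimal Python sketch. As a stand-in for the fcntl 
discussed above it uses lseek() with SEEK_DATA/SEEK_HOLE, which is a 
different interface and is only available on some platforms; that 
substitution, the function names, and the (offset, length) format are 
my assumptions, not anything tar or rsync does today:

```python
import errno
import os

CHUNK = 1 << 20  # copy buffer size, chosen arbitrarily


def data_extents(path):
    """Return [(offset, length), ...] for regions the filesystem reports as data."""
    size = os.path.getsize(path)
    if not hasattr(os, "SEEK_DATA"):
        # No hole information on this platform: treat the file as one extent.
        return [(0, size)] if size else []
    extents = []
    fd = os.open(path, os.O_RDONLY)
    try:
        offset = 0
        while offset < size:
            try:
                start = os.lseek(fd, offset, os.SEEK_DATA)
            except OSError as e:
                if e.errno == errno.ENXIO:  # nothing but a hole up to EOF
                    break
                raise
            end = os.lseek(fd, start, os.SEEK_HOLE)
            if end <= start:  # defensive: file changed underneath us
                break
            extents.append((start, end - start))
            offset = end
    finally:
        os.close(fd)
    return extents


def sparse_copy(src, dst):
    """Copy only the data extents, seeking over holes, then restore the size."""
    size = os.path.getsize(src)
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        for start, length in data_extents(src):
            fin.seek(start)
            fout.seek(start)
            remaining = length
            while remaining:
                chunk = fin.read(min(remaining, CHUNK))
                if not chunk:
                    break
                fout.write(chunk)
                remaining -= len(chunk)
        fout.truncate(size)  # extends the file if it ends in a hole
```

The holes are never read here, so there is no second pass. A filesystem 
that doesn't track holes will simply report the whole file as one data 
extent, and the copy degrades to an ordinary one.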

  -Tom

-- 
Tom Metro
Venture Logic, Newton, MA, USA
"Enterprise solutions through open source."
Professional Profile: http://tmetro.venturelogic.com/