Backing up sparse files ... VM's and TrueCrypt ... etc
Tom Metro
tmetro-blu-5a1Jt6qxUNc at public.gmane.org
Sun Feb 21 15:26:37 EST 2010
Edward Ned Harvey wrote:
> · Never use --sparse when creating an archive that is
> compressed. It’s pointless, and doubles the time to create archive.
>
> · Yes, use --sparse during extraction, if the contents contain a
> lot of serial 0’s and you want the files restored to a sparse state.
>
> The man page saying “using '--sparse' is not needed on extraction” is
> misleading. It’s technically true – you don’t need it – but it’s
> misleading – yes you need it if you want the files to be extracted sparsely.
Have you confirmed that through code inspection or experimentation?
I haven't tested it, but as I dug deeper and saw that they had a special
tar file header for sparse files, it made perfect sense that the
'--sparse' option was superfluous on extraction, because tar can see
from the header that the file is flagged as being sparse. It's logical
that they'd hard wire the "sparse writing" magic to be activated by that
flag, and ignore command line options.
Also consider that the code to detect strings of zeros seems to be on
the read side (based on the man page description). On extraction, it
wouldn't make sense to expand the unused portions to strings of zeros,
then follow that by code that detects the zeros and seeks past them to
write a sparse file.
You can test this by tarring a file containing several blocks of zeros
followed by a few bytes of data, without the '--sparse' option. Then
extract it with the '--sparse' option and see if it gets turned into a
sparse file.
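That experiment could be scripted roughly as follows. This is only a
sketch: it assumes GNU tar is on the PATH and a Unix-like system where
st_blocks is reported in 512-byte units. The function names
(make_test_file, allocated_bytes, run_experiment) and the file names are
made up for illustration.

```python
import os
import subprocess
import tempfile


def make_test_file(path, zero_blocks=256, block_size=4096, payload=b"data"):
    """Write real zero bytes (not a hole) followed by a few bytes of data."""
    with open(path, "wb") as f:
        for _ in range(zero_blocks):
            f.write(b"\0" * block_size)
        f.write(payload)


def allocated_bytes(path):
    """Space actually allocated on disk (assumes 512-byte st_blocks units)."""
    return os.stat(path).st_blocks * 512


def run_experiment():
    """Archive without --sparse, extract with --sparse, report the sizes."""
    with tempfile.TemporaryDirectory() as work:
        archive = os.path.join(work, "test.tar")
        out = os.path.join(work, "out")
        os.mkdir(out)
        make_test_file(os.path.join(work, "zeros.bin"))
        # Create the archive WITHOUT --sparse (assumes GNU tar).
        subprocess.run(["tar", "-cf", archive, "-C", work, "zeros.bin"],
                       check=True)
        # Extract WITH --sparse.
        subprocess.run(["tar", "--sparse", "-xf", archive, "-C", out],
                       check=True)
        restored = os.path.join(out, "zeros.bin")
        # Return (apparent size, allocated size) of the extracted file.
        return os.path.getsize(restored), allocated_bytes(restored)
```

Calling run_experiment() returns the apparent size and the allocated
size of the extracted file. If the allocated size comes back much
smaller than the apparent size, extraction created holes; if the two are
about equal, '--sparse' made no difference on extraction for a member
that was not archived as sparse.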
> ...you may be overestimating the time to read or md5sum all the 0's
> in the hole of sparse files.
Perhaps, but...
> The hypothetical sparse_cat would improve performance, but just
> marginally.
...it would eliminate the need for a two-pass read with tar. And if
summing zeros is fast, why is rsync so slow in your experiments?
(A literal sparse_cat (drop-in replacement for cat) wouldn't actually be
that useful, as you need to communicate to the process receiving the
stream the byte offset for each chunk of data, assuming you want to be
able to reconstruct the sparse file later with the same holes. So
practically speaking, this is something you'd have to integrate into
tar, gzip, rsync, or whatever archiver you're using.
It sounds like it would be a small project to patch tar to use the
fcntl, as it already has a data structure figured out for recording the
holes. But you'd still need additional hacks to do incremental
transfers. So the bigger win would be patching rsync.)
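To illustrate why the offsets have to travel with the data, here is a
minimal sketch of such a stream. The record layout (an 8-byte offset and
8-byte length ahead of each chunk, with a zero-length record marking the
final size) and the names sparse_encode and sparse_decode are invented
for this example; this is not tar's sparse format. It also still finds
holes by reading and scanning for zero blocks, so it shows only the
framing, not the faster hole discovery discussed above.

```python
import io
import struct

BLOCK = 4096                    # assumed chunk size for zero detection
HEADER = struct.Struct(">QQ")   # hypothetical record header: offset, length


def sparse_encode(src, dst, block=BLOCK):
    """Copy src to dst as (offset, length, data) records, skipping zero blocks."""
    offset = 0
    while True:
        chunk = src.read(block)
        if not chunk:
            break
        if any(chunk):  # block contains at least one non-zero byte
            dst.write(HEADER.pack(offset, len(chunk)))
            dst.write(chunk)
        offset += len(chunk)
    # A zero-length record at the end carries the total file size.
    dst.write(HEADER.pack(offset, 0))


def sparse_decode(src, dst):
    """Rebuild the file by seeking to each recorded offset before writing."""
    while True:
        header = src.read(HEADER.size)
        if len(header) < HEADER.size:
            raise ValueError("truncated stream")
        offset, length = HEADER.unpack(header)
        if length == 0:
            dst.truncate(offset)  # restore final size, including a trailing hole
            break
        dst.seek(offset)          # skipped ranges become holes if the fs allows
        dst.write(src.read(length))
```

The destination for sparse_decode should be a real file opened for
writing in binary mode; whether the skipped ranges end up as actual
holes depends on the filesystem.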
-Tom
--
Tom Metro
Venture Logic, Newton, MA, USA
"Enterprise solutions through open source."
Professional Profile: http://tmetro.venturelogic.com/