endless compression

dsr at tao.merseine.nu dsr at tao.merseine.nu
Fri Aug 13 07:13:01 EDT 2004


On Thu, Aug 12, 2004 at 03:54:31PM -0400, markw at mohawksoft.com wrote:
> (Let me preface this with some background: I've written a few
> format-specific compressors and have implemented a few standard
> compressors.)

I shudder to think about them.

> Let's talk about *what* compression is. Compression is representing a
> theoretically larger range of data with a smaller range. Compression is
> removing "knowledge" about the data from the data, leaving only the
> "interesting" part that is needed to reconstitute the data based on
> knowledge.

Bzzt, wrong.

You've just described "lossy compression". Lossy compression
throws out information. The file which is reconstructed is NOT
the same as the original file; it is an approximation which in
some domains (notably audio and video) may be considered "good
enough".

Lossless compression does not destroy information. Lossless
compression is finding a more compact notation to express
exactly the same information. It removes redundancies and
replaces them with a description of how to reconstruct them
exactly.
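
To make the distinction concrete, here's a toy lossless coder --
run-length encoding, chosen purely for illustration -- where decoding
gives back the input byte-for-byte, every time:

    # Toy run-length coder: lossless, because decode(encode(x)) == x for all x.
    def rle_encode(data):
        runs = []
        for b in data:
            if runs and runs[-1][0] == b:
                runs[-1][1] += 1
            else:
                runs.append([b, 1])
        return runs

    def rle_decode(runs):
        return b"".join(bytes([b]) * n for b, n in runs)

    assert rle_decode(rle_encode(b"aaaabbbcc")) == b"aaaabbbcc"

A lossy codec makes no such promise; the best it can offer is "close
enough" under some perceptual measure.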

> The idea, then, is to take knowledge of a numerical series or knowledge
> about data. The theoretical number range may be X but in reality there is
> only Y variation, and if Y < X, the data should be compressible.

You have now taken an example of lossless and an example of
lossy and announced that lossy is always acceptable. It isn't.
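
When Y really is smaller than X, that's ordinary lossless coding of a
low-entropy source, and it works exactly once. A rough sketch (the
function and sample data are made up for illustration): if the stream
only ever uses a handful of byte values, each symbol can be packed
into fewer than 8 bits -- plus the cost of shipping the table:

    import math

    # Pack a stream that uses a small alphabet into ceil(log2(|alphabet|))
    # bits per symbol. The alphabet itself must also be transmitted.
    def pack_small_alphabet(data):
        alphabet = sorted(set(data))
        bits = max(1, math.ceil(math.log2(len(alphabet))))
        index = {b: i for i, b in enumerate(alphabet)}
        packed = 0
        for b in data:
            packed = (packed << bits) | index[b]
        nbytes = (bits * len(data) + 7) // 8
        return alphabet, bits, packed.to_bytes(nbytes, "big")

    alphabet, bits, out = pack_small_alphabet(b"ABABCDCDABAB")
    print(bits, len(out))   # 2 bits/symbol, 3 output bytes for 12 input bytes

After one pass the output uses essentially the whole byte range, so
the same trick has nothing left to exploit -- it does not iterate.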

> When you look at a ZIP compressed file, it does not compress well using
> standard stream compressors. This is because there are few if any
> repeating patterns. The fact that it has few repeating patterns makes this
> stream interesting. It means that it is less likely in a universe of
> streams. If it is less likely, then it can be represented with fewer bits.

OK, you've defined "information" the way Shannon did. Big deal.
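
For anyone following along at home, Shannon's measure is easy to
compute; a quick sketch of the standard formula, nothing specific to
this thread:

    import math
    from collections import Counter

    # Empirical entropy in bits per byte: H = -sum(p_i * log2(p_i)).
    def entropy_bits_per_byte(data):
        n = len(data)
        return -sum((c / n) * math.log2(c / n) for c in Counter(data).values())

    print(entropy_bits_per_byte(b"aaaaaaab"))        # ~0.54: very compressible
    print(entropy_bits_per_byte(bytes(range(256))))  # 8.0: looks random

The payload of a ZIP file sits up near 8 bits per byte, which is
exactly why a second pass has nothing to work with.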

> Standard compressors work on the notion that repeated patterns can be
> saved once and re-used when they show up again. LZW actually uses more
> bits per byte for each byte, but many fewer bits for strings of repeated
> bytes.

Erm... kind of. LZW uses a dictionary-and-quoting method.
Repeated byte patterns are stored in the dictionary, taking up
slightly more space than a single copy would, and then a symbol
for the pattern -- a reference to the dictionary -- is inserted.
But LZW is smart enough to simply store quoted information --
non-referenceable strings -- as themselves. Most of the
improvements in this system have come from adjustments in
dictionary hashing, look-ahead window size and sub-pattern
matching.
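
For the curious, the core of the algorithm fits on one screen. This is
a bare-bones sketch of classic LZW (plain integer codes, no bit
packing, no dictionary reset), not any particular implementation:

    # Minimal LZW encoder: the dictionary starts with all 256 single bytes
    # and grows by one entry for every code emitted. The decoder can rebuild
    # the same dictionary from the code stream as it goes.
    def lzw_encode(data):
        dictionary = {bytes([i]): i for i in range(256)}
        next_code = 256
        w = b""
        codes = []
        for byte in data:
            wc = w + bytes([byte])
            if wc in dictionary:
                w = wc
            else:
                codes.append(dictionary[w])
                dictionary[wc] = next_code
                next_code += 1
                w = bytes([byte])
        if w:
            codes.append(dictionary[w])
        return codes

    print(lzw_encode(b"TOBEORNOTTOBEORTOBEORNOT"))

On repetitive input the code stream is much shorter than the input; on
input with no repeats it degenerates to roughly one code per byte,
which is the expansion described above.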

> A "re-compressor" is different: rather than "repeated" bytes, it would use
> information about special sequences of data that have little or no
> repetition and few recognizable patterns.

Where does this information come from? 

> This may be hugely expensive to
> compute and offer only a compressibility of a few percent; however, if it
> is repeatable,

Then it didn't work.

> then it can iterate and iterate until the overhead for the
> compression is larger than the data.

> I'm not saying it is practical, workable, or really possible, I'm just
> saying that, in theory, it may be possible.

Can I accurately rephrase that as "I'm not saying that it is
possible, I'm saying that it may be possible."? 

OK, you and Jules are cranks. Go read the sci.compression FAQ a
few more times, and leave us alone.

-dsr-


