KDS transformation to improve compression ration


KDS - transformation stands for Keep Dominant Symbols - transformation. The advantage of this method is that those statistical properties of the input stream that can be efficiently used by any compression algorithm are not dramatically changed.
Consider the following procedure to build a Huffman code for an alphabet V, where pa is the probability of appearance of symbol a in an input source:
  • if |V|=2 then V={a,b} and code(a)=0 and code(b)=1.
  • if |V|>2 then select from V two symbols with smallest probabilities (call them a and b). Then code(a)=code(#)0, code(b)=code(#)1, where # is a new symbol with p# = pa + pb, and code(#) is computed applying the same procedure with the alphabet V \ {a,b} {#}.

The same is the essence of this transformation, to replace symbols with low probabilities with a new symbol.
Further we present formally this transformation.

Consider A is an alphabet and w is a word over the alphabet A. Consider A = R D with |R| = |D|, where R is a subset of the alphabet A containing the "rare" symbols and D a subset containing the "dominant" symbols in the word w.

Is is easy to see that there is an unique decomposition of w:
w = r1 d1 r2 d2 ... rk dk,

where:
  • ri R+, 1 < i k ;
  • r1 R*;
  • di D+, 1 i < k ;
  • dk D*.

The first step of our transformation is rewriting w in w', where:
w' = #|r1| d1 #|r2| d2 ... #|rk| dk ,

where # is a new symbol, # A.

Consider now a bijective mapping:  f:R D. In the second step of the transformation the string w'' is obtained in the following way:
w'' = f(r1) f(r2) ... f(rk) # w'.


In the final step the new symbol # and an explicit representation of the mapping f are added at the beginning of string w''. This way the string w''' is obtained:
w'''= # ri1 f(ri1)... ri|R| f(ri|R|) # w'',

where R = { ri1 ... ri|R| }.

The reverse transformation works in the following way: given a string w'''(obtained as above) consider the first symbol of w''', as the separator #. Consider now the following factorization of w''':
w'''= # w1 # w2 # w3,

where w1 { R D }|R|, w2 R, w3 D { # }. It is easy to observe that there is an unique such a factorization of w'''.

A Case Study

We present here the effect of this transformation combined with some traditional compression methods.

The length (in bits) of the Lempel-Ziv code of a word w V* is:

where b is the number of blocks resulting from LZ-parsing and k is the size of the alphabet of the string w.

Consider now the case of the alphabet A where |A|=2p, p > 1. Let u and v be two words, u A* and v (A { # } )* , v is obtained form u using the above transformation.

which telescope into:

where b is the number of blocks resulting from LZ parsing of the string u.



where b' is the number of blocks resulting from LZ parsing of the string v.

If compression is improved by transformation presented above.

An implementation of this algorithms improve compression from 5\% up to 10\% in combination with compression tools like gzip. In the following table some experimental results are presented:
Original sizegzipKDS+gzip
104448 19926 19756
263680 68084 68527
829440 211659 202124
913755 293157 287293
1691136 424372 407384
2156032 511538 489263
3259904 597233 579094
5751296 13211381278913
6125056 11689091126271
2884403264099806228254
The first column contains the size (in bytes) of original files. Test were performed using Java bytecode files. The second file contains the size of the files compressed using gzip. Finally, the last column contains files encoded first using KDS transformation and then by gzip.

Download:KDS transformation implementation in Visual C++.