KDS transformation to improve compression ration
KDS - transformation stands for
Keep Dominant
Symbols - transformation. The advantage of this method is that
those statistical properties of the input stream that can be
efficiently used by any compression algorithm are not dramatically
changed.
Consider the following procedure to build a Huffman code for an
alphabet
V, where
pa is the probability of appearance of
symbol
a in an input source:
- if |V|=2 then V={a,b} and code(a)=0
and code(b)=1.
- if |V|>2 then select from V two symbols with
smallest probabilities (call them a and b).
Then code(a)=code(#)0, code(b)=code(#)1,
where # is a new symbol with p# = pa + pb,
and
code(#) is computed applying the same procedure with the alphabet
V \ {a,b}
{#}.
The same is the essence of this transformation, to replace symbols
with low probabilities with a new symbol.
Further we present formally this transformation.
Consider
A is an alphabet and
w is a word over the alphabet
A. Consider
A = R
D with
|
R| = |
D|, where
R is a subset
of the alphabet
A containing the "rare" symbols and
D a subset
containing the "dominant" symbols in the word
w.
Is is easy to see that there is an unique decomposition of
w:
w = r1 d1 r2 d2 ... rk dk,
where:
- ri
R+, 1 < i
k ;
- r1
R*;
- di
D+, 1
i < k ;
- dk
D*.
The first step of our transformation is rewriting
w in
w',
where:
w'
= #|r1| d1
#|r2| d2 ...
#|rk| dk ,
where # is a new symbol, #
A.
Consider now a bijective mapping:
f:
R
D.
In the second
step of the transformation the string
w'' is obtained in the
following way:
w'' = f(r1) f(r2) ... f(rk) # w'.
In the final step the new symbol # and an explicit
representation of the mapping
f are added at the beginning of
string
w''.
This way the string
w''' is obtained:
w'''= # ri1 f(ri1)...
ri|R| f(ri|R|) # w'',
where
R = {
ri1 ...
ri|R| }.
The reverse transformation works in the following way: given a
string
w'''(obtained as above) consider the first symbol of
w''', as the separator #. Consider now the following
factorization of
w''':
w'''= # w1 # w2 # w3,
where
w1 
{
R D }
|R|,
w2
R,
w3
D 
{ # }.
It is easy to observe that there is an unique such a factorization of
w'''.
A Case Study
We present here the effect of this transformation
combined with some traditional compression methods.
The length (in bits) of the Lempel-Ziv code of a word
w
V* is:
where
b is the number of blocks resulting from LZ-parsing and
k is the size of the alphabet of the string
w.
Consider now the case of the alphabet
A where |
A|=2
p,
p > 1.
Let
u and
v be two words,
u
A* and
v
(
A 
{ # } )
*
,
v is obtained form
u using the above
transformation.
which telescope into:
where
b is the number of blocks resulting from LZ parsing of the
string
u.
where
b' is the number of blocks resulting from LZ parsing of
the string
v.
If

compression is improved by
transformation presented above.
An implementation of this algorithms improve compression from 5\%
up to 10\% in combination with compression tools like
gzip. In the following table some experimental results
are presented:
Original size | gzip | KDS+gzip |
104448 | 19926 | 19756 |
263680 | 68084 | 68527 |
829440 | 211659 | 202124 |
913755 | 293157 | 287293 |
1691136 | 424372 | 407384 |
2156032 | 511538 | 489263 |
3259904 | 597233 | 579094 |
5751296 | 1321138 | 1278913 |
6125056 | 1168909 | 1126271 |
28844032 | 6409980 | 6228254 |
The first column contains the size (in bytes) of
original files. Test were performed using Java bytecode files. The
second file contains the size of the files compressed using
gzip. Finally, the last column contains files encoded
first using KDS transformation and then by
gzip.
Download:KDS transformation implementation in Visual C++.