----
LZRW4: ZIV AND LEMPEL MEET MARKOV
=================================
Author : Ross Williams
Date   : 23-Jul-1991.

This is a public domain document and may be copied freely so long as it is
not modified (text formatting transformations are permitted, however).

Note: This paper has not been polished to publication standard. It has been
prepared under considerable time pressure and is released mainly to make the
ideas it contains public, so as to avoid others re-inventing and patenting
them.

Abstract
--------
In 1977 and 1978, Ziv and Lempel [Ziv77][Ziv78] introduced the LZ class of
adaptive dictionary data compression techniques that have become so very
popular today. In competition with these techniques are the Markov
algorithms, which give better compression but run much more slowly. This
paper compares the two classes of algorithm and shows how algorithms that
combine techniques from both classes have the potential to provide higher
performance (i.e. a better speed/compression trade-off) than we have seen so
far. An algorithm called LZRW4 that combines Markov and LZ77 ideas is
sketched.

Comparison of Ziv/Lempel and Markov Algorithms
----------------------------------------------
To the casual observer, the Ziv/Lempel (LZ) data compression algorithms may
appear to be worlds away from the lesser known Markov algorithms. LZ
algorithms construct trees of strings and transmit strings as codewords.
Markov algorithms construct large numbers of contexts and fiddle with
arithmetic ranges and probabilities. The two kinds of algorithm seem to
operate on totally different objects. However, deep down, both classes of
algorithm are using the same probabilistic engine. Deep down, both classes
must be representing more common events with short codes and less common
events with longer codes. This is the law of data compression. But, without
some careful thought, it's hard to see how the two classes relate.

In the early eighties, Glen Langdon put in some careful thought and a lot of
insight and came up with two papers, [Langdon83] and [Langdon84], which have
been pivotal in my understanding of the field of data compression. I cannot
recommend these papers highly enough to anyone who wishes to gain an
understanding of what REALLY makes LZW tick, or to anyone who wants a broader
perspective on the field. The papers seem highly focused and obscure, but in
fact they cast a broad soft light over the whole field of text compression,
illuminating areas that for me were very dark (but might not be so dark for
people who read IEEE Transactions on Information Theory like Dickens). The
two papers have to be read carefully though. An earlier paper [Rissanen81] by
Rissanen and Langdon gives an even more general view, but does not deal
directly with LZ algorithms as the later papers do, and so I have focussed on
the later papers.

The idea behind the two papers is that all dictionary algorithms, including
the LZ classes (LZ77 and LZ78), are subsets of (i.e. can be emulated by)
Markov algorithms, but the converse is not true. Langdon shows us a clear
view of LZ algorithms through Markov eyes. This view is highly illuminating.
In [Langdon83], Langdon examines the LZ78 class of algorithm closely and
explains the underlying engine in probabilistic terms. In [Langdon84], he has
a more thorough byte at the problem, formalizing the whole case and proving
that dictionary algorithms can always be emulated by Markov algorithms, but
not the converse. I will attempt to explain the idea of [Langdon83] briefly.
I will review LZW but not give a detailed explanation of how it works.
LZ78/LZW builds its dictionary by parsing the input into phrases, each
consisting of the longest dictionary match to date. At the end of each phrase
parse, the algorithm adds a new phrase to the dictionary consisting of the
current phrase plus the next input character. The dictionary grows forever,
adding one new phrase per phrase parsed.

The dictionary may be viewed in many ways. We choose to view it as a
(forwards) trie structure whose leaves are to the right and whose root is at
the left. The following diagram illustrates such a tree. Each arc carries a
single character label. Each node represents a single string. In the diagram,
arcs are labelled in lower case and nodes in upper case.

              AA
             /
           a/
          A
        a/ \b
        /    \
    Root      AB
        b\
          \
           B

All dictionaries can be viewed in this way. In LZW, each node is allocated a
code value. Suppose we have two-bit codewords and the codes are allocated as
follows:

   0: A
   1: B
   2: AA
   3: AB

This code allocation implies that each event has an equal probability of one
quarter. We can thus label our tree with node occurrence probabilities.

              AA 1/4
             /
           a/
          A 1/4
        a/ \b
        /    \
    Root      AB 1/4
        b\
          \
           B 1/4

These node probabilities can then be converted into probabilities for the
ARCS:

              AA
             /
          a/ 1/3
          A
   3/4 a/ \b 1/3
        /    \
    Root      AB
        b\
      1/4 \
           B

(The 1/3 and 1/3 leaving node A don't add up to one because the A node itself
(which represents an imaginary arc labelled by the empty string) has the
remaining 1/3.)

These probabilities are exactly the sort of fodder that we need to feed a
Markov algorithm. From the above, it can be seen that the act of LZW parsing
the longest match and assigning a new codeword can be viewed as the execution
of a series of single-byte prediction decisions, each with its own separate
probabilities and each capable of being coded independently, and with
identical efficiency, using arithmetic coding. You can't see this normally
because LZW does the parse all in one go and outputs (for each phrase) a
codeword which looks, for all the world, to be indivisible. But deep down, we
all know that it isn't. Instead, it's made up of little pieces of entropy
accumulated during the traversal and cleverly bundled into 16-bit codewords
at the last minute.

So far, we have converted our LZW tree to a Markov-like prediction tree. Now
let's analyse it. Consider a single node in the LZW tree. Associated with the
node is a string (the characters on the arcs leading from the root to the
node). Each time that the parse reaches the node, it means that the node's
string has occurred starting at a phrase boundary. LZW then goes on further
to the node's child nodes and eventually reaches a leaf, where it adds a new
node AND ALLOCATES A NEW CODEWORD. The result is that each node accumulates a
subtree whose cardinality (= number of subnodes including itself, which in
this paper = number of codewords allocated to the subtree) is proportional to
the probability of the node's string arising as a prefix of a parse. As the
number of subnodes is also the node's share of the probability space (because
there is one codeword allocated to each node), we see that the amount of code
space allocated to a node is proportional to the number of times it has
occurred. Very neat. All this stuff is in [Langdon83], which quantifies it.
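To make the trie view concrete, here is a toy program (purely illustrative:
the identifiers, array sizes, and sample string are invented for this sketch,
and it is not a real LZW coder) that parses a string exactly as described
above, with one codeword per trie node:

/* toy_lzw.c -- illustrative sketch of LZ78/LZW parsing as a trie walk. */
#include <stdio.h>
#include <string.h>

#define ALPHABET 256
#define MAXNODES 4096                  /* one codeword per trie node      */

static int child[MAXNODES][ALPHABET];  /* arcs: -1 means "no such arc"    */
static int nodes = 0;                  /* codewords allocated so far      */

static int new_node(void)
{
    int i;
    for (i = 0; i < ALPHABET; i++) child[nodes][i] = -1;
    return nodes++;
}

int main(void)
{
    const unsigned char *in = (const unsigned char *)"abababababaabab";
    int len = (int)strlen((const char *)in);
    int i, pos = 0, root = new_node();

    /* Seed the dictionary with the single-character strings. */
    for (i = 0; i < ALPHABET; i++)
        child[root][i] = new_node();

    while (pos < len) {
        int n = root, start = pos;

        /* Walk the trie: each arc taken is one single-byte "prediction"
           made in the context of the phrase parsed so far.              */
        while (pos < len && child[n][in[pos]] != -1)
            n = child[n][in[pos++]];

        printf("phrase \"%.*s\" -> codeword %d\n",
               (int)(pos - start), (const char *)(in + start), n);

        /* Current phrase + next character gets a new codeword, so a
           node's subtree size ends up proportional to the number of
           times its string has started a phrase.                        */
        if (pos < len && nodes < MAXNODES)
            child[n][in[pos]] = new_node();
    }
    return 0;
}

Each step of the inner walk is one of the single-byte prediction decisions
described above; LZW merely defers the coding of all of them until the
codeword is emitted at the end of the phrase.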
This analysis explains why LZW compresses so well. Why, then, does it
compress so badly (compared to Markov models)? Without going into details,
here are some reasons:

* Approximately half of the code space is wasted at any point in time in
  unallocated codewords.

* Probability estimates are inaccurate near the leaves.

Other such reasons eat away at the compression. In particular, there is one
inefficiency of LZW-type compression that stands out. In our analysis, each
LZW parse of length d consists of d single-character predictions. Each
prediction is made in the context of the previous predictions. Thus, after
having parsed part of a phrase "ab", the LZW algorithm has two characters of
Markov context with which to "predict" the next character. After having
parsed "compressi", it would have nine characters of context. Thus we see
that, to the extent that we can view LZ78/LZW algorithms as making
predictions, they make predictions using a sawtooth context length.

Order
  ^
 4|        *
 3|      *         *             *
 2|    *         *       *     *           *
 1|  *         *       *     *       *   *
 0|*         *       *     *       *   *
  +---------|-------|-----|-------|---|-----|----> Characters
    phrase1   phr2   phr3   phr4  phr5  phr6

In contrast, Markov algorithms make predictions using a fairly flat (or
sometimes totally flat) context line. Thus, while a Markov algorithm may make
all of its predictions based on a 3-character context, an LZW algorithm will
make only some of its predictions from that depth. The thing that really
kills LZW's compression is the first one or two characters of each phrase:
the first is predicted with no context at all, and the second with a context
of only a single character. After that, the context is sufficient not to
introduce too much inefficiency. See Figure 64 of [Williams91] to get an idea
of how much impact this has on compression.

To compensate for its poor performance in the first one or two characters,
LZW grows an enormous tree. This has the effect of increasing the average
length of a phrase so that the beginnings of phrases occur less often. This
is like Pollyanna adding extra days to the week to avoid Sundays. As the
length of the input increases to infinity, so does the average phrase length,
with the startling result that by the time you get to infinity, LZW has
converged on the entropy of the source (as Ziv and Lempel proved in their
founding paper [Ziv78]). The rate of convergence depends on the average
phrase length, which depends on the entropy of the source. In practice, for
ordinary sources (e.g. text files), the average phrase length does not rise
fast enough to properly offset the effect of the low-context-length
predictions at the start of each phrase.

An additional reason for the relatively poor compression performance of LZW
can be found in its phrasing. Markov algorithms predict each character of the
input one at a time and record contextual information at each step. In
contrast, LZW divides the input up into phrases, each of which causes a
single parse through its tree. Thus, the "zero-order" model that LZW builds
up for the first character of each phrase is based only on the first
character of all the previous phrases it has seen, whereas a zero-order
Markov model builds its model based on ALL the characters it has seen. This
applies at all orders.

To summarize:

* All dictionary algorithms can be cast into Markov form.

* The compression grunt (order) that LZW applies is in the shape of an
  irregular sawtooth (quantified in the sketch below).

* LZW collects statistics for each context only on phrase boundaries.

* A little thought reveals that all this applies to the LZ77 class as well.
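The sawtooth can be put into rough numbers. A phrase of length d makes d
predictions at orders 0, 1, ..., d-1, so its mean prediction order is only
(d-1)/2. The following throwaway program (illustrative only; the phrase
lengths are the invented ones from the diagram above) tallies the prediction
orders for a list of phrase lengths:

/* sawtooth.c -- tally the context order of each prediction made during an
   LZW-style parse, for a made-up list of phrase lengths.                  */
#include <stdio.h>

int main(void)
{
    int phrase_len[] = {5, 4, 3, 4, 2, 3};      /* invented, as in the figure */
    int nphrases = sizeof(phrase_len) / sizeof(phrase_len[0]);
    int i, order, chars = 0, low_order = 0;
    long order_sum = 0;

    for (i = 0; i < nphrases; i++)
        for (order = 0; order < phrase_len[i]; order++) {
            order_sum += order;                 /* order climbs 0..d-1 */
            if (order < 2) low_order++;         /* the damaging predictions */
            chars++;
        }

    printf("characters coded          = %d\n", chars);
    printf("mean prediction order     = %.2f (a flat order-3 line gives 3.00)\n",
           (double)order_sum / chars);
    printf("order 0 or 1 predictions  = %d of %d\n", low_order, chars);
    return 0;
}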
A brief discussion of all this can be found in chapter one of my book
[Williams91], in particular Section 1.13, "Dictionary Methods vs Markov
Methods". However, for a full treatment, the reader is referred to
[Langdon83] and [Langdon84].

Ziv and Lempel meet Markov
--------------------------
Currently, the field of text data compression has two separate classes of
method: statistical (Markov) and dictionary (Ziv and Lempel). Markov
algorithms inch through the input character by character, recording and using
context information at each step. This yields the best compression, but makes
Markov algorithms slow. Ziv and Lempel algorithms gallop through the input
one phrase at a time, losing lots of non-phrase-aligned context information
and using zero-order models at the start of each phrase. This yields higher
speed but poorer compression.

Obviously it would be highly desirable to combine the speed of the Ziv and
Lempel algorithms with the compression power of the Markov algorithms. As
time goes by this seems less and less likely (without special (e.g. parallel)
hardware). However, there does seem to be one way of combining the two
classes of algorithm, and to my surprise I have not yet seen it used.

The idea is to create a group of contexts, but to parse phrases from each
context. For example, we might create 256 one-character contexts and grow an
LZW tree in each of them (sketched in code after the list of ideas below).
Compression would proceed exactly as with LZW except that the most recent
character transmitted (the last character of the previous phrase) would be
used to select one of the 256 contexts, each of which would contain an LZW
tree which would then be used to parse the phrase. The result would be to
take the low-order edge off LZW without losing any of its speed. Each
character of the phrase would be "predicted" using a model one order higher
than would have been used in LZW. At high orders this would have a negligible
effect, but at low orders it would have a significant effect, for the first
character of each phrase, formerly predicted by a zero-order model, would now
be predicted by a first-order model. Each phrase would be coded using only
the range required by the size of the tree in its context. By using contexts
at the start of the LZW tree, we raise the bottom of the sawtooth a few
notches.

Order
  ^
 6|        *
 5|      *         *             *
 4|    *         *       *     *           *
 3|  *         *       *     *       *   *
 2|*         *       *     *       *   *
 1|
 0|
  +---------|-------|-----|-------|---|-----|----> Characters
    phrase1   phr2   phr3   phr4  phr5  phr6

This idea is not at all specific to LZW, but could be applied in dozens of
ways to many different fast parsing adaptive algorithms. In fact my example
algorithm, LZRW4 (below), uses an LZ77-like algorithm. Here are some ideas:

* Use a two-byte context to hash into a table of LZW trees.

* Grow backwards context trees as in DHPC, but grow LZW parsing trees out of
  each node.

* Somehow use a single LZW tree BUT START PARSING ONE OR MORE CHARACTERS
  BACK! This avoids having two different kinds of data structure.

* Set up a set of contexts and bolt an LZ77 compressor on the back. Each
  context stores a list of pointers to strings that occurred in that context
  (see LZRW4 below).

* Use contexts with any other dynamic dictionary scheme.

* Use a rolling hash function in hardware to avoid hash delays at the start
  of phrases.

...plus combinations and permutations of the above.
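As a concrete sketch of the simplest of these combinations, the following toy
program grows 256 LZW-style tries, one per one-character context, and selects
the trie for each phrase using the last character coded. It is illustrative
only: it is not LZRW4, the identifiers and sizes are invented, and the coding
of literals is glossed over.

/* lzw_contexts.c -- illustrative sketch of parsing phrases from contexts. */
#include <stdio.h>
#include <string.h>

#define ALPHABET 256
#define MAXNODES 8192

static int child[MAXNODES][ALPHABET];   /* trie arcs: -1 means absent       */
static int nodes = 0;
static int root_of[ALPHABET];           /* one trie root per context        */

static int new_node(void)
{
    int i;
    for (i = 0; i < ALPHABET; i++) child[nodes][i] = -1;
    return nodes++;
}

int main(void)
{
    const unsigned char *in =
        (const unsigned char *)"the theme of the thesis is the theory";
    int len = (int)strlen((const char *)in);
    int c, pos = 0, context = 0;        /* context = last character coded   */

    for (c = 0; c < ALPHABET; c++)
        root_of[c] = new_node();        /* every context starts empty       */

    while (pos < len) {
        int n = root_of[context], start = pos;

        /* Longest match, but only within the selected context's trie. */
        while (pos < len && child[n][in[pos]] != -1)
            n = child[n][in[pos++]];

        /* As in plain LZW, matched string + next character becomes a new
           node; the unmatched character ends the phrase (a real coder
           would transmit it as a literal).                               */
        if (pos < len) {
            if (nodes < MAXNODES)
                child[n][in[pos]] = new_node();
            pos++;
        }

        printf("context '%c' -> phrase \"%.*s\"\n",
               (context >= 32 && context < 127) ? context : '.',
               (int)(pos - start), (const char *)(in + start));

        context = in[pos - 1];          /* selects the trie for the next phrase */
    }
    return 0;
}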
There are two drawbacks to all this. First, we are up for a lot more memory.
Second, we run up against the context-depth/sample-size trade-off encountered
by Markov algorithms: the deeper the contexts we establish, the more specific
they become, but the smaller the amount of information that each context has
on which to base its mini-model and hence its predictions. The first problem
can be overcome with more memory. The second problem requires careful tuning.

LZRW4
-----
To test out the idea, I devised LZRW4. I didn't implement LZRW4, but I did
write a program to measure the compression that it would yield. This program
is provided in Appendix A on an "as is" basis.

Note: The name LZRW4 is used here to describe an algorithm. However, the
eventual official code release of LZRW4 may differ in some details from the
algorithm presented here.

With a huge range of LZ algorithms from which to choose, I chose the LZ77
class, with which I am familiar. This avoided patent problems.

LZRW4 uses a hash table of 4096 partitions, each containing 32 pointers into
the input, which is held in a 1 Meg buffer. At the start of each phrase,
LZRW4 hashes the previous two characters to obtain a partition number in the
range 0..4095. The 32 pointers in that partition are searched for the longest
match, and the length (3 bits, representing a length in [2,3,4,5,6,7,8,16])
and the pointer number (5 bits, representing [0,...,31]) are coded and
transmitted. LZRW4 uses a single control bit for each phrase to say whether a
literal byte or a copy item is coming next. These control bits are buffered
as in the other LZRW* algorithms. The equal length of literal and copy items
(nine bits each, including the control bit) is an interesting feature of the
algorithm which seems bound to be useful to someone someday!

To update the partition, LZRW4 swaps the winning pointer with the pointer
halfway towards the front of the partition. If the winning pointer matched
fewer than 4 bytes, then a pointer to the current position (the start of the
Ziv) is shifted into the partition. See the code in Appendix A for the
details. An understanding of the LZRW3-A algorithm may also assist.
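To make the coding step concrete, here is a simplified sketch of the inner
loop just described. It is NOT the official LZRW4 release (which may differ
in detail), it measures nothing useful, and the hash function, identifiers,
and sample buffer are invented for the illustration; the measurement program
proper appears in the appendix.

/* lzrw4_step.c -- simplified, illustrative sketch of one LZRW4 coding step. */
#include <stdio.h>

#define HASHLENGTH    4096              /* partitions                        */
#define HASHDEPTH     32                /* pointers per partition            */
#define MAXMATCH      16
#define MEDMATCH      8                 /* codable lengths: 2..8 and 16      */
#define NOSHIFTTHRESH 4                 /* match this long: don't shift in   */

static const unsigned char *ptrs[HASHLENGTH][HASHDEPTH];

static unsigned hash2(const unsigned char *p)
{
    /* Illustrative two-byte hash; the real algorithm's hash may differ. */
    return (unsigned)((((p[0] << 8) ^ p[1]) * 40543UL) >> 4) & (HASHLENGTH - 1);
}

int main(void)
{
    static const unsigned char buf[] =
        "the cat sat on the mat and the cat sat on the hat";
    const unsigned char *p   = buf + 2;            /* first 2 bytes = context */
    const unsigned char *end = buf + sizeof(buf) - 1;
    unsigned long bits_out = 2 * 9;                /* first 2 bytes charged as literals */
    unsigned i, j;

    for (i = 0; i < HASHLENGTH; i++)               /* point every slot somewhere */
        for (j = 0; j < HASHDEPTH; j++)
            ptrs[i][j] = buf;

    while (p < end) {
        unsigned best_len = 0, best_slot = 0;
        const unsigned char **part = ptrs[hash2(p - 2)];

        /* Longest match among the 32 remembered positions for this context. */
        for (j = 0; j < HASHDEPTH; j++) {
            unsigned len = 0;
            while (len < MAXMATCH && p + len < end && part[j][len] == p[len])
                len++;
            if (len > best_len) { best_len = len; best_slot = j; }
        }
        if (best_len > MEDMATCH && best_len < MAXMATCH)
            best_len = MEDMATCH;                   /* only 2..8 and 16 are codable */

        if (best_len >= 2)
            bits_out += 1 + 3 + 5;                 /* copy: control + length + slot */
        else {
            best_len = 1;
            bits_out += 1 + 8;                     /* literal: control + byte */
        }

        /* Partition update: promote the winner halfway to the front; shift a
           pointer to the current position in if the match was short.         */
        if (best_len >= 2 && best_slot > 1) {
            const unsigned char *t = part[best_slot / 2];
            part[best_slot / 2]    = part[best_slot];
            part[best_slot]        = t;
        }
        if (best_len < NOSHIFTTHRESH) {
            for (j = HASHDEPTH - 1; j > 0; j--) part[j] = part[j - 1];
            part[0] = p;
        }
        p += best_len;
    }
    printf("output would be about %lu bits (%lu bytes)\n", bits_out, bits_out / 8);
    return 0;
}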
The results of LZRW4 are not startling, but they demonstrate that the idea
works. By using contexts, LZRW4 gets away with using 3-bit lengths and 5-bit
"indexes", whereas its ancestor algorithm LZRW3-A uses 4-bit lengths and
12-bit "indexes", yielding comparable compression. Here are some rough
results on some Calgary corpus files (percentage remaining):

             LZRW4   LZRW3-A   Unix Compress
   bib        39.5      44.1       41.8
   book1      51.4      54.7       43.2
   book2      41.2      45.5       41.1
   obj1       65.5      58.8       65.3
   obj2       43.4      43.8       52.1
   paper1     46.5      46.1       47.2
   progc      46.4      45.2       48.3
   trans      29.1      32.3       40.8

I have experimented with lots of variants of LZRW4 but have not formalized
them nor written them up. One particularly interesting idea is to subdivide
each partition into subpartitions based on the first character of the phrase.
LZRW3-A partition management techniques can then be used.

Closing Thoughts
----------------
I don't know whether this idea is original or whether it will yield any real
benefit. However, I have not heard of any Ziv/Lempel/Markov hybrids to date,
and the experiments above indicate at least that the idea works, even if they
do not establish a competitive advantage. With the right tuning, the use of
contexts could provide extra compression with little impact on speed (but
usually a lot of impact on memory!). At the very least, it is another weapon
in the data compression armoury.

References
----------
[Langdon83] Langdon G.G., "A Note on the Ziv-Lempel Model for Compressing
   Individual Sequences", IEEE Transactions on Information Theory, Vol. 29,
   No. 2, pp. 284-287, 1983.

[Langdon84] Langdon G.G., "On Parsing Versus Mixed-Order Model Structures for
   Data Compression", IBM Research Report RJ-4163 (46091) 1/18/84, IBM
   Research Laboratory, San Jose, CA 95193, 1984.

[Rissanen81] Rissanen J.J., Langdon G.G., "Universal Modeling and Coding",
   IEEE Transactions on Information Theory, Vol. 27, No. 1, pp. 12-23, 1981.

[Williams91] Williams R.N., "Adaptive Data Compression", Kluwer Academic
   Publishers, 101 Philip Drive, Assinippi Park, Norwell, Massachusetts
   02061, 1991.

[Ziv77] Ziv J., Lempel A., "A Universal Algorithm for Sequential Data
   Compression", IEEE Transactions on Information Theory, Vol. 23, No. 3,
   pp. 337-343, 1977.

[Ziv78] Ziv J., Lempel A., "Compression of Individual Sequences via
   Variable-Rate Coding", IEEE Transactions on Information Theory, Vol. 24,
   No. 5, pp. 530-536, 1978.

Appendix: LZRW4: Experimental Code
----------------------------------
The following code is public domain and is provided "as is". Like most
experimental code, it is poorly designed, structured, coded, and commented.

#include <stdio.h>
#include <stdlib.h>     /* exit()                               */
#include <fcntl.h>      /* open(), O_RDONLY                     */
#include <unistd.h>     /* read(), close()                      */
#include "port.h"       /* Defines UBYTE, ULONG, TRUE and so on. */

#define MY_BUFFER_SIZE (1024*1024)
#define HASHLENGTH     4096
#define HASHDEPTH      32
#define CONTEXT        2
#define MEDMATCH       8  /* Expressible lengths are [2..MEDMATCH,MAXMATCH]. */
#define MAXMATCH       16
#define NOSHIFTTHRESH  4  /* This or longer match and we don't shift.        */

void error(char *mess);

int main(int argc, char **argv)
{
 typedef UBYTE *queue[HASHDEPTH];  /* Element zero is most recent element. */
 static UBYTE buf[MY_BUFFER_SIZE+2000];
 static queue hash[HASHLENGTH];
 char *input_file_name;
 int infile;
 ULONG bits_in  = 0;
 ULONG bits_out = 0;

 if (argc != 2)
    {printf("To run lzcontext alg: x <filename>\n"); exit(0);}
 printf("Welcome to LZCONTEXT\n");
 printf("--------------------\n");
 printf("Buffer size        = %u\n",MY_BUFFER_SIZE);
 printf("Hash table length  = %u\n",HASHLENGTH   );
 printf("Hash table depth   = %u\n",HASHDEPTH    );
 printf("Context            = %u\n",CONTEXT      );
 printf("Med match          = %u\n",MEDMATCH     );
 printf("Max match          = %u\n",MAXMATCH     );
 printf("No shift threshold = %u\n",NOSHIFTTHRESH);

 input_file_name=argv[1];
 infile=open(input_file_name,O_RDONLY);  /* O_BINARY */
 if (infile < 0) error("Cannot open input file.");

 while (TRUE)
 {
  ULONG i,j,k;
  UBYTE *p_next_byte;
  ULONG bytes_in;

  /* Read in a block of bytes to compress. */
  bytes_in=read(infile,buf,(unsigned int) MY_BUFFER_SIZE);
  if (bytes_in == 0) break;
  bits_in+=8*bytes_in;

  /* Initialize the hash table so all entries point to a defined string. */
  for (i=0;i<HASHLENGTH;i++)
     for (j=0;j<HASHDEPTH;j++)
        hash[i][j]=buf;

  /* The first CONTEXT bytes are charged as literals so that there is
     always a two-character context to hash.                           */
  p_next_byte=buf+CONTEXT;
  bits_out+=9*CONTEXT;

  while (p_next_byte < buf+bytes_in)
  {
   ULONG index,maxlength,longest_match_index;
   UBYTE **this_queue;
   UBYTE *old_p_next_byte=p_next_byte;

   /* Hash the previous CONTEXT (=2) characters into a partition number.
      Any reasonable hash of the two context bytes into 0..HASHLENGTH-1
      will do here.                                                      */
   index=(40543*((((ULONG) p_next_byte[-2])<<8)^p_next_byte[-1])) >> 4;
   /* index&=0x3FFF; */
   index&=0xFFF;
   this_queue=&hash[index][0];

   /* Find the longest match in the HASHDEPTH most recent positions.   */
   /* Note: Because we are only calculating entropy here, we don't     */
   /* actually need to know the number of the best position.           */
   maxlength=0;
   longest_match_index=0;
   for (j=0;j<HASHDEPTH;j++)
   {
    k=0;
    while (k<MAXMATCH && p_next_byte+k<buf+bytes_in &&
           this_queue[j][k]==p_next_byte[k])
       k++;
    if (k>maxlength)
    {
     maxlength=k;
     longest_match_index=j;
    }
   }

   /* Both literals and copies output one control bit and eight data
      bits. They differ on how much input they consume.                */
   if (maxlength == 0) maxlength=1;
   if (maxlength>MEDMATCH && maxlength<MAXMATCH) maxlength=MEDMATCH;
   bits_out+=9;
   p_next_byte+=maxlength;

   /* Swap the winning pointer with the pointer halfway towards the
      front of the partition.                                          */
   if (longest_match_index > 1)
   {
    UBYTE *t;
    ULONG half=longest_match_index/2;
    t=this_queue[half];
    this_queue[half]=this_queue[longest_match_index];
    this_queue[longest_match_index]=t;
   }

   /* Shift the queue and put the new value in the recent end. */
   if (maxlength < NOSHIFTTHRESH)
   {
    for (j=HASHDEPTH-1;j>0;j--)
       this_queue[j]=this_queue[j-1];
    this_queue[0]=old_p_next_byte;
   }
  }
 }
 close(infile);
 printf("Bytes in  = %lu.\n",bits_in /8);
 printf("Bytes out = %lu.\n",bits_out/8);
 printf("Percentage remaining=%5.1f\n",
        100.0*((float) bits_out)/((float) bits_in));
 return 0;
}

void error(char *mess)
/* Prints out the string and terminates. */
{
 printf("%s\n",mess);
 exit(0);
}

----