----
LZRW4: ZIV AND LEMPEL MEET MARKOV
=================================
Author : Ross Williams
Date   : 23-Jul-1991.

This is a public domain document and may be copied freely so long as it is
not modified (text formatting transformations are permitted, however).

Note: This paper has not been polished to publication standard. It has been
prepared under considerable time pressure and is released mainly to make the
ideas it contains public, so as to avoid others re-inventing and patenting
them.

Abstract
--------
In 1977 and 1978, Ziv and Lempel [Ziv77][Ziv78] introduced the LZ class of
adaptive dictionary data compression techniques that have become so very
popular today. In competition with these techniques are the Markov
algorithms, which give better compression but run much more slowly. This
paper compares the two classes of algorithm and shows how algorithms that
combine techniques from both classes have the potential to provide higher
performance (i.e. a better speed/compression trade-off) than we have seen so
far. An algorithm called LZRW4 that combines Markov and LZ77 ideas is
sketched.

Comparison of Ziv/Lempel and Markov Algorithms
----------------------------------------------
To the casual observer, the Ziv/Lempel (LZ) data compression algorithms may
appear to be worlds away from the lesser known Markov algorithms. LZ
algorithms construct trees of strings and transmit strings as codewords.
Markov algorithms construct large numbers of contexts and fiddle with
arithmetic ranges and probabilities. The two kinds of algorithm seem to
operate on totally different objects. However, deep down, both classes of
algorithm are using the same probabilistic engine. Deep down, both classes
must be representing more common events with short codes and less common
events with longer codes. This is the law of data compression. But, without
some careful thought, it's hard to see how the two classes relate.

In the early eighties, Glen Langdon put in some careful thought and a lot of
insight and came up with two papers, [Langdon83] and [Langdon84], which have
been pivotal in my understanding of the field of data compression. I cannot
recommend these papers highly enough to anyone who wishes to gain an
understanding of what REALLY makes LZW tick, or to anyone who wants a broader
perspective on the field. The papers seem highly focused and obscure, but in
fact they cast a broad soft light over the whole field of text compression,
illuminating areas that for me were very dark (but might not be so dark for
people who read IEEE Transactions on Information Theory like Dickens). The
two papers have to be read carefully though. An earlier paper [Rissanen81] by
Rissanen and Langdon gives an even more general view, but does not deal
directly with LZ algorithms as the later papers do, and so I have focussed on
the later papers.

The idea behind the two papers is that all dictionary algorithms, including
the LZ classes (LZ77 and LZ78), are subsets of (i.e. can be emulated by)
Markov algorithms, but the converse is not true. Langdon shows us a clear
view of LZ algorithms through Markov eyes. This view is highly illuminating.
In [Langdon83], Langdon examines the LZ78 class of algorithm closely and
explains the underlying engine in probabilistic terms. In [Langdon84], he has
a more thorough byte at the problem, formalizing the whole case and proving
that dictionary algorithms can always be emulated by Markov algorithms, but
not the converse. I will attempt to explain the idea of [Langdon83] briefly.
I will review LZW but not give a detailed explanation of how it works.
LZ78/LZW builds its dictionary by parsing the input into phrases, each
consisting of the longest dictionary match to date. At the end of each phrase
parse, the algorithm adds a new phrase to the dictionary consisting of the
current phrase plus the next input character. The dictionary grows forever,
adding one new phrase per phrase parsed.

The dictionary may be viewed in many ways. We choose to view it as a
(forwards) trie structure whose leaves are to the right and whose root is at
the left. The following diagram illustrates such a tree. Each arc carries a
single character label. Each node represents a single string. In the diagram,
arcs are labelled in lower case and nodes in upper case.

              AA
             /
           a/
          A
        a/ \b
        /    \
    Root      AB
        b\
          \
           B

All dictionaries can be viewed in this way. In LZW, each node is allocated a
code value. Suppose we have two-bit codewords and the codes are allocated as
follows:

   0: A
   1: B
   2: AA
   3: AB

This code allocation implies that each event has an equal probability of one
quarter. We can thus label our tree with node occurrence probabilities.

              AA 1/4
             /
           a/
          A 1/4
        a/ \b
        /    \
    Root      AB 1/4
        b\
          \
           B 1/4

These node probabilities can then be converted into probabilities for the
ARCS:

              AA
             /
          a/ 1/3
          A
   3/4 a/ \b 1/3
        /    \
    Root      AB
        b\
      1/4 \
           B

(The 1/3 and 1/3 leaving node A don't add up to one because the A node itself
(which represents an imaginary arc labelled by the empty string) has the
remaining 1/3.)

These probabilities are exactly the sort of fodder that we need to feed a
Markov algorithm. From the above, it can be seen that the act of LZW parsing
the longest match and assigning a new codeword can be viewed as the execution
of a series of single-byte prediction decisions, each with its own separate
probabilities and each capable of being coded independently, and with
identical efficiency, using arithmetic coding. You can't see this normally
because LZW does the parse all in one go and outputs (for each phrase) a
codeword which looks, for all the world, to be indivisible. But deep down, we
all know that it isn't. Instead, it's made up of little pieces of entropy
accumulated during the traversal and cleverly bundled into 16-bit codewords
at the last minute.

So far, we have converted our LZW tree to a Markov-like prediction tree. Now
let's analyse it. Consider a single node in the LZW tree. Associated with the
node is a string (the characters on the arcs leading from the root to the
node). Each time that the parse reaches the node, it means that the node's
string has occurred starting at a phrase boundary. LZW then goes on further
to the node's child nodes and eventually reaches a leaf, where it adds a new
node AND ALLOCATES A NEW CODEWORD. The result is that each node accumulates a
subtree whose cardinality (= number of subnodes including itself, which in
this paper = number of codewords allocated to the subtree) is proportional to
the probability of the node's string arising as a prefix of a parse. As the
number of subnodes is also the node's share of the probability space (because
there is one codeword allocated to each node), we see that the amount of code
space allocated to a node is proportional to the number of times it has
occurred. Very neat. All this stuff is in [Langdon83], which quantifies it.
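To make the trie view concrete, here is a toy program (purely illustrative:
the identifiers, array sizes, and sample string are invented for this sketch,
and it is not a real LZW coder) that parses a string exactly as described
above, with one codeword per trie node:

/* toy_lzw.c -- illustrative sketch of LZ78/LZW parsing as a trie walk. */
#include <stdio.h>
#include <string.h>

#define ALPHABET 256
#define MAXNODES 4096                  /* one codeword per trie node      */

static int child[MAXNODES][ALPHABET];  /* arcs: -1 means "no such arc"    */
static int nodes = 0;                  /* codewords allocated so far      */

static int new_node(void)
{
    int i;
    for (i = 0; i < ALPHABET; i++) child[nodes][i] = -1;
    return nodes++;
}

int main(void)
{
    const unsigned char *in = (const unsigned char *)"abababababaabab";
    int len = (int)strlen((const char *)in);
    int i, pos = 0, root = new_node();

    /* Seed the dictionary with the single-character strings. */
    for (i = 0; i < ALPHABET; i++)
        child[root][i] = new_node();

    while (pos < len) {
        int n = root, start = pos;

        /* Walk the trie: each arc taken is one single-byte "prediction"
           made in the context of the phrase parsed so far.              */
        while (pos < len && child[n][in[pos]] != -1)
            n = child[n][in[pos++]];

        printf("phrase \"%.*s\" -> codeword %d\n",
               (int)(pos - start), (const char *)(in + start), n);

        /* Current phrase + next character gets a new codeword, so a
           node's subtree size ends up proportional to the number of
           times its string has started a phrase.                        */
        if (pos < len && nodes < MAXNODES)
            child[n][in[pos]] = new_node();
    }
    return 0;
}

Each step of the inner walk is one of the single-byte prediction decisions
described above; LZW merely defers the coding of all of them until the
codeword is emitted at the end of the phrase.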
This analysis explains why LZW compresses so well. Why, then, does it
compress so badly (compared to Markov models)? Without going into details,
here are some reasons:

* Approximately half of the code space is wasted at any point in time in
  unallocated codewords.

* Probability estimates are inaccurate near the leaves.

Other such reasons eat away at the compression. In particular, there is one
inefficiency of LZW-type compression that stands out. In our analysis, each
LZW parse of length d consists of d single-character predictions. Each
prediction is made in the context of the previous predictions. Thus, after
having parsed part of a phrase "ab", the LZW algorithm has two characters of
Markov context with which to "predict" the next character. After having
parsed "compressi", it would have nine characters of context. Thus we see
that, to the extent that we can view LZ78/LZW algorithms as making
predictions, they make predictions using a sawtooth context length.

Order
  ^
 4|        *
 3|      *         *             *
 2|    *         *       *     *           *
 1|  *         *       *     *       *   *
 0|*         *       *     *       *   *
  +---------|-------|-----|-------|---|-----|----> Characters
    phrase1   phr2   phr3   phr4  phr5  phr6

In contrast, Markov algorithms make predictions using a fairly flat (or
sometimes totally flat) context line. Thus, while a Markov algorithm may make
all of its predictions based on a 3-character context, an LZW algorithm will
make only some of its predictions from that depth. The thing that really
kills LZW's compression is the first one or two characters of each phrase:
the first is predicted with no context at all, and the second with a context
of only a single character. After that, the context is sufficient not to
introduce too much inefficiency. See Figure 64 of [Williams91] to get an idea
of how much impact this has on compression.

To compensate for its poor performance in the first one or two characters,
LZW grows an enormous tree. This has the effect of increasing the average
length of a phrase so that the beginnings of phrases occur less often. This
is like Pollyanna adding extra days to the week to avoid Sundays. As the
length of the input increases to infinity, so does the average phrase length,
with the startling result that by the time you get to infinity, LZW has
converged on the entropy of the source (as Ziv and Lempel proved in their
founding paper [Ziv78]). The rate of convergence depends on the average
phrase length, which depends on the entropy of the source. In practice, for
ordinary sources (e.g. text files), the average phrase length does not rise
fast enough to properly offset the effect of the low-context-length
predictions at the start of each phrase.

An additional reason for the relatively poor compression performance of LZW
can be found in its phrasing. Markov algorithms predict each character of the
input one at a time and record contextual information at each step. In
contrast, LZW divides the input up into phrases, each of which causes a
single parse through its tree. Thus, the "zero-order" model that LZW builds
up for the first character of each phrase is based only on the first
character of all the previous phrases it has seen, whereas a zero-order
Markov model builds its model based on ALL the characters it has seen. This
applies at all orders.

To summarize:

* All dictionary algorithms can be cast into Markov form.

* The compression grunt (order) that LZW applies is in the shape of an
  irregular sawtooth (quantified in the sketch below).

* LZW collects statistics for each context only on phrase boundaries.

* A little thought reveals that all this applies to the LZ77 class as well.
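The sawtooth can be put into rough numbers. A phrase of length d makes d
predictions at orders 0, 1, ..., d-1, so its mean prediction order is only
(d-1)/2. The following throwaway program (illustrative only; the phrase
lengths are the invented ones from the diagram above) tallies the prediction
orders for a list of phrase lengths:

/* sawtooth.c -- tally the context order of each prediction made during an
   LZW-style parse, for a made-up list of phrase lengths.                  */
#include <stdio.h>

int main(void)
{
    int phrase_len[] = {5, 4, 3, 4, 2, 3};      /* invented, as in the figure */
    int nphrases = sizeof(phrase_len) / sizeof(phrase_len[0]);
    int i, order, chars = 0, low_order = 0;
    long order_sum = 0;

    for (i = 0; i < nphrases; i++)
        for (order = 0; order < phrase_len[i]; order++) {
            order_sum += order;                 /* order climbs 0..d-1 */
            if (order < 2) low_order++;         /* the damaging predictions */
            chars++;
        }

    printf("characters coded          = %d\n", chars);
    printf("mean prediction order     = %.2f (a flat order-3 line gives 3.00)\n",
           (double)order_sum / chars);
    printf("order 0 or 1 predictions  = %d of %d\n", low_order, chars);
    return 0;
}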
A brief discussion of all this can be found in chapter one of my book
[Williams91], in particular Section 1.13, "Dictionary Methods vs Markov
Methods". However, for a full treatment, the reader is referred to
[Langdon83] and [Langdon84].

Ziv and Lempel meet Markov
--------------------------
Currently, the field of text data compression has two separate classes of
method: statistical (Markov) and dictionary (Ziv and Lempel). Markov
algorithms inch through the input character by character, recording and using
context information at each step. This yields the best compression, but makes
Markov algorithms slow. Ziv and Lempel algorithms gallop through the input
one phrase at a time, losing lots of non-phrase-aligned context information
and using zero-order models at the start of each phrase. This yields higher
speed but poorer compression.

Obviously it would be highly desirable to combine the speed of the Ziv and
Lempel algorithms with the compression power of the Markov algorithms. As
time goes by this seems less and less likely (without special (e.g. parallel)
hardware). However, there does seem to be one way of combining the two
classes of algorithm, and to my surprise I have not yet seen it used.

The idea is to create a group of contexts, but to parse phrases from each
context. For example, we might create 256 one-character contexts and grow an
LZW tree in each of them (sketched in code after the list of ideas below).
Compression would proceed exactly as with LZW except that the most recent
character transmitted (the last character of the previous phrase) would be
used to select one of the 256 contexts, each of which would contain an LZW
tree which would then be used to parse the phrase. The result would be to
take the low-order edge off LZW without losing any of its speed. Each
character of the phrase would be "predicted" using a model one order higher
than would have been used in LZW. At high orders this would have a negligible
effect, but at low orders it would have a significant effect, for the first
character of each phrase, formerly predicted by a zero-order model, would now
be predicted by a first-order model. Each phrase would be coded using only
the range required by the size of the tree in its context. By using contexts
at the start of the LZW tree, we raise the bottom of the sawtooth a few
notches.

Order
  ^
 6|        *
 5|      *         *             *
 4|    *         *       *     *           *
 3|  *         *       *     *       *   *
 2|*         *       *     *       *   *
 1|
 0|
  +---------|-------|-----|-------|---|-----|----> Characters
    phrase1   phr2   phr3   phr4  phr5  phr6

This idea is not at all specific to LZW, but could be applied in dozens of
ways to many different fast parsing adaptive algorithms. In fact my example
algorithm, LZRW4 (below), uses an LZ77-like algorithm. Here are some ideas:

* Use a two-byte context to hash into a table of LZW trees.

* Grow backwards context trees as in DHPC, but grow LZW parsing trees out of
  each node.

* Somehow use a single LZW tree BUT START PARSING ONE OR MORE CHARACTERS
  BACK! This avoids having two different kinds of data structure.

* Set up a set of contexts and bolt an LZ77 compressor on the back. Each
  context stores a list of pointers to strings that occurred in that context
  (see LZRW4 below).

* Use contexts with any other dynamic dictionary scheme.

* Use a rolling hash function in hardware to avoid hash delays at the start
  of phrases.

...plus combinations and permutations of the above.
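As a concrete sketch of the simplest of these combinations, the following toy
program grows 256 LZW-style tries, one per one-character context, and selects
the trie for each phrase using the last character coded. It is illustrative
only: it is not LZRW4, the identifiers and sizes are invented, and the coding
of literals is glossed over.

/* lzw_contexts.c -- illustrative sketch of parsing phrases from contexts. */
#include <stdio.h>
#include <string.h>

#define ALPHABET 256
#define MAXNODES 8192

static int child[MAXNODES][ALPHABET];   /* trie arcs: -1 means absent       */
static int nodes = 0;
static int root_of[ALPHABET];           /* one trie root per context        */

static int new_node(void)
{
    int i;
    for (i = 0; i < ALPHABET; i++) child[nodes][i] = -1;
    return nodes++;
}

int main(void)
{
    const unsigned char *in =
        (const unsigned char *)"the theme of the thesis is the theory";
    int len = (int)strlen((const char *)in);
    int c, pos = 0, context = 0;        /* context = last character coded   */

    for (c = 0; c < ALPHABET; c++)
        root_of[c] = new_node();        /* every context starts empty       */

    while (pos < len) {
        int n = root_of[context], start = pos;

        /* Longest match, but only within the selected context's trie. */
        while (pos < len && child[n][in[pos]] != -1)
            n = child[n][in[pos++]];

        /* As in plain LZW, matched string + next character becomes a new
           node; the unmatched character ends the phrase (a real coder
           would transmit it as a literal).                               */
        if (pos < len) {
            if (nodes < MAXNODES)
                child[n][in[pos]] = new_node();
            pos++;
        }

        printf("context '%c' -> phrase \"%.*s\"\n",
               (context >= 32 && context < 127) ? context : '.',
               (int)(pos - start), (const char *)(in + start));

        context = in[pos - 1];          /* selects the trie for the next phrase */
    }
    return 0;
}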
There are two drawbacks to all this. First, we are up for a lot more memory.
Second, we run up against the context-depth/sample-size trade-off encountered
by Markov algorithms: the deeper the contexts we establish, the more specific
they become, but the smaller the amount of information that each context has
on which to base its mini-model and hence its predictions. The first problem
can be overcome with more memory. The second problem requires careful tuning.

LZRW4
-----
To test out the idea, I devised LZRW4. I didn't implement LZRW4, but I did
write a program to measure the compression that it would yield. This program
is provided in Appendix A on an "as is" basis.

Note: The name LZRW4 is used here to describe an algorithm. However, the
eventual official code release of LZRW4 may differ in some details from the
algorithm presented here.

With a huge range of LZ algorithms from which to choose, I chose the LZ77
class, with which I am familiar. This avoided patent problems.

LZRW4 uses a hash table of 4096 partitions, each containing 32 pointers into
the input, which is held in a 1 Meg buffer. At the start of each phrase,
LZRW4 hashes the previous two characters to obtain a partition number in the
range 0..4095. The 32 pointers in that partition are searched for the longest
match, and the length (3 bits, representing a length in [2,3,4,5,6,7,8,16])
and the pointer number (5 bits, representing [0,...,31]) are coded and
transmitted. LZRW4 uses a single control bit for each phrase to say whether a
literal byte or a copy item is coming next. These control bits are buffered
as in the other LZRW* algorithms. The equal length of literal and copy items
(nine bits each, including the control bit) is an interesting feature of the
algorithm which seems bound to be useful to someone someday!

To update the partition, LZRW4 swaps the winning pointer with the pointer
halfway towards the front of the partition. If the winning pointer matched
fewer than 4 bytes, then a pointer to the current position (the start of the
Ziv) is shifted into the partition. See the code in Appendix A for the
details. An understanding of the LZRW3-A algorithm may also assist.
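To make the coding step concrete, here is a simplified sketch of the inner
loop just described. It is NOT the official LZRW4 release (which may differ
in detail), it measures nothing useful, and the hash function, identifiers,
and sample buffer are invented for the illustration; the measurement program
proper appears in the appendix.

/* lzrw4_step.c -- simplified, illustrative sketch of one LZRW4 coding step. */
#include <stdio.h>

#define HASHLENGTH    4096              /* partitions                        */
#define HASHDEPTH     32                /* pointers per partition            */
#define MAXMATCH      16
#define MEDMATCH      8                 /* codable lengths: 2..8 and 16      */
#define NOSHIFTTHRESH 4                 /* match this long: don't shift in   */

static const unsigned char *ptrs[HASHLENGTH][HASHDEPTH];

static unsigned hash2(const unsigned char *p)
{
    /* Illustrative two-byte hash; the real algorithm's hash may differ. */
    return (unsigned)((((p[0] << 8) ^ p[1]) * 40543UL) >> 4) & (HASHLENGTH - 1);
}

int main(void)
{
    static const unsigned char buf[] =
        "the cat sat on the mat and the cat sat on the hat";
    const unsigned char *p   = buf + 2;            /* first 2 bytes = context */
    const unsigned char *end = buf + sizeof(buf) - 1;
    unsigned long bits_out = 2 * 9;                /* first 2 bytes charged as literals */
    unsigned i, j;

    for (i = 0; i < HASHLENGTH; i++)               /* point every slot somewhere */
        for (j = 0; j < HASHDEPTH; j++)
            ptrs[i][j] = buf;

    while (p < end) {
        unsigned best_len = 0, best_slot = 0;
        const unsigned char **part = ptrs[hash2(p - 2)];

        /* Longest match among the 32 remembered positions for this context. */
        for (j = 0; j < HASHDEPTH; j++) {
            unsigned len = 0;
            while (len < MAXMATCH && p + len < end && part[j][len] == p[len])
                len++;
            if (len > best_len) { best_len = len; best_slot = j; }
        }
        if (best_len > MEDMATCH && best_len < MAXMATCH)
            best_len = MEDMATCH;                   /* only 2..8 and 16 are codable */

        if (best_len >= 2)
            bits_out += 1 + 3 + 5;                 /* copy: control + length + slot */
        else {
            best_len = 1;
            bits_out += 1 + 8;                     /* literal: control + byte */
        }

        /* Partition update: promote the winner halfway to the front; shift a
           pointer to the current position in if the match was short.         */
        if (best_len >= 2 && best_slot > 1) {
            const unsigned char *t = part[best_slot / 2];
            part[best_slot / 2]    = part[best_slot];
            part[best_slot]        = t;
        }
        if (best_len < NOSHIFTTHRESH) {
            for (j = HASHDEPTH - 1; j > 0; j--) part[j] = part[j - 1];
            part[0] = p;
        }
        p += best_len;
    }
    printf("output would be about %lu bits (%lu bytes)\n", bits_out, bits_out / 8);
    return 0;
}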
The results of LZRW4 are not startling, but they demonstrate that the idea
works. By using contexts, LZRW4 gets away with using 3-bit lengths and 5-bit
"indexes", whereas its ancestor algorithm LZRW3-A uses 4-bit lengths and
12-bit "indexes", yielding comparable compression. Here are some rough
results on some Calgary corpus files (percentage remaining):

             LZRW4   LZRW3-A   Unix Compress
   bib        39.5      44.1       41.8
   book1      51.4      54.7       43.2
   book2      41.2      45.5       41.1
   obj1       65.5      58.8       65.3
   obj2       43.4      43.8       52.1
   paper1     46.5      46.1       47.2
   progc      46.4      45.2       48.3
   trans      29.1      32.3       40.8

I have experimented with lots of variants of LZRW4 but have not formalized
them nor written them up. One particularly interesting idea is to subdivide
each partition into subpartitions based on the first character of the phrase.
LZRW3-A partition management techniques can then be used.

Closing Thoughts
----------------
I don't know whether this idea is original or whether it will yield any real
benefit. However, I have not heard of any Ziv/Lempel/Markov hybrids to date,
and the experiments above indicate at least that the idea works, even if they
do not establish a competitive advantage. With the right tuning, the use of
contexts could provide extra compression with little impact on speed (but
usually a lot of impact on memory!). At the very least, it is another weapon
in the data compression armoury.

References
----------
[Langdon83] Langdon G.G., "A Note on the Ziv-Lempel Model for Compressing
   Individual Sequences", IEEE Transactions on Information Theory, Vol. 29,
   No. 2, pp. 284-287, 1983.

[Langdon84] Langdon G.G., "On Parsing Versus Mixed-Order Model Structures for
   Data Compression", IBM Research Report RJ-4163 (46091) 1/18/84, IBM
   Research Laboratory, San Jose, CA 95193, 1984.

[Rissanen81] Rissanen J.J., Langdon G.G., "Universal Modeling and Coding",
   IEEE Transactions on Information Theory, Vol. 27, No. 1, pp. 12-23, 1981.

[Williams91] Williams R.N., "Adaptive Data Compression", Kluwer Academic
   Publishers, 101 Philip Drive, Assinippi Park, Norwell, Massachusetts
   02061, 1991.

[Ziv77] Ziv J., Lempel A., "A Universal Algorithm for Sequential Data
   Compression", IEEE Transactions on Information Theory, Vol. 23, No. 3,
   pp. 337-343, 1977.

[Ziv78] Ziv J., Lempel A., "Compression of Individual Sequences via
   Variable-Rate Coding", IEEE Transactions on Information Theory, Vol. 24,
   No. 5, pp. 530-536, 1978.

Appendix: LZRW4: Experimental Code
----------------------------------
The following code is public domain and is provided "as is". Like most
experimental code, it is poorly designed, structured, coded, and commented.

#include <stdio.h>
#include <stdlib.h>     /* exit()                               */
#include <fcntl.h>      /* open(), O_RDONLY                     */
#include <unistd.h>     /* read(), close()                      */
#include "port.h"       /* Defines UBYTE, ULONG, TRUE and so on. */

#define MY_BUFFER_SIZE (1024*1024)
#define HASHLENGTH     4096
#define HASHDEPTH      32
#define CONTEXT        2
#define MEDMATCH       8  /* Expressible lengths are [2..MEDMATCH,MAXMATCH]. */
#define MAXMATCH       16
#define NOSHIFTTHRESH  4  /* This or longer match and we don't shift.        */

void error(char *mess);

int main(int argc, char **argv)
{
 typedef UBYTE *queue[HASHDEPTH];  /* Element zero is most recent element. */
 static UBYTE buf[MY_BUFFER_SIZE+2000];
 static queue hash[HASHLENGTH];
 char *input_file_name;
 int infile;
 ULONG bits_in  = 0;
 ULONG bits_out = 0;

 if (argc != 2)
    {printf("To run lzcontext alg: x <filename>\n"); exit(0);}
 printf("Welcome to LZCONTEXT\n");
 printf("--------------------\n");
 printf("Buffer size        = %u\n",MY_BUFFER_SIZE);
 printf("Hash table length  = %u\n",HASHLENGTH   );
 printf("Hash table depth   = %u\n",HASHDEPTH    );
 printf("Context            = %u\n",CONTEXT      );
 printf("Med match          = %u\n",MEDMATCH     );
 printf("Max match          = %u\n",MAXMATCH     );
 printf("No shift threshold = %u\n",NOSHIFTTHRESH);

 input_file_name=argv[1];
 infile=open(input_file_name,O_RDONLY);  /* O_BINARY */
 if (infile < 0) error("Cannot open input file.");

 while (TRUE)
 {
  ULONG i,j,k;
  UBYTE *p_next_byte;
  ULONG bytes_in;

  /* Read in a block of bytes to compress. */
  bytes_in=read(infile,buf,(unsigned int) MY_BUFFER_SIZE);
  if (bytes_in == 0) break;
  bits_in+=8*bytes_in;

  /* Initialize the hash table so all entries point to a defined string. */
  for (i=0;i<HASHLENGTH;i++)
     for (j=0;j<HASHDEPTH;j++)
        hash[i][j]=buf;

  /* The first CONTEXT bytes are charged as literals so that there is
     always a two-character context to hash.                           */
  p_next_byte=buf+CONTEXT;
  bits_out+=9*CONTEXT;

  while (p_next_byte < buf+bytes_in)
  {
   ULONG index,maxlength,longest_match_index;
   UBYTE **this_queue;
   UBYTE *old_p_next_byte=p_next_byte;

   /* Hash the previous CONTEXT (=2) characters into a partition number.
      Any reasonable hash of the two context bytes into 0..HASHLENGTH-1
      will do here.                                                      */
   index=(40543*((((ULONG) p_next_byte[-2])<<8)^p_next_byte[-1])) >> 4;
   /* index&=0x3FFF; */
   index&=0xFFF;
   this_queue=&hash[index][0];

   /* Find the longest match in the HASHDEPTH most recent positions.   */
   /* Note: Because we are only calculating entropy here, we don't     */
   /* actually need to know the number of the best position.           */
   maxlength=0;
   longest_match_index=0;
   for (j=0;j<HASHDEPTH;j++)
   {
    k=0;
    while (k<MAXMATCH && p_next_byte+k<buf+bytes_in &&
           this_queue[j][k]==p_next_byte[k])
       k++;
    if (k>maxlength)
    {
     maxlength=k;
     longest_match_index=j;
    }
   }

   /* Both literals and copies output one control bit and eight data
      bits. They differ on how much input they consume.                */
   if (maxlength == 0) maxlength=1;
   if (maxlength>MEDMATCH && maxlength<MAXMATCH) maxlength=MEDMATCH;
   bits_out+=9;
   p_next_byte+=maxlength;

   /* Swap the winning pointer with the pointer halfway towards the
      front of the partition.                                          */
   if (longest_match_index > 1)
   {
    UBYTE *t;
    ULONG half=longest_match_index/2;
    t=this_queue[half];
    this_queue[half]=this_queue[longest_match_index];
    this_queue[longest_match_index]=t;
   }

   /* Shift the queue and put the new value in the recent end. */
   if (maxlength < NOSHIFTTHRESH)
   {
    for (j=HASHDEPTH-1;j>0;j--)
       this_queue[j]=this_queue[j-1];
    this_queue[0]=old_p_next_byte;
   }
  }
 }
 close(infile);
 printf("Bytes in  = %lu.\n",bits_in /8);
 printf("Bytes out = %lu.\n",bits_out/8);
 printf("Percentage remaining=%5.1f\n",
        100.0*((float) bits_out)/((float) bits_in));
 return 0;
}

void error(char *mess)
/* Prints out the string and terminates. */
{
 printf("%s\n",mess);
 exit(0);
}

----