
DATA COMPRESSION

Abstract

This paper surveys a variety of data compression methods spanning almost forty years of research, from the work of Shannon, Fano and Huffman in the late 40's to a technique developed in 1986. The aim of data compression is to reduce redundancy in stored or communicated data, thus increasing effective data density. Data compression has important application in the areas of file storage and distributed systems.

Concepts from information theory, as they relate to the goals and evaluation of data compression methods, are discussed briefly. A framework for evaluation and comparison of methods is constructed and applied to the algorithms presented. Comparisons of both theoretical and empirical natures are reported and possibilities for future research are suggested.

INTRODUCTION

Data compression is often referred to as coding, where coding is a very general term encompassing any special representation of data which satisfies a given need. Information theory is defined to be the study of efficient coding and its consequences, in the form of speed of transmission and probability of error [Ingels 1971]. Data compression may be viewed as a branch of information theory in which the primary objective is to minimize the amount of data to be transmitted. The purpose of this paper is to present and analyze a variety of data compression algorithms.

A simple characterization of data compression is that it involves transforming a string of characters in some representation (such as ASCII) into a new string (of bits, for example) which contains the same information but whose length is as small as possible. Data compression has important application in the areas of data transmission and data storage. Many data processing applications require storage of large volumes of data, and the number of such applications is constantly increasing as the use of computers extends to new disciplines. At the same time, the proliferation of computer communication networks is resulting in massive transfer of data over communication links. Compressing data to be stored or transmitted reduces storage and/or communication costs. When the amount of data to be transmitted is reduced, the effect is that of increasing the capacity of the communication channel. Similarly, compressing a file to half of its original size is equivalent to doubling the capacity of the storage medium. It may then become feasible to store the data at a higher, thus faster, level of the storage hierarchy and reduce the load on the input/output channels of the computer system.

Many of the methods to be discussed in this paper are implemented in production systems. The UNIX utilities compact and compress are based on methods to be discussed in Sections 4 and 5 respectively [UNIX 1984]. Popular file archival systems such as ARC and PKARC employ techniques presented in Sections 3 and 5 [ARC 1986; PKARC 1987]. The savings achieved by data compression can be dramatic; reduction as high as 80% is not uncommon [Reghbati 1981]. Typical values of compression provided by compact are: text (38%), Pascal source (43%), C source (36%) and binary (19%). Compress generally achieves better compression (50-60% for text such as source code and English), and takes less time to compute [UNIX 1984]. Arithmetic coding (Section 3.4) has been reported to reduce a file to anywhere from 12.1 to 73.5% of its original size [Witten et al. 1987]. Cormack reports that data compression programs based on Huffman coding (Section 3.2) reduced the size of a large student-record database by 42.1% when only some of the information was compressed. As a consequence of this size reduction, the number of disk operations required to load the database was reduced by 32.7% [Cormack 1985]. Data compression routines developed with specific applications in mind have achieved compression factors as high as 98% [Severance 1983].


FUNDAMENTAL CONCEPTS

A brief introduction to information theory is provided in this section. The definitions and assumptions necessary to a comprehensive discussion and evaluation of data compression methods are discussed. The following string of characters is used to illustrate the concepts defined: EXAMPLE = "aa bbb cccc ddddd eeeeee fffffffgggggggg" (40 characters in all; note that there is no space between the f's and the g's, so the ensemble contains exactly five spaces).

Definitions

A code is a mapping of source messages (words from the source alphabet alpha) into codewords (words of the code alphabet beta). The source messages are the basic units into which the string to be represented is partitioned. These basic units may be single symbols from the source alphabet, or they may be strings of symbols. For string EXAMPLE, alpha = { a, b, c, d, e, f, g, space}. For purposes of explanation, beta will be taken to be { 0, 1 }. Codes can be categorized as block-block, block-variable, variable-block or variable-variable, where block-block indicates that the source messages and codewords are of fixed length and variable-variable codes map variable-length source messages into variable-length codewords. A block-block code for EXAMPLE is shown in Figure 1.1 and a variable-variable code is given in Figure 1.2. If the string EXAMPLE were coded using the Figure 1.1 code, the length of the coded message would be 120; using Figure 1.2 the length would be 30.

Figure 1.1 -- A block-block code        Figure 1.2 -- A variable-variable code

source message   codeword               source message   codeword
a                000                    aa               0
b                001                    bbb              1
c                010                    cccc             10
d                011                    ddddd            11
e                100                    eeeeee           100
f                101                    fffffff          101
g                110                    gggggggg         110
space            111                    space            111
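As a quick check of the two coded lengths just quoted, the following sketch (Python is used here purely for illustration; the greedy longest-match parse is an assumption that happens to suffice for these two tables) encodes EXAMPLE with each code:

    # Minimal sketch: verify the coded lengths quoted above for the block-block
    # code (Figure 1.1) and the variable-variable code (Figure 1.2).

    EXAMPLE = "aa bbb cccc ddddd eeeeee fffffffgggggggg"   # 40 characters, 5 spaces

    # Figure 1.1: every source message is a single character, every codeword is 3 bits.
    block_block = {c: format(i, "03b") for i, c in enumerate("abcdefg ")}

    # Figure 1.2: variable-length source messages map to variable-length codewords.
    variable_variable = {
        "aa": "0", "bbb": "1", "cccc": "10", "ddddd": "11",
        "eeeeee": "100", "fffffff": "101", "gggggggg": "110", " ": "111",
    }

    def encode(ensemble, code):
        """Greedily parse the ensemble into source messages found in the code table."""
        out, i = [], 0
        while i < len(ensemble):
            # take the longest source message that matches at position i
            msg = max((m for m in code if ensemble.startswith(m, i)), key=len)
            out.append(code[msg])
            i += len(msg)
        return "".join(out)

    print(len(encode(EXAMPLE, block_block)))         # 120
    print(len(encode(EXAMPLE, variable_variable)))   # 30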

Classification of Methods

In addition to the categorization of data compression schemes with respect to message and codeword lengths, these methods are classified as either static or dynamic. A static method is one in which the mapping from the set of messages to the set of codewords is fixed before transmission begins, so that a given message is represented by the same codeword every time it appears in the message ensemble. The classic static defined-word scheme is Huffman coding [Huffman 1952]. In Huffman coding, the assignment of codewords to source messages is based on the probabilities with which the source messages appear in the message ensemble. Messages which appear more frequently are represented by short codewords; messages with smaller probabilities map to longer codewords. These probabilities are determined before transmission begins. A Huffman code for the ensemble EXAMPLE is given in Figure 1.3.

source message   probability   codeword
a                2/40          1001
b                3/40          1000
c                4/40          011
d                5/40          010
e                6/40          111
f                7/40          110
g                8/40          00
space            5/40          101

Figure 1.3 -- A Huffman code for EXAMPLE (code length=117).

A code is dynamic if the mapping from the set of messages to the set of codewords changes over time. For example, dynamic Huffman coding involves computing an approximation to the probabilities of occurrence "on the fly", as the ensemble is being transmitted. The assignment of codewords to messages is based on the values of the relative frequencies of occurrence at each point in time. A message x may be represented by a short codeword early in the transmission because it occurs frequently at the beginning of the ensemble, even though its probability of occurrence over the total ensemble is low. Later, when the more probable messages begin to occur with higher frequency, the short codeword will be mapped to one of the higher probability messages and x will be mapped to a longer codeword. As an illustration, Figure 1.4 presents a dynamic Huffman code table corresponding to the prefix aa bbb of EXAMPLE. Although the frequency of space over the entire message is greater than that of b, at this point in time b has higher frequency and therefore is mapped to the shorter codeword.

source message   probability   codeword
a                2/6           10
b                3/6           0
space            1/6           11

Figure 1.4 -- A dynamic Huffman code table for the prefix aa bbb of EXAMPLE.

Dynamic codes are also referred to in the literature as adaptive, in that they adapt to changes in ensemble characteristics over time. The term adaptive will be used for the remainder of this paper; the fact that these codes adapt to changing characteristics is the source of their appeal. Some adaptive methods adapt to changing patterns in the source [Welch 1984] while others exploit locality of reference [Bentley et al. 1986]. Locality of reference is the tendency, common in a wide variety of text types, for a particular word to occur frequently for short periods of time then fall into disuse for long periods.

A Data Compression Model

In order to discuss the relative merits of data compression techniques, a framework for comparison must be established. There are two dimensions along which each of the schemes discussed here may be measured: algorithm complexity and amount of compression. When data compression is used in a data transmission application, the goal is speed. Speed of transmission depends upon the number of bits sent, the time required for the encoder to generate the coded message, and the time required for the decoder to recover the original ensemble. In a data storage application, although the degree of compression is the primary concern, it is nonetheless necessary that the algorithm be efficient in order for the scheme to be practical. For a static scheme, there are three algorithms to analyze: the map construction algorithm, the encoding algorithm, and the decoding algorithm. For a dynamic scheme, there are just two algorithms: the encoding algorithm, and the decoding algorithm.

Several common measures of compression have been suggested: redundancy [Shannon and Weaver 1949], average message length [Huffman 1952], and compression ratio [Rubin 1976; Ruth and Kreutzer 1972]. These measures are defined below. Related to each of these measures are assumptions about the characteristics of the source. It is generally assumed in information theory that all statistical parameters of a message source are known with perfect accuracy [Gilbert 1971]. The most common model is that of a discrete memoryless source: a source whose output is a sequence of letters (or messages), each letter being a selection from some fixed alphabet a(1), ..., a(n). The letters are taken to be random, statistically independent selections from the alphabet, the selection being made according to some fixed probability assignment p(a(1)), ..., p(a(n)) [Gallager 1968]. Without loss of generality, the code alphabet is assumed to be {0,1} throughout this paper. The modifications necessary for larger code alphabets are straightforward.

It is assumed that any cost associated with the code letters is uniform. This is a reasonable assumption, although it omits applications like telegraphy where the code symbols are of different durations. The assumption is also important, since the problem of constructing optimal codes over unequal code letter costs is a significantly different and more difficult problem. Perl et al. and Varn have developed algorithms for minimum-redundancy prefix coding in the case of arbitrary symbol cost and equal codeword probability [Perl et al. 1975; Varn 1971]. The assumption of equal probabilities mitigates the difficulty presented by the variable symbol cost. For the more general unequal letter costs and unequal probabilities model, Karp has proposed an integer linear programming approach [Karp 1961]. There have been several approximation algorithms proposed for this more difficult problem [Krause 1962; Cot 1977; Mehlhorn 1980].

When data is compressed, the goal is to reduce redundancy, leaving only the informational content. The measure of information of a source message x (in bits) is -lg p(x) [lg denotes the base 2 logarithm]. This definition has intuitive appeal; in the case that p(x) = 1, it is clear that x is not at all informative since it had to occur. Similarly, the smaller the value of p(x), the more unlikely x is to appear, hence the larger its information content. The reader is referred to Abramson for a longer, more elegant discussion of these concepts [Abramson 1963]. The average information content over the source alphabet, the entropy H = SUM [-p(a(i)) lg p(a(i))], gives a lower bound on the average number of bits needed to represent a source message. If EXAMPLE is to be encoded one letter at a time, the entropy of its source can be calculated using the probabilities given in Figure 1.3: H = 2.894, so that the minimum number of bits contained in an encoding of EXAMPLE is 116 (40 x 2.894, rounded up). The Huffman code given in Section 1.2 does not quite achieve this theoretical minimum, using 117 bits.

Both of these definitions of information content are due to Shannon. A derivation of the concept of entropy as it relates to information theory is presented by Shannon [Shannon and Weaver 1949]. A simpler, more intuitive explanation of entropy is offered by Ash [Ash 1965]. The most common notion of a "good" code is one which is optimal in the sense of having minimum redundancy. Redundancy can be defined as: SUM p(a(i)) l(i) - SUM [-p(a(i)) lg p(a(i))] where l(i) is the length of the codeword representing message a(i). The expression SUM p(a(i)) l(i) represents the lengths of the codewords weighted by their probabilities of occurrence, that is, the average codeword length. The expression SUM [-p(a(i)) lg p(a(i))] is entropy, H. Thus, redundancy is a measure of the difference between average codeword length and average information content. If a code has minimum average codeword length for a given discrete probability distribution, it is said to be a minimum redundancy code.
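To make these definitions concrete, the short sketch below (Python, with the probabilities and codeword lengths taken from the Huffman code of Figure 1.3) computes the entropy of the EXAMPLE source, the average codeword length, and the redundancy of the code:

    # Minimal sketch: entropy and redundancy for the EXAMPLE source, using the
    # probabilities and codeword lengths of the Huffman code of Figure 1.3.
    from math import log2

    # message: (probability, codeword length)
    code = {
        "a": (2/40, 4), "b": (3/40, 4), "c": (4/40, 3), "d": (5/40, 3),
        "e": (6/40, 3), "f": (7/40, 3), "g": (8/40, 2), " ": (5/40, 3),
    }

    entropy = sum(-p * log2(p) for p, _ in code.values())   # H, about 2.894 bits/message
    avg_len = sum(p * l for p, l in code.values())          # SUM p(a(i)) l(i) = 2.925
    redundancy = avg_len - entropy                          # about 0.031 bits/message

    print(round(entropy, 3), round(avg_len, 3), round(redundancy, 3))
    print(round(40 * avg_len))   # 117 bits to code the 40-message ensemble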

We define the term local redundancy to capture the notion of redundancy caused by local properties of a message ensemble, rather than its global characteristics. While the model used for analyzing general-purpose coding techniques assumes a random distribution of the source messages, this may not actually be the case. In particular applications the tendency for messages to cluster in predictable patterns may be known. The existence of predictable patterns may be exploited to minimize local redundancy. Examples of applications in which local redundancy is common, and methods for dealing with local redundancy, are discussed in Section 2 and Section 6.2.

Huffman uses average message length, SUM p(a(i)) l(i), as a measure of the efficiency of a code. Clearly the meaning of this term is the average length of a coded message. We will use the term average codeword length to represent this quantity. Since redundancy is defined to be average codeword length minus entropy and entropy is constant for a given probability distribution, minimizing average codeword length minimizes redundancy.

Motivation

As discussed in the Introduction, data compression has wide application in terms of information storage, including representation of the abstract data type string [Standish 1980] and file compression. Huffman coding is used for compression in several file archival systems [ARC 1986; PKARC 1987], as is Lempel-Ziv coding, one of the adaptive schemes to be discussed in Section 5. An adaptive Huffman coding technique is the basis for the compact command of the UNIX operating system, and the UNIX compress utility employs the Lempel-Ziv approach [UNIX 1984].

In the area of data transmission, Huffman coding has been passed over for years in favor of block-block codes, notably ASCII. The advantage of Huffman coding is in the average number of bits per character transmitted, which may be much smaller than the lg n bits per character (where n is the source alphabet size) of a block-block system. The primary difficulty associated with variable-length codewords is that the rate at which bits are presented to the transmission channel will fluctuate, depending on the relative frequencies of the source messages. This requires buffering between the source and the channel. Advances in technology have both overcome this difficulty and contributed to the appeal of variable-length codes. Current data networks allocate communication resources to sources on the basis of need and provide buffering as part of the system. These systems require significant amounts of protocol, and fixed-length codes are quite inefficient for applications such as packet headers. In addition, communication costs are beginning to dominate storage and processing costs, so that variable-length coding schemes which reduce communication costs are attractive even if they are more complex. For these reasons, one could expect to see even greater use of variable-length coding in the future.

SEMANTIC DEPENDENT METHODS

Semantic dependent data compression techniques are designed to respond to specific types of local redundancy occurring in certain applications. One area in which data compression is of great importance is image representation and processing. There are two major reasons for this. The first is that digitized images contain a large amount of local redundancy. An image is usually captured in the form of an array of pixels, whereas methods which exploit the tendency for pixels of like color or intensity to cluster together can often represent the same image far more compactly. The second reason for the abundance of research in this area is volume. Digital images usually require a very large number of bits, and many uses of digital images involve large collections of images.

One technique used for compression of image data is run length encoding. In a common version of run length encoding, the sequence of image elements along a scan line (row) is mapped into a sequence of pairs (c,l) where c represents an intensity or color and l the length of the run (sequence of pixels of equal intensity). For pictures such as weather maps, run length encoding can save a significant number of bits over the image element sequence [Gonzalez and Wintz 1977]. Another data compression technique specific to the area of image data is difference mapping, in which the image is represented as an array of differences in brightness (or color) between adjacent pixels rather than the brightness values themselves. Difference mapping was used to encode the pictures of Uranus transmitted by Voyager 2. The 8 bits per pixel needed to represent 256 brightness levels was reduced to an average of 3 bits per pixel when difference values were transmitted [Laeser et al. 1986]. In spacecraft applications, image fidelity is a major concern due to the effect of the distance from the spacecraft to earth on transmission reliability. Difference mapping was combined with error-correcting codes to provide both compression and data integrity in the Voyager project. Another method which takes advantage of the tendency for images to contain large areas of constant intensity is the use of the quadtree data structure [Samet 1984]. Additional examples of coding techniques used in image processing can be found in Wilkins and Wintz and in Cappellini [Wilkins and Wintz 1971; Cappellini 1985].
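As an illustration of the (c,l) pairing just described, here is a minimal run length encoding sketch in Python; the scan line is an invented example rather than data from any of the cited applications:

    # Minimal sketch of run length encoding of one scan line: the sequence of
    # pixel values is mapped to (value, run length) pairs.
    from itertools import groupby

    def run_length_encode(scanline):
        """Map a sequence of pixel values to (value, run length) pairs."""
        return [(value, len(list(run))) for value, run in groupby(scanline)]

    def run_length_decode(pairs):
        return [value for value, length in pairs for _ in range(length)]

    line = [0, 0, 0, 0, 7, 7, 7, 7, 7, 7, 0, 0, 0, 0, 0, 0, 0, 3]   # hypothetical pixel values
    pairs = run_length_encode(line)
    print(pairs)                                  # [(0, 4), (7, 6), (0, 7), (3, 1)]
    assert run_length_decode(pairs) == line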

Data compression is of interest in business data processing, both because of the cost savings it offers and because of the large volume of data manipulated in many

business applications. The types of local redundancy present in business data files include runs of zeros in numeric fields, sequences of blanks in alphanumeric fields, and fields which are present in some records and null in others. Run length encoding can be used to compress sequences of zeros or blanks. Null suppression may be accomplished through the use of presence bits [Ruth and Kreutzer 1972]. Another class of methods exploits cases in which only a limited set of attribute values exist. Dictionary substitution entails replacing alphanumeric representations of information such as bank account type, insurance policy type, sex, month, etc. by the few bits necessary to represent the limited number of possible attribute values [Reghbati 1981].
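A minimal sketch of dictionary substitution for such a limited-value field follows (Python; the attribute values shown are hypothetical):

    # Minimal sketch of dictionary substitution for a field with a limited set of
    # attribute values; the field values here are hypothetical.
    from math import ceil, log2

    values = ["checking", "savings", "money-market", "certificate"]   # e.g., bank account types
    bits = max(1, ceil(log2(len(values))))                             # 2 bits suffice for 4 values

    encode = {v: format(i, f"0{bits}b") for i, v in enumerate(values)}
    decode = {codeword: v for v, codeword in encode.items()}

    print(encode["savings"])          # '01' -- 2 bits instead of an alphanumeric string
    print(decode["10"])               # 'money-market'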

Cormack describes a data compression system which is designed for use with database files [Cormack 1985]. The method, which is part of IBM's "Information Management System" (IMS), compresses individual records and is invoked each time a record is stored in the database file; expansion is performed each time a record is retrieved. Since records may be retrieved in any order, context information used by the compression routine is limited to a single record. In order for the routine to be applicable to any database, it must be able to adapt to the format of the record. The fact that database records are usually heterogeneous collections of small fields indicates that the local properties of the data are more important than its global characteristics. The compression routine in IMS is a hybrid method which attacks this local redundancy by using different coding schemes for different types of fields. The identified field types in IMS are letters of the alphabet, numeric digits, packed decimal digit pairs, blank, and other. When compression begins, a default code is used to encode the first character of the record. For each subsequent character, the type of the previous character determines the code to be used. For example, if the record 01870_ABCD__LMN were encoded with the letter code as default, the leading zero would be coded using the letter code; the 1, 8, 7, 0 and the first blank (_) would be coded by the numeric code. The A would be coded by the blank code; B, C, D, and the next blank by the letter code; the next blank and the L by the blank code; and the M and N by the letter code. Clearly, each code must define a codeword for every character; the letter code would assign the shortest codewords to letters, the numeric code would favor the digits, etc. In the system Cormack describes, the types of the characters are stored in the encode/decode data structures. When a character c is received, the decoder checks type(c) to detect which code table will be used in transmitting the next character. The compression algorithm might be more efficient if a special bit string were used to alert the receiver to a change in code table. Particularly if fields were reasonably long, decoding would be more rapid and the extra bits in the transmission would not be excessive. Cormack reports that the performance of the IMS compression routines is very good; at least fifty sites are currently using the system. He cites a case of a database containing student records whose size was reduced by 42.1%, and as a side effect the number of disk operations required to load the database was reduced by 32.7% [Cormack 1985].

A variety of approaches to data compression designed with text files in mind include use of a dictionary either representing all of the words in the file so that the file itself is coded as a list of pointers to the dictionary [Hahn 1974], or representing common words and word endings so that the file consists of pointers to the dictionary and encodings of the less common words [Tropper 1982]. Hand-selection of common phrases [Wagner 1973], programmed selection of prefixes and suffixes [Fraenkel et al. 1983] and programmed selection of common character pairs [Snyderman and Hunt 1970; Cortesi 1982] have also been investigated.

This discussion of semantic dependent data compression techniques represents a limited sample of a very large body of research. These methods and others of a like nature are interesting and of great value in their intended domains. Their obvious drawback lies in their limited utility. It should be noted, however, that much of the efficiency gained through the use of semantic dependent techniques can be achieved through more general methods, albeit to a lesser degree. For example, the dictionary approaches can be implemented through either Huffman coding ( Section 3.2, Section 4) or Lempel-Ziv codes (Section 5.1). Cormack's database scheme is a special case of the codebook approach (Section 3.2), and run length encoding is one of the effects of Lempel-Ziv codes.

STATIC DEFINED-WORD SCHEMES

Shannon-Fano Coding

The Shannon-Fano technique has as an advantage its simplicity. The code is constructed as follows: the source messages a(i) and their probabilities p( a(i) ) are listed in order of nonincreasing probability. This list is then divided in such a way as to form two groups of as nearly equal total probabilities as possible. Each message in the first group receives 0 as the first digit of its codeword; the messages in the second half have codewords beginning with 1. Each of these groups is then divided according to the same criterion and additional code digits are appended. The process is continued until each subset contains only one message. Clearly the Shannon-Fano algorithm yields a minimal prefix code.

source message   probability   codeword
a                1/2           0
b                1/4           10
c                1/8           110
d                1/16          1110
e                1/32          11110
f                1/32          11111

Figure 3.1 -- A Shannon-Fano code.

Figure 3.1 shows the application of the method to a particularly simple probability distribution. The length of each codeword x is equal to -lg p(x). This is true as long as it is possible to divide the list into subgroups of exactly equal probability. When this is not possible, some codewords may be of length -lg p(x)+1. The Shannon-Fano algorithm yields an average codeword length S which satisfies H <= S <= H + 1. In Figure 3.2, the Shannon-Fano code for ensemble EXAMPLE is given. As is often the case, the average codeword length is the same as that achieved by the Huffman code (see Figure 1.3). That the Shannon-Fano algorithm is not guaranteed to produce an optimal code is demonstrated by the following set of probabilities: { .35, .17, .17, .16, .15 }. The Shannon-Fano code for this distribution is compared with the Huffman code in Figure 3.4.

source message   probability   codeword
g                8/40          00
f                7/40          010
e                6/40          011
d                5/40          100
space            5/40          101
c                4/40          110
b                3/40          1110
a                2/40          1111

Figure 3.2 -- A Shannon-Fano code for EXAMPLE (code length=117).
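The recursive splitting procedure described above is straightforward to express in code. The following sketch (Python; the split point is chosen to make the two groups' totals as nearly equal as possible, with ties broken arbitrarily) reproduces the 117-bit Shannon-Fano coding of EXAMPLE:

    # Minimal sketch of Shannon-Fano code construction.  Input: source messages
    # with weights (counts or probabilities), listed in order of nonincreasing weight.

    def shannon_fano(messages, prefix=""):
        """messages: list of (message, weight), nonincreasing in weight."""
        if len(messages) == 1:
            return {messages[0][0]: prefix or "0"}
        total = sum(w for _, w in messages)
        # choose the split that makes the two groups' totals as nearly equal as possible
        best_split, best_diff, running = 1, float("inf"), 0.0
        for i in range(1, len(messages)):
            running += messages[i - 1][1]
            diff = abs(running - (total - running))
            if diff < best_diff:
                best_split, best_diff = i, diff
        code = shannon_fano(messages[:best_split], prefix + "0")   # first group: next digit 0
        code.update(shannon_fano(messages[best_split:], prefix + "1"))   # second group: 1
        return code

    freqs = [("g", 8), ("f", 7), ("e", 6), ("d", 5),
             (" ", 5), ("c", 4), ("b", 3), ("a", 2)]    # counts out of 40
    code = shannon_fano(freqs)
    print(code)
    print(sum(len(code[m]) * n for m, n in freqs))      # 117 bits for EXAMPLE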

Static Huffman Coding

Huffman's algorithm, expressed graphically, takes as input a list of nonnegative weights {w(1), ... ,w(n) } and constructs a full binary tree [a binary tree is full if every node has either zero or two children] whose leaves are labeled with the weights. When the Huffman algorithm is used to construct a code, the weights represent the probabilities associated with the source letters. Initially there is a set of singleton trees, one for each weight in the list. At each step in the algorithm the trees corresponding to the two smallest weights, w(i) and w(j), are merged into a new tree whose weight is w(i)+w(j) and whose root has two children which are the subtrees represented by w(i) and w(j). The weights w(i) and w(j) are removed from the list and w(i)+w(j) is inserted into the list. This process continues until the weight list contains a single value. If, at any time, there is more than one way to choose a smallest pair of weights, any such pair may be chosen. In Huffman's paper, the process begins with a nonincreasing list of weights. This detail is not important to the correctness of the algorithm, but it does provide a more efficient implementation [Huffman 1952]. The Huffman algorithm is demonstrated in Figure 3.3.

The Huffman algorithm determines the lengths of the codewords to be mapped to each of the source letters a(i). There are many alternatives for specifying the actual digits; it is necessary only that the code have the prefix property. The usual assignment entails labeling the edge from each parent to its left child with the digit 0 and the edge to the right child with 1. The codeword for each source letter is the sequence of labels along the path from the root to the leaf node representing that letter. The codewords for the source of Figure 3.3, in order of decreasing probability, are {01,11,001,100,101,0000,0001}. Clearly, this process yields a minimal prefix code. Further, the algorithm is guaranteed to produce an optimal (minimum redundancy) code [Huffman 1952]. Gallager has proved an upper bound on the redundancy of a Huffman code of p(n) + lg [(2 lg e)/e] which is approximately p(n) + 0.086, where p(n) is the probability of the least likely source message [Gallager 1978]. In a recent paper, Capocelli et al. provide new bounds which are tighter than those of Gallager for some probability distributions [Capocelli et al. 1986]. Figure 3.4 shows a distribution for which the Huffman code is optimal while the Shannon-Fano code is not.
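A compact rendering of the merging procedure, using a heap to locate the two smallest weights, is sketched below (Python; tie-breaking here is arbitrary, so individual codewords may differ from Figure 1.3, but every Huffman code for a given distribution has the same average codeword length, hence the same 117-bit total for EXAMPLE):

    # Minimal sketch of Huffman's algorithm: repeatedly merge the two trees of
    # smallest weight until a single tree remains, then read codewords off the tree.
    import heapq
    from itertools import count

    def huffman(weights):
        """weights: dict message -> weight.  Returns dict message -> codeword."""
        tiebreak = count()                      # keeps heap entries comparable
        heap = [(w, next(tiebreak), m) for m, w in weights.items()]   # singleton trees
        heapq.heapify(heap)
        while len(heap) > 1:
            w1, _, t1 = heapq.heappop(heap)     # the two smallest weights
            w2, _, t2 = heapq.heappop(heap)
            heapq.heappush(heap, (w1 + w2, next(tiebreak), (t1, t2)))
        _, _, root = heap[0]

        codes = {}
        def assign(tree, prefix):               # label left edges 0, right edges 1
            if isinstance(tree, tuple):
                assign(tree[0], prefix + "0")
                assign(tree[1], prefix + "1")
            else:
                codes[tree] = prefix or "0"
        assign(root, "")
        return codes

    weights = {"a": 2, "b": 3, "c": 4, "d": 5, "e": 6, "f": 7, "g": 8, " ": 5}
    codes = huffman(weights)
    print(sum(len(codes[m]) * w for m, w in weights.items()))   # 117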

In addition to the fact that there are many ways of forming codewords of appropriate lengths, there are cases in which the Huffman algorithm does not uniquely determine these lengths due to the arbitrary choice among equal minimum weights. As an example, codes with codeword lengths of {1,2,3,4,4} and of {2,2,2,3,3} both yield the same average codeword length for a source with probabilities {.4,.2,.2,.1,.1}. Schwartz defines a variation of the Huffman algorithm which performs "bottom merging"; that is, orders a new parent node above existing nodes of the same weight and always merges the last two weights in the list. The code constructed is the Huffman code with minimum values of maximum codeword length (MAX{ l(i) }) and total codeword length (SUM{ l(i) }) [Schwartz 1964]. Schwartz and Kallick describe an implementation of Huffman's algorithm with bottom merging [Schwartz and Kallick 1964]. The Schwartz-Kallick algorithm and a later algorithm by Connell [Connell 1973] use Huffman's procedure to determine the lengths of the codewords, and actual digits are assigned so that the code has the numerical sequence property. That is, codewords of equal length form a consecutive sequence of binary numbers. Shannon-Fano codes also have the numerical sequence property. This property can be exploited to achieve a compact representation of the code and rapid encoding and decoding.

                 probability   S-F    Huffman
a(1)             0.35          00     1
a(2)             0.17          01     011
a(3)             0.17          10     010
a(4)             0.16          110    001
a(5)             0.15          111    000

Average codeword length        2.31   2.30

Figure 3.4 -- Shannon-Fano and Huffman codes for the distribution { .35, .17, .17, .16, .15 }.

Universal Codes and Representations of the Integers

A code is universal if it maps source messages to codewords so that the resulting average codeword length is bounded by c1(H + c2). That is, given an arbitrary source with nonzero entropy, a universal code achieves average codeword length which is at most a constant times the optimal possible for that source. The potential compression offered by a universal code clearly depends on the magnitudes of the constants c1 and c2. We recall the definition of an asymptotically optimal code as one for which average codeword length approaches entropy and remark that a universal code with c1=1 is asymptotically optimal.

An advantage of universal codes over Huffman codes is that it is not necessary to know the exact probabilities with which the source messages appear. While Huffman coding is not applicable unless the probabilities are known, it is sufficient in the case of universal coding to know the probability distribution only to the extent that the source messages can be ranked in probability order. By mapping messages in order of decreasing probability to codewords in order of increasing length, universality can be achieved. Another advantage to universal codes is that the codeword sets are fixed. It is not necessary to compute a codeword set based upon the statistics of an ensemble; any universal codeword set will suffice as long as the source messages are ranked. The encoding and decoding processes are thus simplified. While universal codes can be used instead of Huffman codes as general-purpose static schemes, the more common application is as an adjunct to a dynamic scheme. This type of application will be demonstrated in Section 5.

Since the ranking of source messages is the essential parameter in universal coding, we may think of a universal code as representing an enumeration of the source messages, or as representing the integers, which provide an enumeration. Elias defines a sequence of universal coding schemes which map the set of positive integers onto the set of binary codewords [Elias 1975].

integer   gamma          delta
1         1              1
2         010            0100
3         011            0101
4         00100          01100
5         00101          01101
6         00110          01110
7         00111          01111
8         0001000        00100000
16        000010000      001010000
17        000010001      001010001
32        00000100000    0011000000

Figure 3.5 -- Elias codes gamma and delta for a sampling of the integers.

The first Elias code is one which is simple but not optimal. This code, gamma, maps an integer x onto the binary value of x prefaced by floor( lg x) zeros. The binary value of x is expressed in as few bits as possible, and therefore begins with a 1, which serves to delimit the prefix. The result is an instantaneously decodable code since the total length of a codeword is exactly one greater than twice the number of zeros in the prefix; therefore, as soon as the first 1 of a codeword is encountered, its length is known. The code is not a minimum redundancy code since the ratio of expected codeword length to entropy goes to 2 as entropy approaches infinity. The second code, delta, maps an integer x to a codeword consisting of gamma (floor[lg x] +1) followed by the binary value of x with the leading 1 deleted. The resulting codeword has length floor[lg x] + 2 floor[lg (1 + floor[lg x] ) ] + 1. This concept can be applied recursively to shorten the codeword lengths, but the benefits decrease rapidly. The code delta is asymptotically optimal since the limit of the ratio of expected codeword length to entropy is 1. Figure 3.5 lists the values of gamma and delta for a sampling of the integers. Figure 3.6 shows an Elias code for string EXAMPLE. The number of bits transmitted using this mapping would be 161, which does not compare well with the 117 bits transmitted by the Huffman code of Figure 1.3. Huffman coding is optimal under the static mapping model. Even an asymptotically optimal universal code cannot compare with static Huffman coding on a source for which the probabilities of the messages are known.
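Both Elias codes are easy to state directly in code; the sketch below (Python) reproduces the codewords of Figure 3.5:

    # Minimal sketch of the Elias codes gamma and delta for positive integers.

    def gamma(x):
        """floor(lg x) zeros, followed by the binary value of x."""
        assert x >= 1
        b = bin(x)[2:]                 # binary value of x, beginning with a 1
        return "0" * (len(b) - 1) + b

    def delta(x):
        """gamma(floor(lg x) + 1), followed by the binary value of x with the leading 1 deleted."""
        assert x >= 1
        b = bin(x)[2:]
        return gamma(len(b)) + b[1:]

    for i in (1, 2, 3, 4, 8, 16, 17, 32):
        print(i, gamma(i), delta(i))
    # e.g. gamma(17) = 000010001 and delta(17) = 001010001, as in Figure 3.5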

Figure 3.6 (table not reproduced) -- An Elias code for string EXAMPLE, listing source message, frequency, rank, and codeword (code length=161).

ADAPTIVE HUFFMAN CODING

Adaptive Huffman coding was first conceived independently by Faller and Gallager [Faller 1973; Gallager 1978]. Knuth contributed improvements to the original algorithm [Knuth 1985] and the resulting algorithm is referred to as algorithm FGK. A more recent version of adaptive Huffman coding is described by Vitter [Vitter 1987]. All of these methods are defined-word schemes which determine the mapping from source messages to codewords based upon a running estimate of the source message probabilities. The code is adaptive, changing so as to remain optimal for the current estimates. In this way, the adaptive Huffman codes respond to locality. In essence, the encoder is "learning" the characteristics of the source. The decoder must learn along with the encoder by continually updating the Huffman tree so as to stay in synchronization with the encoder.

Another advantage of these systems is that they require only one pass over the data. Of course, one-pass methods are not very interesting if the number of bits they transmit is significantly greater than that of the two-pass scheme. Interestingly, the performance of these methods, in terms of number of bits transmitted, can be better than that of static Huffman coding. This does not contradict the optimality of the static method as the static method is optimal only over all methods which assume a time-invariant mapping. The performance of the adaptive methods can also be worse than that of the static method. Upper bounds on the redundancy of these methods are presented in this section. As discussed in the introduction, the adaptive method of Faller, Gallager and Knuth is the basis for the UNIX utility compact. The performance of compact is quite good, providing typical compression factors of 30-40%.

Algorithm FGK

The basis for algorithm FGK is the Sibling Property, defined by Gallager [Gallager 1978]: A binary code tree has the sibling property if each node (except the root) has a sibling and if the nodes can be listed in order of nonincreasing weight with each node adjacent to its sibling. Gallager proves that a binary prefix code is a Huffman code if and only if the code tree has the sibling property. In algorithm FGK, both sender and receiver maintain dynamically changing Huffman code trees. The leaves of the code tree represent the source messages and the weights of the leaves represent frequency counts for the messages. At any point in time, k of the n possible source messages have occurred in the message ensemble.

Figure 4.1 -- Algorithm FGK processing the ensemble EXAMPLE (a) Tree after processing "aa bb"; 11 will be transmitted for the next b. (b) After encoding the third b; 101 will be transmitted for the next space; the tree will not change; 100 will be transmitted for the first c. (c) Tree after update following first c.

Initially, the code tree consists of a single leaf node, called the 0-node. The 0-node is a special node used to represent the n-k unused messages. For each message transmitted, both parties must increment the corresponding weight and recompute the code tree to maintain the sibling property. At the point in time when t messages have been transmitted, k of them distinct, and k < n, the tree is a legitimate Huffman code tree with k+1 leaves: one for each of the k messages seen so far and one for the 0-node.

Disregarding overhead, the number of bits transmitted by algorithm FGK for the EXAMPLE is 129. The static Huffman algorithm would transmit 117 bits in processing the same data. The overhead associated with the adaptive method is actually less than that of the static algorithm. In the adaptive case the only overhead is the n lg n bits needed to represent each of the n different source messages when they appear for the first time. (This is in fact conservative; rather than transmitting a unique code for each of the n source messages, the sender could transmit the message's position in the list of remaining messages and save a few bits in the average case.) In the static case, the source messages need to be sent as does the shape of the code tree. As discussed in Section 3.2, an efficient representation of the tree shape requires 2n bits. Algorithm FGK compares well with static Huffman coding on this ensemble when overhead is taken into account. Figure 4.3 illustrates an example on which algorithm FGK performs better than static Huffman coding even without taking overhead into account. Algorithm FGK transmits 47 bits for this ensemble while the static Huffman code requires 53.

Algorithm V

The adaptive Huffman algorithm of Vitter (algorithm V) incorporates two improvements over algorithm FGK. First, the number of interchanges in which a node is moved upward in the tree during a recomputation is limited to one. This number is bounded in algorithm FGK only by l/2 where l is the length of the codeword for a(t+1) when the recomputation begins. Second, Vitter's method minimizes the values of SUM{ l(i) } and MAX{l(i)} subject to the requirement of minimizing SUM{ w(i) l(i) }. The intuitive explanation of algorithm V's advantage over algorithm FGK is as follows: as in algorithm FGK, the code tree constructed by algorithm V is the Huffman code tree for the prefix of the ensemble seen so far. The adaptive methods do not assume that the relative frequencies of a prefix represent accurately the symbol probabilities over the entire message. Therefore, the fact that algorithm V guarantees a tree of minimum height (MAX{ l(i) }) and minimum external path length (SUM{ l(i) }) implies that it is better suited for coding the next message of the ensemble, given that any of the leaves of the tree may represent that next message.

These improvements are accomplished through the use of a new system for numbering nodes. The numbering, called an implicit numbering, corresponds to a level ordering of the nodes (from bottom to top and left to right). Figure 4.4 illustrates that the numbering of algorithm FGK is not always a level ordering. The following invariant is maintained in Vitter's algorithm: For each weight w, all leaves of weight w precede (in the implicit numbering) all internal nodes of weight w. Vitter proves that this invariant enforces the desired bound on node promotions [Vitter 1987]. The invariant also implements bottom merging, as discussed in Section 3.2, to minimize SUM{ l(i) } and MAX{ l(i) }. The difference between Vitter's method and algorithm FGK is in the way the tree is updated between transmissions. In order to understand the revised update operation, the following definition of a block of nodes is necessary: Blocks are equivalence classes of nodes defined by u is equivalent to v iff weight(u) = weight(v) and u and v are either both leaves or both internal nodes. The leader of a block is the highest-numbered (in the implicit numbering) node in the block. Blocks are ordered by increasing weight with the convention that a leaf block always precedes an internal block of the same weight. When an exchange of nodes is required to maintain the sibling property, algorithm V requires that the node being promoted be moved to the position currently occupied by the highest-numbered node in the target block.

Figure 4.6 illustrates the tree built by Vitter's method for the ensemble of Figure 4.3. Both SUM{l(i)} and MAX{l(i)} are smaller in the tree of Figure 4.6. The number of bits transmitted during the processing of the sequence is 47, the same as that used by algorithm FGK. However, if the transmission continues with d,b,c,f or an unused letter, the cost of algorithm V will be less than that of algorithm FGK. This again illustrates the benefit of minimizing the external path length SUM{l(i)} and the height MAX{l(i)}.

It should be noted again that the strategy of minimizing external path length and height is optimal under the assumption that any source letter is equally likely to occur next. Other reasonable strategies include one which assumes locality. To take advantage of locality, the ordering of tree nodes with equal weights could be determined on the basis of recency. Another reasonable assumption about adaptive coding is that the weights in the current tree correspond closely to the probabilities associated with the source. This assumption becomes more reasonable as the length of the ensemble increases. Under this assumption, the expected cost of transmitting the next letter is SUM{ p(i) l(i) } which is approximately SUM{ w(i) l(i) }, so that neither algorithm FGK nor algorithm V has any advantage.

Vitter proves that the performance of his algorithm is bounded by S - n + 1 from below and S + t - 2n + 1 from above [Vitter 1987]. At worst then, Vitter's adaptive method may transmit one more bit per codeword than the static Huffman method. The improvements made by Vitter do not change the complexity of the algorithm; algorithm V encodes and decodes in O(l) time as does algorithm FGK.

OTHER ADAPTIVE METHODS

Two more adaptive data compression methods, algorithm BSTW and Lempel-Ziv coding, are discussed in this section. Like the adaptive Huffman coding techniques, these methods do not require a first pass to analyze the characteristics of the source. Thus, they provide coding and transmission in real time. However, these schemes diverge from the fundamental Huffman coding approach to a greater degree than the methods discussed in Section 4. Algorithm BSTW is a defined-word scheme which attempts to exploit locality. Lempel-Ziv coding is a free-parse method; that is, the words of the source alphabet are defined dynamically, as the encoding is performed. Lempel-Ziv coding is the basis for the UNIX utility compress. Algorithm BSTW is a variable-variable scheme, while Lempel-Ziv coding is variable-block.

Lempel-Ziv Codes

Lempel-Ziv coding represents a departure from the classic view of a code as a mapping from a fixed set of source messages (letters, symbols or words) to a fixed set of codewords. We coin the term free-parse to characterize this type of code, in which the set of source messages and the codewords to which they are mapped are defined as the algorithm executes. While all adaptive methods create a set of codewords dynamically, defined-word schemes have a fixed set of source messages, defined by context (e.g., in text file processing the source messages might be single letters; in Pascal source file processing the source messages might be tokens). Lempel-Ziv coding defines the set of source messages as it parses the ensemble.

The Lempel-Ziv algorithm consists of a rule for parsing strings of symbols from a finite alphabet into substrings, or words, whose lengths do not exceed a prescribed integer L(1); and a coding scheme which maps these substrings sequentially into uniquely decipherable codewords of fixed length L(2) [Ziv and Lempel 1977]. The strings are selected so that they have very nearly equal probability of occurrence. As a result, frequently-occurring symbols are grouped into longer strings while infrequent symbols appear in short strings. This strategy is effective at exploiting redundancy due to symbol frequency, character repetition, and high-usage patterns. Figure 5.1 shows a small Lempel-Ziv code table. Low-frequency letters such as Z are assigned individually to fixed-length codewords (in this case, 12 bit binary numbers represented in base ten for readability). Frequently-occurring symbols, such as blank (represented by _) and zero, appear in long strings. Effective compression is achieved when a long string is replaced by a single 12-bit code.

Symbol string   Code
A               1
T               2
AN              3
TH              4
THE             5
AND             6
AD              7
_               8
__              9
___             10
0               11
00              12
000             13
0000            14
Z               15
###             4095

Figure 5.1 -- A Lempel-Ziv code table.

The Lempel-Ziv method is an incremental parsing strategy in which the coding process is interlaced with a learning process for varying source characteristics [Ziv and Lempel 1977]. In Figure 5.1, run-length encoding of zeros and blanks is being learned.

The Lempel-Ziv algorithm parses the source ensemble into a collection of segments of gradually increasing length. At each encoding step, the longest prefix of the remaining source ensemble which matches an existing table entry (alpha) is parsed off, along with the character (c) following this prefix in the ensemble. The new source message, alpha c, is added to the code table. The new table entry is coded as (i,c) where i is the codeword for the existing table entry and c is the appended character. For example, the ensemble 010100010 is parsed into { 0, 1, 01, 00, 010 } and is coded as { (0,0), (0,1), (1,1), (1,0), (3,0) }. The table built for the message ensemble EXAMPLE is shown in Figure 5.2. The coded ensemble has the form: { (0,a), (1,space), (0,b), (3,b), (0,space), (0,c), (6,c), (6,space), (0,d), (9,d), (10,space), (0,e), (12,e), (13,e), (5,f), (0,f), (16,f), (17,f), (0,g), (19,g), (20,g), (20) }. The string table is represented in a more efficient manner than in Figure 5.1; the string is represented by its prefix codeword followed by the extension character, so that the table entries have fixed length. The Lempel-Ziv strategy is simple, but greedy. It simply parses off the longest recognized string each time rather than searching for the best way to parse the ensemble.

Message   Codeword        Message   Codeword
a         1               e         12
1space    2               12e       13
b         3               13e       14
3b        4               5f        15
space     5               f         16
c         6               16f       17
6c        7               17f       18
6space    8               g         19
d         9               19g       20
9d        10              20g       21
10space   11

Figure 5.2 -- Lempel-Ziv table for the message ensemble EXAMPLE (code length=173).
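A minimal sketch of the incremental parsing described above follows (Python). It reproduces the parse of 010100010 and the final entries of the coded ensemble for EXAMPLE; index 0 denotes the empty prefix, and a bare index is emitted when the ensemble ends in the middle of a match, as in the final (20) above:

    # Minimal sketch of Lempel-Ziv incremental parsing (the scheme of Figure 5.2).
    # Each step parses off the longest prefix already in the table plus one more
    # character; index 0 denotes the empty string.

    def lz_parse(ensemble):
        table = {"": 0}                 # string -> codeword index
        output, current = [], ""
        for ch in ensemble:
            if current + ch in table:   # keep extending the match
                current += ch
            else:
                output.append((table[current], ch))     # (prefix index, new character)
                table[current + ch] = len(table)        # next free index
                current = ""
        if current:                     # ensemble ended inside a match
            output.append((table[current],))
        return output

    print(lz_parse("010100010"))
    # [(0, '0'), (0, '1'), (1, '1'), (1, '0'), (3, '0')]
    print(lz_parse("aa bbb cccc ddddd eeeeee fffffffgggggggg")[-3:])
    # [(19, 'g'), (20, 'g'), (20,)]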

The Lempel-Ziv method specifies fixed-length codewords. The size of the table and the maximum source message length are determined by the length of the codewords. It should be clear from the definition of the algorithm that Lempel-Ziv codes tend to be quite inefficient during the initial portion of the message ensemble. For example, even if we assume 3-bit codewords for characters a through g and space and 5-bit codewords for table indices, the Lempel-Ziv algorithm transmits 173 bits for ensemble EXAMPLE. This compares poorly with the other methods discussed in this survey. The ensemble must be sufficiently long for the procedure to build up enough symbol frequency experience to achieve good compression over the full ensemble.

If the codeword length is not sufficiently large, Lempel-Ziv codes may also rise slowly to reasonable efficiency, maintain good performance briefly, and fail to make any gains once the table is full and messages can no longer be added. If the ensemble's characteristics vary over time, the method may be "stuck with" the behavior it has learned and may be unable to continue to adapt.

Algorithm BSTW

The most recent of the algorithms surveyed here is due to Bentley, Sleator, Tarjan and Wei [Bentley et al. 1986]. This method, algorithm BSTW, possesses the advantage that it requires only one pass over the data to be transmitted yet has performance which compares well to that of the static two-pass method along the dimension of number of bits per word transmitted. This number of bits is never much larger than the number of bits transmitted by static Huffman coding (in fact, is usually quite close), and can be significantly better. Algorithm BSTW incorporates the additional benefit of taking advantage of locality of reference, the tendency for words to occur frequently for short periods of time then fall into long periods of disuse. The algorithm uses a self-organizing list as an auxiliary data structure and employs shorter encodings for words near the front of this list. There are many strategies for maintaining self-organizing lists (see [Hester and Hirschberg 1985]); algorithm BSTW uses move-to-front.

A simple example serves to outline the method of algorithm BSTW. As in other adaptive schemes, sender and receiver maintain identical representations of the code; in this case message lists which are updated at each transmission, using the move-to-front heuristic. These lists are initially empty. When message a(t) is transmitted, if a(t) is on the sender's list, he transmits its current position. He then updates his list by moving a(t) to position 1 and shifting each of the other messages down one position. The receiver similarly alters his word list. If a(t) is being transmitted for the first time, then k+1 is the "position" transmitted, where k is the number of distinct messages transmitted so far. Some representation of the message itself must be transmitted as well, but just this first time. Again, a(t) is moved to position one by both sender and receiver subsequent to its transmission. For the ensemble "abcadeabfd", the transmission would be 1 a 2 b 3 c 3 4 d 5 e 3 5 6 f 5. (for ease of presentation, list positions are represented in base ten).
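A minimal sketch of the sender's side of this scheme follows (Python; integer positions and the first-time transmission of the message itself are shown, while the Elias coding of the positions discussed below is omitted). It reproduces the transmission for the ensemble abcadeabfd:

    # Minimal sketch of the sender's side of algorithm BSTW: transmit the current
    # list position of each message (k+1 and the message itself on a first
    # occurrence), then move the message to the front of the list.

    def bstw_encode(ensemble):
        word_list, transmission = [], []
        for message in ensemble:
            if message in word_list:
                position = word_list.index(message) + 1          # positions are 1-based
                transmission.append(position)
                word_list.remove(message)
            else:
                transmission.extend([len(word_list) + 1, message])   # "position" k+1, then the message
                # (the message itself is sent only this first time)
            word_list.insert(0, message)                         # move-to-front
        return transmission

    print(bstw_encode("abcadeabfd"))
    # [1, 'a', 2, 'b', 3, 'c', 3, 4, 'd', 5, 'e', 3, 5, 6, 'f', 5]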

As the example shows, algorithm BSTW transmits each source message once; the rest of its transmission consists of encodings of list positions. Therefore, an essential feature of algorithm BSTW is a reasonable scheme for representation of the integers. The methods discussed by Bentley et al. are the Elias codes presented in Section 3.3. The simple scheme, code gamma, involves prefixing the binary representation of the integer i with floor[ lg i ] zeros. This yields a prefix code with the length of the codeword for i equal to 2 floor[ lg i ] + 1. Greater compression can be gained through use of the more sophisticated scheme, delta, which encodes an integer i in 1 + floor[ lg i ] + 2 floor[ lg (1 + floor[ lg i ]) ] bits.

A message ensemble on which algorithm BSTW is particularly efficient, described by Bentley et al., is formed by repeating each of n messages n times, for example 1^n 2^n 3^n ... n^n. Disregarding overhead, a static Huffman code uses n^2 lg n bits (or lg n bits per message), while algorithm BSTW uses n^2 + 2 SUM{ i=1 to n} floor[ lg i ] (which is less than or equal to n^2 + 2 n lg n, or O(1) bits per message). The overhead for algorithm BSTW consists of just the n lg n bits needed to transmit each source letter once. As discussed in Section 3.2, the overhead for static Huffman coding includes an additional 2n bits. The locality present in ensemble EXAMPLE is similar to that in the above example. The transmission effected by algorithm BSTW is: 1 a 1 2 space 3 b 1 1 2 4 c 1 1 1 2 5 d 1 1 1 1 2 6 e 1 1 1 1 1 2 7 f 1 1 1 1 1 1 8 g 1 1 1 1 1 1 1. Using 3 bits for each source letter (a through g and space) and the Elias code delta for list positions, the number of bits used is 81, which is a great improvement over all of the other methods discussed (only 69% of the length used by static Huffman coding). This could be improved further by the use of Fibonacci codes for list positions.

In [Bentley et al. 1986] a proof is given that with the simple scheme for encoding integers, the performance of algorithm BSTW is bounded above by 2 S + 1, where S is the cost of the static Huffman coding scheme. Using the more sophisticated integer encoding scheme, the bound is 1 + S + 2 lg(1 + S). A key idea in the proofs given by Bentley et al. is the fact that, using the move-to-front heuristic, the integer transmitted for a message a(t) will be one more than the number of different words transmitted since the last occurrence of a(t). Bentley et al. also prove that algorithm BSTW is asymptotically optimal.

EMPIRICAL RESULTS

Empirical tests of the efficiencies of the algorithms presented here are reported in [Bentley et al. 1986; Knuth 1985; Schwartz and Kallick 1964; Vitter 1987; Welch 1984]. These experiments compare the number of bits per word required; processing time is not reported. While theoretical considerations bound the performance of the various algorithms, experimental data is invaluable in providing additional insight. It is clear that the performance of each of these methods is dependent upon the characteristics of the source ensemble.

Schwartz and Kallick test an implementation of static Huffman coding in which bottom merging is used to determine codeword lengths and all codewords of a given length are sequential binary numbers [Schwartz and Kallick 1964]. The source alphabet in the experiment consists of 5,114 frequently-used English words, 27 geographical names, 10 numerals, 14 symbols, and 43 suffixes. The entropy of the document is 8.884 binary digits per message and the average codeword constructed has length 8.920. The same document is also coded one character at a time. In this case, the entropy of the source is 4.03 and the coded ensemble contains an average of 4.09 bits per letter. The redundancy is low in both cases. However, the relative redundancy (i.e., redundancy/entropy) is lower when the document is encoded by words.

Knuth describes algorithm FGK's performance on three types of data: a file containing the text of Grimm's first ten Fairy Tales, text of a technical book, and a file of graphical data [Knuth 1985]. For the first two files, the source messages are individual characters and the alphabet size is 128. The same data is coded using pairs of characters, so that the alphabet size is 1968. For the graphical data, the number of source messages is 343. In the case of the Fairy Tales the performance of FGK is very close to optimum, although performance degrades with increasing file size. Performance on the technical book is not as good, but is still respectable. The graphical data proves harder yet to compress, but again FGK performs reasonably well. In the latter two cases, the trend of performance degradation with file size continues. Defining source messages to consist of character pairs results in slightly better compression, but the difference would not appear to justify the increased memory requirement imposed by the larger alphabet.

n      k    Static   Alg. V   Alg. FGK
100    96   83.0     71.1     82.4
500    96   83.0     80.8     83.5
961    97   83.5     82.3     83.7

Figure 6.1 -- Simulation results for a small text file [Vitter 1987];
n = file size in 8-bit bytes, k = number of distinct messages.

Vitter tests the performance of algorithms V and FGK against that of static Huffman coding. Each method is run on data which includes Pascal source code, the TeX source of the author's thesis, and electronic mail files [Vitter 1987]. Figure 6.1 summarizes the results of the experiment for a small file of text. The performance of each algorithm is measured by the number of bits in the coded ensemble and overhead costs are not included. Compression achieved by each algorithm is represented by the size of the file it creates, given as a percentage of the original file size. Figure 6.2 presents data for Pascal source code. For the TeX source, the alphabet consists of 128 individual characters; for the other two file types, no more than 97 characters appear. For each experiment, when the overhead costs are taken into account, algorithm V outperforms static Huffman coding as long as the size of the message ensemble (number of characters) is no more than 10^4. Algorithm FGK displays slightly higher costs, but never more than 100.4% of the static algorithm.

n      k    Static   Alg. V   Alg. FGK
100    32   57.4     56.2     58.9
500    49   61.5     62.2     63.0
1000   57   61.3     61.8     62.4
10000  73   59.8     59.9     60.0
12067  78   59.6     59.8     59.9

Figure 6.2 -- Simulation results for Pascal source code [Vitter 1987];
n = file size in bytes, k = number of distinct messages.

SUSCEPTIBILITY TO ERROR

The discrete noiseless channel is, unfortunately, not a very realistic model of a communication system. Actual data transmission systems are prone to two types of error: phase error, in which a code symbol is lost or gained; and amplitude error, in which a code symbol is corrupted [Neumann 1962]. The degree to which channel errors degrade transmission is an important parameter in the choice of a data compression method. The susceptibility to error of a coding algorithm depends heavily on whether the method is static or adaptive.

Static Codes

It is generally known that Huffman codes tend to be self-correcting [Standish 1980]. That is, a transmission error tends not to propagate too far. The codeword in which the error occurs is incorrectly received and it is likely that several subsequent codewords are misinterpreted but, before too long, the receiver is back in synchronization with the sender. In a static code, synchronization means simply that both sender and receiver identify the beginnings of the codewords in the same way. In Figure 7.1, an example is used to illustrate the ability of a Huffman code to recover from phase errors. The message ensemble "BCDAEB" is encoded using the Huffman code of Figure 3.4 where the source letters a(1) ... a(5) represent A ... E respectively, yielding the coded ensemble "0110100011000011". Figure 7.1 demonstrates the impact of loss of the first bit, the second bit, or the fourth bit. The dots show the way in which each line is parsed into codewords. The loss of the first bit results in re-synchronization after the third bit so that only the first source message (B) is lost (replaced by AA). When the second bit is lost, the first eight bits of the coded ensemble are misinterpreted and synchronization is regained by bit 9. Dropping the fourth bit causes the same degree of disturbance as dropping the second.

011.010.001.1.000.011. coded ensemble BCDAEB

1.1.010.001.1.000.011. bit 1 is lost, interpreted as AACDAEB

010.1.000.1.1.000.011. bit 2 is lost, interpreted as CAEAAEB

011.1.000.1.1.000.011. bit 4 is lost, interpreted as BAEAAEB

(Figure 7.1: Recovery from phase errors)

011.010.001.1.000.011.   coded ensemble BCDAEB

1.1.1.010.001.1.000.011.   bit 1 is inverted, interpreted as AAACDAEB

001.010.001.1.000.011.   bit 2 is inverted, interpreted as DCDAEB

011.1.1.000.1.1.000.011.   bit 4 is inverted, interpreted as BAAEAAEB

(Figure 7.2: Recovery from amplitude errors)

The effect of amplitude errors is demonstrated in Figure 7.2. The format of the illustration is the same as that of Figure 7.1. This time bits 1, 2, and 4 are inverted rather than lost. Again, synchronization is regained almost immediately. When bit 1 or bit 2 is changed, only the first three bits (the first character of the ensemble) are disturbed. Inversion of bit four causes loss of synchronization through the ninth bit. A simple explanation of the self-synchronization present in these examples can be given. Since many of the codewords end in the same sequence of digits, the decoder is likely to reach a leaf of the Huffman code tree at one of the codeword boundaries of the original coded ensemble. When this happens, the decoder is back in synchronization with the encoder.

So that self-synchronization may be discussed more carefully, the following definitions are presented. (It should be noted that these definitions hold for arbitrary prefix codes, so that the discussion includes all of the codes described in Section 3.) If s is a suffix of some codeword and there exist sequences of codewords Gamma and Delta such that s Gamma = Delta, then Gamma is said to be a synchronizing sequence for s. For example, in the Huffman code used above, 1 is a synchronizing sequence for the suffix 01, while both 000001 and 011 are synchronizing sequences for the suffix 10. If every suffix (of every codeword) has a synchronizing sequence, then the code is completely self-synchronizing. If some or none of the proper suffixes have synchronizing sequences, then the code is, respectively, partially or never self-synchronizing. Finally, if there exists a sequence Gamma which is a synchronizing sequence for every suffix, Gamma is defined to be a universal synchronizing sequence. The code used in the examples above is completely self-synchronizing, and has universal synchronizing sequence 00000011000.

Gilbert and Moore prove that the existence of a universal synchronizing sequence is a necessary as well as a sufficient condition for a code to be completely self-synchronizing [Gilbert and Moore 1959]. They also state that any prefix code which is completely self-synchronizing will synchronize itself with probability 1 if the source ensemble consists of successive messages independently chosen with any given set of probabilities. This is true since the probability of occurrence of the universal synchronizing sequence at any given time is positive.
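The decoding process just described can be made concrete with a short sketch, given here in Python with the codeword assignments read off the example above (A = 1, B = 011, C = 010, D = 001, E = 000); the function and variable names are illustrative only. The decoder accumulates bits until a codeword is matched, which is why it falls back into step with the encoder soon after an error:

# Prefix-code decoder for the five-codeword Huffman code of the example above.
CODE = {"A": "1", "B": "011", "C": "010", "D": "001", "E": "000"}
DECODE = {w: s for s, w in CODE.items()}

def decode(bits):
    """Accumulate bits until a codeword is matched, i.e. a leaf of the code tree is reached."""
    out, word = [], ""
    for b in bits:
        word += b
        if word in DECODE:
            out.append(DECODE[word])
            word = ""
    return "".join(out)

coded = "".join(CODE[s] for s in "BCDAEB")        # 0110100011000011
print(decode(coded))                              # BCDAEB
print(decode(coded[1:]))                          # AACDAEB -- bit 1 lost (Figure 7.1)
print(decode(coded[:1] + "0" + coded[2:]))        # DCDAEB  -- bit 2 inverted (Figure 7.2)

In both corrupted cases only the first source message is misread; the remainder of the ensemble is decoded correctly.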

The Elias codes of Section 3.3 are not at all robust. Each of the codes gamma and delta can be thought of as generating codewords which consist of a number of substrings such that each substring encodes the length of the subsequent substring. For code gamma we may think of each codeword gamma(x) as the concatenation of z, a string of n zeros, and b, a string of length n+1 (n = floor[ lg x ] ). If one of the zeros in substring z is lost, synchronization will be lost as the last symbol of b will be pushed into the next codeword.

Since the 1 at the front of substring b delimits the end of z, if a zero in z is changed to a 1, synchronization will be lost as symbols from b are pushed into the following codeword. Similarly, if ones at the front of b are inverted to zeros, synchronization will be lost as the codeword gamma(x) consumes symbols from the following codeword. Once synchronization is lost, it cannot normally be recovered.

In Figure 7.3, codewords gamma(6), gamma(4), and gamma(8) are used to illustrate these ideas. In each case, synchronization is lost and never recovered.

00110.00100.0001000.   coded integers 6, 4, 8

011.0001000.00100.0   bit 2 is lost, interpreted as 3, 8, 4, etc.

011.1.0001000.00100.0   bit 2 is inverted, interpreted as 3, 1, 8, 4, etc.

0001000.1.000001000   bit 3 is inverted, interpreted as 8, 1, etc.

(Figure 7.3: Effects of errors in Elias codes)
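The loss of synchronization in code gamma can be reproduced with the following sketch, a direct Python transcription of the z/b construction described above (for positive integers only; the function names are illustrative and not taken from any cited implementation). Because each run of zeros tells the decoder how many bits to read next, one damaged bit misplaces every subsequent codeword boundary:

def elias_gamma_encode(x):
    """gamma(x) = n zeros followed by the (n+1)-bit binary form of x, with n = floor(lg x)."""
    b = bin(x)[2:]                                # binary representation of x, length n + 1
    return "0" * (len(b) - 1) + b

def elias_gamma_decode(bits):
    out, i = [], 0
    while i < len(bits):
        n = 0
        while i < len(bits) and bits[i] == "0":   # substring z: count the leading zeros
            n += 1
            i += 1
        if i + n + 1 > len(bits):                 # incomplete trailing codeword
            break
        out.append(int(bits[i:i + n + 1], 2))     # substring b: the next n + 1 bits
        i += n + 1
    return out

coded = "".join(elias_gamma_encode(x) for x in (6, 4, 8))   # 00110 00100 0001000
print(elias_gamma_decode(coded))                            # [6, 4, 8]
print(elias_gamma_decode(coded[:1] + "1" + coded[2:]))      # [3, 1, 8, 4] -- bit 2 inverted

The corrupted stream decodes to 3, 1, 8, 4 rather than 6, 4, 8, in agreement with Figure 7.3; synchronization is never recovered.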

Figure 7.4 gives some illustrations based on the Fibonacci coding of ensemble EXAMPLE shown in Figure 3.9. When bit 3 (which is not part of a 11 substring) is lost or changed, only a single codeword is degraded. When bit 6 (the final bit of the first codeword) is lost or changed, the first two codewords are incorrectly decoded. When bit 20 is changed, the first b is incorrectly decoded as fg.

000011.000011.00011.01011.01011.   coded ensemble "aa bb"

00011.000011.00011.01011.01011.   bit 3 is lost, interpreted as " a bb"

001011.000011.00011.01011.01011.   bit 3 is inverted, interpreted as "?a bb"

00001000011.00011.01011.01011.   bit 6 is lost, interpreted as "? bb"

000010000011.00011.01011.01011.   bit 6 is inverted, interpreted as "? bb"

000011.000011.00011.011.11.01011.   bit 20 is inverted, interpreted as "aa fgb"

(Figure 7.4: Effects of errors in Fibonacci codes)
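The comparative robustness of the Fibonacci code comes from the fact that every codeword ends in the substring 11, so codeword boundaries can be located by a simple scan of the bit stream. The sketch below (Python; the symbol assignments a = 000011, space = 00011, b = 01011 are read off Figure 7.4, and the function name is illustrative) shows that dropping bit 3 disturbs only the first codeword:

def split_codewords(bits):
    """Cut the stream after every occurrence of 11, the terminator of each Fibonacci codeword."""
    words, start, i = [], 0, 0
    while i + 1 < len(bits):
        if bits[i] == "1" and bits[i + 1] == "1":
            words.append(bits[start:i + 2])
            start = i = i + 2
        else:
            i += 1
    if start < len(bits):
        words.append(bits[start:])                # incomplete trailing codeword
    return words

coded = "000011" + "000011" + "00011" + "01011" + "01011"   # "aa bb"
print(split_codewords(coded))
# ['000011', '000011', '00011', '01011', '01011']            a  a  space  b  b
print(split_codewords(coded[:2] + coded[3:]))                # bit 3 is lost
# ['00011', '000011', '00011', '01011', '01011']             only the first codeword changes

Only when the damaged bit creates or destroys a 11 pattern, as with bit 6 or bit 20 in Figure 7.4, does more than one codeword suffer.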

Adaptive Codes

Adaptive codes are far more adversely affected by transmission errors than are static codes. For example, in the case of an adaptive Huffman code, even though the receiver may re-synchronize with the sender in terms of correctly locating the beginning of a codeword, the information lost represents more than a few bits or a few characters of the source ensemble. The fact that sender and receiver are dynamically redefining the code indicates that by the time synchronization is regained, they may have radically different representations of the code. Synchronization as defined in Section 7.1 refers to synchronization of the bit stream, which is not sufficient for adaptive methods. What is needed here is code synchronization, that is, synchronization of both the bit stream and the dynamic data structure representing the current code mapping.

There is no evidence that adaptive methods are self-synchronizing. Bentley et al. note that, in algorithm BSTW, loss of synchronization can be catastrophic, whereas this is not true with static Huffman coding [Bentley et al. 1986]. Ziv and Lempel recognize that the major drawback of their algorithm is its susceptibility to error propagation [Ziv and Lempel 1977]. Welch also considers the problem of error tolerance of Lempel-Ziv codes and suggests that the entire ensemble be embedded in an error-detecting code [Welch 1984]. Neither static nor adaptive arithmetic coding has the ability to tolerate errors.

NEW DIRECTIONS

Data compression is still very much an active research area. This section suggests possibilities for further study.

The discussion of Section 7 illustrates the susceptibility to error of the codes presented in this survey. Strategies for increasing the reliability of these codes while incurring only a moderate loss of efficiency would be of great value. This area appears to be largely unexplored. Possible approaches include embedding the entire ensemble in an error-correcting code or reserving one or more codewords to act as error flags. For adaptive methods it may be necessary for receiver and sender to verify the current code mapping periodically.

For adaptive Huffman coding, Gallager suggests an "aging" scheme, whereby recent occurrences of a character contribute more to its frequency count than do earlier occurrences [Gallager 1978]. This strategy introduces the notion of locality into the adaptive Huffman scheme. Cormack and Horspool describe an algorithm for approximating exponential aging [Cormack and Horspool 1984]. However, the effectiveness of this algorithm has not been established.
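One simple way to approximate such an aging scheme is to rescale every frequency count at fixed intervals, so that occurrences seen before the last rescaling carry less weight than recent ones. The Python sketch below illustrates the general idea only; it is not the scheme of [Gallager 1978] or of [Cormack and Horspool 1984], and the interval and scaling factor are arbitrary choices:

AGING_INTERVAL = 1024        # rescale after this many source messages (arbitrary)

def age_counts(freq):
    """Halve every count, keeping each at least 1 so that no seen message disappears."""
    for msg in freq:
        freq[msg] = max(1, freq[msg] // 2)

def count_occurrence(freq, msg, messages_seen):
    """Record one occurrence of msg and age the counts periodically."""
    freq[msg] = freq.get(msg, 0) + 1
    if messages_seen % AGING_INTERVAL == 0:
        age_counts(freq)

After each rescaling, a new occurrence contributes roughly twice as much to a message's count as an occurrence processed before the rescaling, so the code adapts more quickly to changes in the local characteristics of the source.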

Both Knuth and Bentley et al. suggest the possibility of using the "cache" concept to exploit locality and minimize the effect of anomalous source messages. Preliminary empirical results indicate that this may be helpful [Knuth 1985; Bentley et al. 1986]. A problem related to the use of a cache is the overhead time required for deletion. Strategies for reducing the cost of a deletion could be considered. Another possible extension to algorithm BSTW is to investigate other locality heuristics. Bentley et al. prove that intermittent-move-to-front (move-to-front after every k occurrences) is as effective as move-to-front [Bentley et al. 1986]. It should be noted that there are many other self-organizing methods yet to be considered. Horspool and Cormack describe experimental results which imply that the transpose heuristic performs as well as move-to-front, and suggest that it is also easier to implement [Horspool and Cormack 1987].
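The list-update rules behind these two heuristics are simple enough to state directly. The Python sketch below (with illustrative function names) gives both; in a BSTW-style scheme the value transmitted for a source message is its current position in the list, so either rule assigns short codewords to recently used messages:

def move_to_front(lst, i):
    """Move-to-front: promote the item just accessed at index i to the head of the list."""
    lst.insert(0, lst.pop(i))

def transpose(lst, i):
    """Transpose: exchange the item just accessed at index i with its predecessor."""
    if i > 0:
        lst[i - 1], lst[i] = lst[i], lst[i - 1]

words = ["the", "of", "and", "to"]
move_to_front(words, 2)       # ["and", "the", "of", "to"]
transpose(words, 3)           # ["and", "the", "to", "of"]

Transpose moves an accessed item up by only one position per access, so it adapts more slowly than move-to-front but requires only a single exchange, which is consistent with the observation that it is easier to implement.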

SUMMARY

Data compression is a topic of much importance and many applications. Methods of data compression have been studied for almost four decades. This paper has provided an overview of data compression methods of general utility. The algorithms have been evaluated in terms of the amount of compression they provide, algorithm efficiency, and susceptibility to error. While algorithm efficiency and susceptibility to error are relatively independent of the characteristics of the source ensemble, the amount of compression achieved depends upon the characteristics of the source to a great extent.

Semantic dependent data compression techniques, as discussed in Section 2, are special-purpose methods designed to exploit local redundancy or context information. A semantic dependent scheme can usually be viewed as a special case of one or more general-purpose algorithms. It should also be noted that algorithm BSTW is a general-purpose technique which exploits locality of reference, a type of local redundancy.

Susceptibility to error is the main drawback of each of the algorithms presented here. Although channel errors are more devastating to adaptive algorithms than to static ones, it is possible for an error to propagate without limit even in the static case. Methods of limiting the effect of an error on the effectiveness of a data compression algorithm should be investigated.

