A linear algorithm for data compression

98. R. P. Brent, A linear algorithm for data compression, Australian Computer Journal 19, 2 (May 1987), 64-68. Presented at the tenth Australian Computer Science Conference, Deakin University, Geelong, February 1987. Preliminary version appeared as Technical Report TR-CS-86-10, Computer Sciences Laboratory, Australian National University, November 1986, 9 pp.

Paper: pdf (796K).

Abstract

We describe an efficient algorithm for data compression. The algorithm finds maximal common substrings in the input data using a simple hashing scheme, and repeated substrings are encoded using Huffman coding. Time and space requirements are proportional to the size of the input data. A modification which uses a bounded buffer is also described. The algorithm is simpler than comparable linear-time algorithms and gives excellent compression ratios on both text and binary files.
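To illustrate the general idea of hash-based match finding (a toy sketch only, not the SLH algorithm from the paper): a table keyed on short substrings can locate an earlier occurrence of the current input in constant expected time, and the match is then extended greedily. The sketch below keeps a single (most recent) position per key, a hypothetical simplification:

```python
def compress(data: bytes, min_match: int = 3):
    """Toy LZ77-style matcher: one stored position per 3-byte key.

    Returns a list of tokens, each either a literal byte or a
    (distance, length) pair for a repeated substring. Illustrative
    sketch only -- not the SLH algorithm described in the paper.
    """
    table = {}   # substring key -> most recent position seen
    tokens = []
    i = 0
    while i < len(data):
        key = data[i:i + min_match]
        j = table.get(key)
        table[key] = i   # overwrite: only the latest position is kept
        if j is not None:
            # Extend the match as far as the data allows (it may
            # overlap the current position, as in LZ77).
            length = 0
            while i + length < len(data) and data[j + length] == data[i + length]:
                length += 1
            if length >= min_match:
                tokens.append((i - j, length))
                i += length
                continue
        tokens.append(data[i])
        i += 1
    return tokens


def decompress(tokens) -> bytes:
    """Invert compress(): replay literals and back-references."""
    out = bytearray()
    for t in tokens:
        if isinstance(t, tuple):
            dist, length = t
            for _ in range(length):
                out.append(out[-dist])
        else:
            out.append(t)
    return bytes(out)
```

Because only one position is stored per key, an entry can be overwritten before it is needed; this is the kind of behaviour behind the counterexamples in Comment 3 below.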

Comments

1. The paper presents some comparisons of an implementation (SLH) of the algorithm with straightforward Huffman coding (HUF), the "move to front" (MTF) algorithm, and one of the Ziv-Lempel algorithms (LZ78). For example, on a typical text file (the Pascal source of the SLH program), compression ratios were 1.75, 2.29, 2.48 and 3.18 for HUF, MTF, LZ78 and SLH respectively.

2. The last phase of SLH uses Huffman coding, but this could (and probably should) be replaced by arithmetic coding (Witten, Neal and Cleary, Communications of the ACM 30 (1987), 520-541). At the time of writing the paper I was not familiar with arithmetic coding.
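For reference, a minimal sketch of Huffman code construction, the kind of coding used in SLH's last phase and the step that arithmetic coding would replace (a self-contained illustration, not the paper's implementation):

```python
import heapq
from collections import Counter


def huffman_codes(data: bytes) -> dict:
    """Build a Huffman code table (byte -> bit string) for data.

    Illustrative sketch: repeatedly merge the two least-frequent
    subtrees, prefixing '0' and '1' to the codes in each.
    """
    freq = Counter(data)
    if len(freq) == 1:
        # Degenerate case: a single distinct symbol gets a 1-bit code.
        return {next(iter(freq)): "0"}
    # Heap entries: (frequency, tiebreak, {symbol: partial code}).
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    n = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        heapq.heappush(heap, (f1 + f2, n, merged))
        n += 1
    return heap[0][2]
```

The resulting code is prefix-free, but each symbol costs an integral number of bits; arithmetic coding avoids that rounding loss, which is why Comment 2 recommends it.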

3. Rod Worley (personal communication, 5 July 1988) pointed out that the algorithm in the paper does not always find a maximal matching substring. For example, with

stringmatchesXmatchestrinGstring

entering the right substrings of the portion "matchest" after the match of "matches" overwrites the initial "st" entry. The final "string" is then compared with "strinG", so we obtain only a "strin" match and not a "string" match. Similarly with

matchesmatchesatch

where the final "atch" is matched with the first "atch" and not the second.

In practice such examples seldom arise and have only a small effect on the compression ratio. However, it would be nice to overcome such problems and obtain a linear-time algorithm that is guaranteed to find the longest matching substrings, without abandoning hashing and reverting to algorithms in the style of Weiner (1973), McCreight (1976), or Rodeh, Pratt and Even (1981) [references are given in the paper].
