An early method for text compression, outlined in our
January 1982
issue:
An Effective Text-Compression Algorithm
by David Cortesi
It is often desirable to be able to
compress
data: to encode it in a shorter form than normal so that it takes up less storage space. In a recent case, I found it essential. I was constructing a word-processing system based on a computer that had only 4096 bytes of memory. Into that tiny space, I had to cram the program as well as words for it to process.
Choice of com
pression algorithm is dictated by the data characteristics and the amount of space and running time tolerable in the compressing and decompressing routines. In this case, the data was general English text, which is probably the least compressible of any. The compression routines had to be small and simple, but not necessarily fast.
After some figuring, I came up with an algorithm that was fairly simple to implement, quick in execution, and effective. It can usually squeeze text to 75% of its original size. While it may have been written before, the algorithm was new to me. Anyone who needs to compress general text may find it useful, too.
The branch of mathematics called information theory says that data is compressible in so far as it is predictable. That is, the minimum number of bits needed to convey a particular message (using message to mean a piece of information) depends on how many unique messages might be sent. At one limit, if only two messages are sent or stored, then only one bit is ne
eded to encode them. Paul Revere's warning signal in the tower of the Old North Church could have been such a system: "0=land, 1=sea." (Historically, the famous signal was, of course, "one if by land, two if by sea.") At the other extreme, if absolutely any message at all might be sent, then an infinite number of bits would be needed to encode any single message uniquely.
Ordinary data falls somewhere between those theoretical limits, usually much closer to the one-bit end than to the other. For any list of practical messages, a theoretical minimum number of bits is needed to represent any one message. Often, the number of bits actually used to store information is larger than the theoretical minimum. The excess bits are
redundant
. The aim of data compression is to remove as much redundancy as possible.
Character data encoded in the ASCII (American Standard Code for Information Interchange) format constitutes a set of 128 possible messages. Any of the 128 pieces of information can be enco
ded in 7 bits, as a binary number between 0 and 127. Microprocessors designed around an 8-bit word store ASCII characters one per word, for convenience. The inconvenient alternative is to store one and one-seventh characters per word, which would complicate programs considerably. This convenience is bought at a cost of 12% redundancy (1 redundant bit in 8).
Any one collection of data may have even more redundancy. A program in BASIC uses only the uppercase letters, digits, and limited punctuation; fewer than 64 unique characters. The BASIC vocabulary of possible messages could be represented in a code of just 6 bits per character. It's feasible to write a program that would compress a BASIC source file so that every 3-byte group expresses four 6-bit letters. This compression is achieved by predicting and encoding for a smaller vocabulary of messages in the data.
Another type of compression requires knowledge of another kind of predictable characteristic: the statistical distribution of messages in
the data. If you could confidently predict that, for example, 50% of all the characters in a file were the letter Z, you could arrange an encoding based on these rules:
- a 1 bit stands for Z
- a 0 bit says "take the next 7 bits as an ASCII character other than Z"
This would produce a nice compression. Fifty percent of the letters in the file (the Zs) would be stored as single bits; the other 50% as groups of 8 bits. The average number of bits used to store a character would be 4.5. This scheme can be generalized by adding more rules, until every
n
th-commonest letter is encoded in exactly by
n
bits (i.e., the most common character is encoded in 1 bit, the second most common is encoded in 2 bits, and so on).
Two things are wrong with this scheme and its generalized variations. It isn't effective unless each character is stored as a variable number of bits, without regard to the word size of the processor. This usually makes the compression and decompression proce
sses complex and slow. Second, it won't work at all if the prediction of letter frequencies is wrong. If the two rules above are applied to a file that contains no Zs, then all letters will fall under the second rule and be stored as 8 bits, one more than necessary. In general, if the data is not as predicted, this algorithm will expand it instead of compressing it. The more rules in the algorithm, the more predictions the computer makes about the data, and the greater the error when the predictions are wrong.
Let's try another approach to compression and accept that it's a practical necessity to respect the machine's 8-bit word boundaries. Each word can represent any one of 256 messages. Is there a way to make full use of all 256 messages? If so, we would eliminate at least the basic 12% redundancy. If some of the new messages can be made to stand for groups of the old ones (the ASCII characters) then even more redundancy would be eliminated.
A word of caution. The computer makers already may hav
e made assumptions about that "unused" eighth bit in a character byte (the most significant bit, usually designated as bit 7). For example, most firmware monitors assume it is a parity bit and clear it to zero when exchanging a byte with a terminal (thus defeating any value it may have had as a parity check, but never mind). Some video boards use the bit to distinguish the normal character set from a set of 128 graphics symbols. Still, if the compressed data is kept only in storage or in a file and is always decompressed for transmission to a peripheral, it's probably safe to use the eighth bit. That gives us an expanded alphabet of 256 characters to play with, 128 of them new and uncommitted.
One common use of these byte values is the implementation of
runlength encoding
. Each of the 128 new characters is interpreted by these rules:
- set bit 7 to 0, then
- take the resulting integer and replicate the byte that follows it that many times
With this algorithm any string of
3 to 130 identical characters can be expressed in just two bytes. The first byte is one of the new characters; it signals a run of identical characters and tells its length. The second byte indicates the repeated character. When the data predictably contains runs of like characters, then runlength encoding compresses very well. Unfortunately, the general English text with which a word processor must deal contains almost no runs of characters.
I hit upon the idea of using the extra 128 byte values to represent pairs of letters, or
digraphs
. Putting a pair of letters in a single byte will certainly result in compression, but the expanded alphabet will only accommodate 128 unique pairs over and above the standard ASCII characters. To result in compression, the pairs that are encoded must be the pairs that can be predicted to occur the most frequently. Another requirement is that it must be very easy to identify a compressible pair, so that the compression code can be simple.
Cryptographers
have compiled lists of the frequency of use of digraphs in English. It would be possible to include a table of the 128 most frequent digraphs in the compression routine. But that would require 256 bytes of precious space and entail a lengthy search over the list for every pair of candidate letters.
Cryptographers and printers have long known the sequence "etaoinshrdlu" as the frequency order of the twelve most common letters in English. The same letters are the most common in all the Romance languages, although the order varies. Here is one prediction that can be made with confidence about any sample of text. Inside a computer, the blank space is a letter on par with the others; probably the most frequent one of all, so it should be added to the head of the list.
I reasoned that if these are the most common individual letters, then pairs of letters from that list will be common; not necessarily the most common, but frequent enough to result in compression. That has proved to be the case. The b
asic notion of the algorithm is to find adjacent pairs of letters in which both letters are on the list of the most frequently occurring letters and make digraph bytes of those pairs.
I chose the following organization for a digraph byte: 1aaaabbb. Bit 7 is set to 1 to signal a digraph. The next four bits, aaaa, represent a binary number in the range 0...12 and stand for the first letter of the pair. The least significant three bits, bbb, are a number in the range 0...7 and stand for the second letter of the pair. This sort of bit manipulation is usually difficult and always obscure in a highlevel language. In machine language, it is easy to partition a single byte into two or more groups. Notice that it isn't possible to include two 4-bit numbers plus a flag bit in 1 byte. The digraphs that can be encoded in this way are the 104 pairs whose first character is one of the thirteen letters "(space)etaoinshrdlu" and whose second member is one of the shorter list of eight letters "(space)etaoins." A side be
nefit of this encoding is that, because the bits marked "aaaa" won't be used for a number larger than 12, it will never form a byte of the binary form "1111xxxx." The 16 byte values of this form could be used to implement run-length encoding for runs of 3 to 18 characters if that were desired.
I had to implement the algorithm in a tedious manner: by hand-assembling the machine-language instructions and typing them as hexadecimal numbers. This process is likely to produce both typographical and logic errors. To minimize the chance of logic errors, I first wrote the algorithm in a pseudocode, which is a program written in a precise way but not necessarily in any real programming language. Since the pseudocode program will never be read by a machine, one is free to use any kind of notation that will make the meaning clear.
For this project, I carried the pseudocode to a very fine level of detail so that I could translate it directly into machine instructions (
see listing 1
). M
ost of its conventions are those of Pascal, loosened and simplified. The notation @pointer is a concession to the needs of machine-language programming; it means "the byte addressed by pointer."
Procedure COMPRESS is called to compress a single line of characters; the line is terminated by some special character such as a carriage-return. It inspects the line from left to right. If a character is not in the list of thirteen common letters, it is simply copied to the output string; if the copied byte is the end-marker, then the procedure is completed.
When a specific letter is found in the list of thirteen common letters, the next character is tested against the first eight letters of the same list. If it, too, is found, the indices corresponding to the two letters are combined into a single byte and the combined byte is stored.
Function MEMBER tests a character for membership in the list of frequent letters. When it finds the letter in the list, it returns the letter's index in the list, cou
nting from zero. Such origin-zero indices are more convenient to use at the level of machine language. If the character does not appear in the list, MEMBER returns a failure signal.
Procedure DECOMPRESS expands a line that had been processed by COMPRESS. Ordinary characters are just copied to its output. Digraph bytes are split up and the indices they contain are used to find the letters of the pair in the list of common letters.
Figure 1
illustrates the effect of the compression algorithm on a sample of data. The algorithm has proven quite effective. In fact, it is part of the micro word processor used to type this article. Of its 4096 bytes, about 2700 are available for data storage. Compression makes this the equivalent of about 3300 bytes, which is ample room for a typical letter or manuscript page.
The compression code itself occupies fewer than 150 bytes, and the processing overhead it adds is not perceptible in the program's response. I hope the algorithm wil
l work as well in someone else's program as it worked in mine.
Listing 1: Text-compression algorithms as described in the text,
written in a loosely structured pseudocode based on Pascal. The
notation @pointer means "the byte addressed by pointer."
procedure COMPRESS( ADIN: points to the input;
ADOUT: points to the output)
local bytes THIS, THAT,
local numbers FIRST, SECOND.
REPEAT
BEGIN
THIS := @ADIN (pick up next character)
FIRST := MEMBER(THIS,13)
IF ( FIRST ~ 13 ) THEN (THIS is in the long list)
BEGIN
THAT := @(ADIN+1) (check the next byte)
SECOND := MEMBER(THAT,8)
IF ( SECOND / 8) THEN (THAT is in short list)
BEGIN (build a digraph)
THIS := a digraph made from FIRST & SECOND
ADIN := ADIN+1
END
ENDIF
END
ENDIF
@ADOUT := THIS (store byte or digraph)
ADOUT := ADOUT+1 (and bump the pointers)
ADIN := ADIN+l
END
UNTIL ( THIS = string-end-marker byte)
END COMPRESS
function MEMBER( LETTER: a byte; LISTSIZE: a number)
RETURNS a number
(this function returns the origin-zero index of LETTER in
TABLE if it is there, or a failure signal if it is not.
For clarity the signal is shown as a too-high index, but
it could be anything, e.g. setting the carry flag.)
local pointer P, local number T.
P := address of TABLE (point to " etaoinshrdlu")
T := LISTSIZE
REPEAT
BEGIN
IF ( LETTER = @P ) THEN GOTO FOUND
P := P+1
T := T-1
END
UNTIL ( T=O )
RETURN LISTSIZE (indicate failure)
FOUND: (LETTER is in the first LISTSIZE elements of TABLE;
at this point T is in the
range LISTSIZE. .1)
RETURN LISTSIZE-T (..origin-zero index)
END MEMBER
procedure DECOMPRESS( ADCOMP: points to the compressed input;
ADNORM: points to the output)
local bytes THIS, THAT,
local number T.
REPEAT
BEGIN
THIS := @ADCOMP
IF ( Bit 7 of THIS is a 1 ) THEN
BEGIN
T := extracted bits "aaaa" of THIS
@ADNORM := TABLE[T]
ADNORM := ADNORM+1
T := extracted bits "bbb" of THIS
THIS := TABLE[T]
END
ENDIF
@ADNORM := THIS (store 2nd or only character)
ADNORM := ADNORM+l
ADCOMP := ADCOMP+1
END
UNTIL ( THIS = string-end-marker byte )
END DECOMPRESS
Figure 1: Effect of compression on a sample text, from Twelfth Night.
Each parenthesized pair of characters would be stored as a single
byte. There are 339 characters in the
sample; 100 pairs are formed
for a space saving of 29%.
Make me a willow cabin at your gate,
And call upon my soul within the house;
Write loyal cantons of contemned love
And sing them loud even in the dead of night;
Halloo your name to the reverberate hills,
And make the babbling gossip of the air
Cry out "Olivia!" O, you should not rest
Between the elements of air and earth,
But you should pity me!
Mak(e )m(e ) (a )wil(lo)w cab(in) ( a) (t )you(r )g(at)e,
An(d )cal(l )up(on) my( s)ou(l )w(it) (hi) (n )t(he) (ho) (us)e;
W(ri) (te) (lo)ya(l )c(an) (to) (ns) ( o)f c(on) (te)m(ne) (d ) (lo)ve
An(d ) (si)ng( t) (he)m (lo)u(d )ev(en)
( i) (n )t(he) (de)a(d )of( n)ig(ht);
Hal(lo) (o )you(r ) (na)m(e ) (to) ( t) (he)
(re) verbe (ra) (te) (hi) 1 (ls) ,
An(d )mak(e )t(he) babb(li)ng g(os) (si)p( o)f( t) (he) ( a)ir
Cry( o) (ut) "O(li)v(ia) !" 0, yo(u )s(ho)ul(d ) (no) (t ) (re) (St)
B(et)w(ee) (n )t(he) ( e) (le)m(en) (ts) ( o)f
( a)i(r ) (an) (d ) (ea) (rt)h,
B(ut) yo(u )s(ho)ul(d )p(it)y me!
photo_link (84 Kbytes)
