Archives
 
 
 
  Special
 
 
 
  About Us
 
 
 

Newsletter
Free E-mail Newsletter from BYTE.com

 
    
           
Visit the home page Browse the four-year online archive Download platform-neutral CPU/FPU benchmarks Find information for advertisers, authors, vendors, subscribers Request free information on products written about or advertised in BYTE Submit a press release, or scan recent announcements Talk with BYTE's staff and readers about products and technologies

Articles15 Years Ago in BYTE


January 1997 / Blasts From The Past / 15 Years Ago in BYTE

We covered the new IBM PC in depth; we figured it was important enough to merit a second look. Guess we made the right call on that one.


An early method for text compression, outlined in our January 1982 issue:

An Effective Text-Compression Algorithm

by David Cortesi

It is often desirable to be able to compress data: to encode it in a shorter form than normal so that it takes up less storage space. In a recent case, I found it essential. I was constructing a word-processing system based on a computer that had only 4096 bytes of memory. Into that tiny space, I had to cram the program as well as words for it to process.

Choice of com pression algorithm is dictated by the data characteristics and the amount of space and running time tolerable in the compressing and decompressing routines. In this case, the data was general English text, which is probably the least compressible of any. The compression routines had to be small and simple, but not necessarily fast.

After some figuring, I came up with an algorithm that was fairly simple to implement, quick in execution, and effective. It can usually squeeze text to 75% of its original size. While it may have been written before, the algorithm was new to me. Anyone who needs to compress general text may find it useful, too.

The branch of mathematics called information theory says that data is compressible in so far as it is predictable. That is, the minimum number of bits needed to convey a particular message (using message to mean a piece of information) depends on how many unique messages might be sent. At one limit, if only two messages are sent or stored, then only one bit is ne eded to encode them. Paul Revere's warning signal in the tower of the Old North Church could have been such a system: "0=land, 1=sea." (Historically, the famous signal was, of course, "one if by land, two if by sea.") At the other extreme, if absolutely any message at all might be sent, then an infinite number of bits would be needed to encode any single message uniquely.

Ordinary data falls somewhere between those theoretical limits, usually much closer to the one-bit end than to the other. For any list of practical messages, a theoretical minimum number of bits is needed to represent any one message. Often, the number of bits actually used to store information is larger than the theoretical minimum. The excess bits are redundant . The aim of data compression is to remove as much redundancy as possible.

Character data encoded in the ASCII (American Standard Code for Information Interchange) format constitutes a set of 128 possible messages. Any of the 128 pieces of information can be enco ded in 7 bits, as a binary number between 0 and 127. Microprocessors designed around an 8-bit word store ASCII characters one per word, for convenience. The inconvenient alternative is to store one and one-seventh characters per word, which would complicate programs considerably. This convenience is bought at a cost of 12% redundancy (1 redundant bit in 8).

Any one collection of data may have even more redundancy. A program in BASIC uses only the uppercase letters, digits, and limited punctuation; fewer than 64 unique characters. The BASIC vocabulary of possible messages could be represented in a code of just 6 bits per character. It's feasible to write a program that would compress a BASIC source file so that every 3-byte group expresses four 6-bit letters. This compression is achieved by predicting and encoding for a smaller vocabulary of messages in the data.

Another type of compression requires knowledge of another kind of predictable characteristic: the statistical distribution of messages in the data. If you could confidently predict that, for example, 50% of all the characters in a file were the letter Z, you could arrange an encoding based on these rules:

  • a 1 bit stands for Z
  • a 0 bit says "take the next 7 bits as an ASCII character other than Z"

This would produce a nice compression. Fifty percent of the letters in the file (the Zs) would be stored as single bits; the other 50% as groups of 8 bits. The average number of bits used to store a character would be 4.5. This scheme can be generalized by adding more rules, until every n th-commonest letter is encoded in exactly by n bits (i.e., the most common character is encoded in 1 bit, the second most common is encoded in 2 bits, and so on).

Two things are wrong with this scheme and its generalized variations. It isn't effective unless each character is stored as a variable number of bits, without regard to the word size of the processor. This usually makes the compression and decompression proce sses complex and slow. Second, it won't work at all if the prediction of letter frequencies is wrong. If the two rules above are applied to a file that contains no Zs, then all letters will fall under the second rule and be stored as 8 bits, one more than necessary. In general, if the data is not as predicted, this algorithm will expand it instead of compressing it. The more rules in the algorithm, the more predictions the computer makes about the data, and the greater the error when the predictions are wrong.

Let's try another approach to compression and accept that it's a practical necessity to respect the machine's 8-bit word boundaries. Each word can represent any one of 256 messages. Is there a way to make full use of all 256 messages? If so, we would eliminate at least the basic 12% redundancy. If some of the new messages can be made to stand for groups of the old ones (the ASCII characters) then even more redundancy would be eliminated.

A word of caution. The computer makers already may hav e made assumptions about that "unused" eighth bit in a character byte (the most significant bit, usually designated as bit 7). For example, most firmware monitors assume it is a parity bit and clear it to zero when exchanging a byte with a terminal (thus defeating any value it may have had as a parity check, but never mind). Some video boards use the bit to distinguish the normal character set from a set of 128 graphics symbols. Still, if the compressed data is kept only in storage or in a file and is always decompressed for transmission to a peripheral, it's probably safe to use the eighth bit. That gives us an expanded alphabet of 256 characters to play with, 128 of them new and uncommitted.

One common use of these byte values is the implementation of runlength encoding . Each of the 128 new characters is interpreted by these rules:

  • set bit 7 to 0, then
  • take the resulting integer and replicate the byte that follows it that many times

With this algorithm any string of 3 to 130 identical characters can be expressed in just two bytes. The first byte is one of the new characters; it signals a run of identical characters and tells its length. The second byte indicates the repeated character. When the data predictably contains runs of like characters, then runlength encoding compresses very well. Unfortunately, the general English text with which a word processor must deal contains almost no runs of characters.

I hit upon the idea of using the extra 128 byte values to represent pairs of letters, or digraphs . Putting a pair of letters in a single byte will certainly result in compression, but the expanded alphabet will only accommodate 128 unique pairs over and above the standard ASCII characters. To result in compression, the pairs that are encoded must be the pairs that can be predicted to occur the most frequently. Another requirement is that it must be very easy to identify a compressible pair, so that the compression code can be simple.

Cryptographers have compiled lists of the frequency of use of digraphs in English. It would be possible to include a table of the 128 most frequent digraphs in the compression routine. But that would require 256 bytes of precious space and entail a lengthy search over the list for every pair of candidate letters.

Cryptographers and printers have long known the sequence "etaoinshrdlu" as the frequency order of the twelve most common letters in English. The same letters are the most common in all the Romance languages, although the order varies. Here is one prediction that can be made with confidence about any sample of text. Inside a computer, the blank space is a letter on par with the others; probably the most frequent one of all, so it should be added to the head of the list.

I reasoned that if these are the most common individual letters, then pairs of letters from that list will be common; not necessarily the most common, but frequent enough to result in compression. That has proved to be the case. The b asic notion of the algorithm is to find adjacent pairs of letters in which both letters are on the list of the most frequently occurring letters and make digraph bytes of those pairs.

I chose the following organization for a digraph byte: 1aaaabbb. Bit 7 is set to 1 to signal a digraph. The next four bits, aaaa, represent a binary number in the range 0...12 and stand for the first letter of the pair. The least significant three bits, bbb, are a number in the range 0...7 and stand for the second letter of the pair. This sort of bit manipulation is usually difficult and always obscure in a highlevel language. In machine language, it is easy to partition a single byte into two or more groups. Notice that it isn't possible to include two 4-bit numbers plus a flag bit in 1 byte. The digraphs that can be encoded in this way are the 104 pairs whose first character is one of the thirteen letters "(space)etaoinshrdlu" and whose second member is one of the shorter list of eight letters "(space)etaoins." A side be nefit of this encoding is that, because the bits marked "aaaa" won't be used for a number larger than 12, it will never form a byte of the binary form "1111xxxx." The 16 byte values of this form could be used to implement run-length encoding for runs of 3 to 18 characters if that were desired.

I had to implement the algorithm in a tedious manner: by hand-assembling the machine-language instructions and typing them as hexadecimal numbers. This process is likely to produce both typographical and logic errors. To minimize the chance of logic errors, I first wrote the algorithm in a pseudocode, which is a program written in a precise way but not necessarily in any real programming language. Since the pseudocode program will never be read by a machine, one is free to use any kind of notation that will make the meaning clear.

For this project, I carried the pseudocode to a very fine level of detail so that I could translate it directly into machine instructions ( see listing 1 ). M ost of its conventions are those of Pascal, loosened and simplified. The notation @pointer is a concession to the needs of machine-language programming; it means "the byte addressed by pointer."

Procedure COMPRESS is called to compress a single line of characters; the line is terminated by some special character such as a carriage-return. It inspects the line from left to right. If a character is not in the list of thirteen common letters, it is simply copied to the output string; if the copied byte is the end-marker, then the procedure is completed.

When a specific letter is found in the list of thirteen common letters, the next character is tested against the first eight letters of the same list. If it, too, is found, the indices corresponding to the two letters are combined into a single byte and the combined byte is stored.

Function MEMBER tests a character for membership in the list of frequent letters. When it finds the letter in the list, it returns the letter's index in the list, cou nting from zero. Such origin-zero indices are more convenient to use at the level of machine language. If the character does not appear in the list, MEMBER returns a failure signal.

Procedure DECOMPRESS expands a line that had been processed by COMPRESS. Ordinary characters are just copied to its output. Digraph bytes are split up and the indices they contain are used to find the letters of the pair in the list of common letters.

Figure 1 illustrates the effect of the compression algorithm on a sample of data. The algorithm has proven quite effective. In fact, it is part of the micro word processor used to type this article. Of its 4096 bytes, about 2700 are available for data storage. Compression makes this the equivalent of about 3300 bytes, which is ample room for a typical letter or manuscript page.

The compression code itself occupies fewer than 150 bytes, and the processing overhead it adds is not perceptible in the program's response. I hope the algorithm wil l work as well in someone else's program as it worked in mine.


Listing 1

Listing 1: Text-compression algorithms as described in the text,
written in a loosely structured pseudocode based on Pascal. The
notation @pointer means "the byte addressed by pointer."


procedure COMPRESS( ADIN: points to the input;
                     ADOUT: points to the output)
   local bytes THIS, THAT,
   local numbers FIRST,   SECOND.
   REPEAT
      BEGIN
         THIS := @ADIN                    (pick up next character)
         FIRST := MEMBER(THIS,13)
         IF ( FIRST ~ 13 )   THEN         (THIS is in the long list)
            BEGIN
               THAT := @(ADIN+1)          (check the next byte)
               SECOND := MEMBER(THAT,8)
               IF  ( SECOND / 8)  THEN    (THAT is in short list)
                   BEGIN                  (build a digraph)
                      THIS  := a digraph made from FIRST & SECOND
                      ADIN := ADIN+1

                   END
               ENDIF
            END
         ENDIF
         @ADOUT := THIS                   (store byte or digraph)
         ADOUT := ADOUT+1                 (and bump the pointers)
         ADIN := ADIN+l
      END
   UNTIL ( THIS = string-end-marker byte)
END COMPRESS
function MEMBER( LETTER: a byte;   LISTSIZE: a number)
             RETURNS a number

  (this function returns the origin-zero index of LETTER in
   TABLE if it is there, or a failure signal if it is not.
   For clarity the signal is shown as a too-high index, but
   it could be anything, e.g.   setting the carry flag.)

   local pointer P,  local number T.
   P := address of TABLE   (point to " etaoinshrdlu")
   T := LISTSIZE
   REPEAT
      BEGIN
         IF ( LETTER = @P )  THEN GOTO FOUND
         P := P+1
         T := T-1
      END
   UNTIL ( T=O )
   RETURN LISTSIZE    (indicate failure)
FOUND:  (LETTER is in the first LISTSIZE elements of TABLE;
         at this point T is in the
 range LISTSIZE. .1)
   RETURN LISTSIZE-T    (..origin-zero index)
END MEMBER

procedure DECOMPRESS( ADCOMP: points to the compressed input;
                      ADNORM: points to the output)
   local bytes THIS, THAT,
   local number T.
   REPEAT
      BEGIN
         THIS := @ADCOMP
         IF  ( Bit 7 of THIS is a 1 ) THEN
             BEGIN
               T := extracted bits "aaaa" of THIS
               @ADNORM := TABLE[T]
               ADNORM := ADNORM+1
               T := extracted bits "bbb" of THIS
               THIS := TABLE[T]
             END
         ENDIF
         @ADNORM := THIS  (store 2nd or only character)
         ADNORM := ADNORM+l
         ADCOMP := ADCOMP+1
      END
   UNTIL ( THIS = string-end-marker byte )
END DECOMPRESS




Figure 1

Figure 1: Effect of compression on a sample text, from Twelfth Night.
Each parenthesized pair of characters would be stored as a single
byte. There are 339 characters in the
 sample; 100 pairs are formed
for a space saving of 29%.


Make me a willow cabin at your gate,
And call upon my soul within the house;
Write loyal cantons of contemned love
And sing them loud even in the dead of night;
Halloo your name to the reverberate hills,
And make the babbling gossip of the air
Cry out "Olivia!" O, you should not rest
Between the elements of air and earth,
But you should pity me!


Mak(e )m(e ) (a )wil(lo)w cab(in) ( a) (t )you(r )g(at)e,
An(d )cal(l )up(on)   my( s)ou(l )w(it) (hi) (n )t(he)   (ho) (us)e;
W(ri) (te)  (lo)ya(l )c(an) (to) (ns) ( o)f c(on) (te)m(ne) (d ) (lo)ve
An(d ) (si)ng( t) (he)m  (lo)u(d )ev(en)
                         ( i) (n )t(he) (de)a(d )of( n)ig(ht);
Hal(lo) (o )you(r ) (na)m(e ) (to) ( t) (he)
                         (re) verbe (ra) (te) (hi) 1 (ls) ,
An(d )mak(e )t(he)   babb(li)ng g(os) (si)p( o)f( t) (he) ( a)ir
Cry( o) (ut)  "O(li)v(ia) !" 0, yo(u )s(ho)ul(d ) (no) (t ) (re) (St)
B(et)w(ee) (n )t(he) ( e) (le)m(en) (ts) ( o)f

                          ( a)i(r ) (an) (d ) (ea) (rt)h,
B(ut)  yo(u )s(ho)ul(d )p(it)y me!




January 1982

photo_link (84 Kbytes)


Up to the Blasts From The Past section contentsGo to previous article: Go to next article: 20 Years Ago in BYTESearchSend a comment on this articleSubscribe to BYTE or BYTE on CD-ROM  
Flexible C++
Matthew Wilson
My approach to software engineering is far more pragmatic than it is theoretical--and no language better exemplifies this than C++.

more...

BYTE Digest

BYTE Digest editors every month analyze and evaluate the best articles from Information Week, EE Times, Dr. Dobb's Journal, Network Computing, Sys Admin, and dozens of other CMP publications—bringing you critical news and information about wireless communication, computer security, software development, embedded systems, and more!

Find out more

BYTE.com Store

BYTE CD-ROM
NOW, on one CD-ROM, you can instantly access more than 8 years of BYTE.
 
The Best of BYTE Volume 1: Programming Languages
The Best of BYTE
Volume 1: Programming Languages
In this issue of Best of BYTE, we bring together some of the leading programming language designers and implementors...

Copyright © 2005 CMP Media LLC, Privacy Policy, Your California Privacy rights, Terms of Service
Site comments: webmaster@byte.com
SDMG Web Sites: BYTE.com, C/C++ Users Journal, Dr. Dobb's Journal, MSDN Magazine, New Architect, SD Expo, SD Magazine, Sys Admin, The Perl Journal, UnixReview.com, Windows Developer Network