Back to Vector Steganography

Articles

Home

Huffman codes for compressing messages

A Huffman code is an optimal uniquely decodable (UD) code.  This means two things:

  1. You can decode a message by looking at each incoming character in sequence.  At each step either the characters you have seen do not yet correspond to any letter, or they correspond to exactly one.
  2. The encoded message is the shortest possible one among all encoding schemes that have property (1).

Let's look at an example.  Suppose you want to encode the message "(C) 2002 Directionsmag.com".  Here is a Huffman code for this message:

Character Frequency Code Tree (see below)
' ' 2 000
'(' 1 10110
')' 1 01101
'.' 1 11111
'0' 2 1110
'2' 2 1100
'C' 1 10011
'D' 1 11010
'a' 1 01110
'c' 2 1010
'e' 1 01111
'g' 1 10010
'i' 2 1000
'm' 2 010
'n' 1 01100
'o' 2 001
'r' 1 11011
's' 1 11110
't' 1 10111

The encoded message is: 10110 10011 01101 000 1100 1110 1110 1100 000 11010 1000 11011 01111 1010 10111 1000 001 01100 11110 010 01110 10010 11111 1010 001 010.  Its length is 110 bits, or about 4.24 bits per character.  The original ASCII representation required eight bits per character, totaling 208 bits.  The compression is almost 50%.

The idea behind this code is evident in the table: characters that appear more frequently are assigned shorter codes.  In order for a string to be uniquely decodable, there must be no ambiguity in the decoding.  Thus, the code for one character must never be the start of a code for another.  For instance, assigning "000" to the space character meant that no other code could begin with "000", as you can verify.

Such a code is called a prefix code.  It corresponds to a tree.  The tree describes the sequence of choices a decoder makes as it processes an encoded message.  In the tree shown above, zeros send the decoder upwards along red branches and ones send it downwards along blue branches.  The decoder thereby works its way from left to right until it encounters a letter, outputs it, and begins all over again at the tree's root.  The unique decoding property of the Huffman code corresponds to the fact that every path through the tree ends at a unique letter.  The shortest paths go to the most frequent letters.

The tree for a binary Huffman code is constructed in a straightforward way.  The letters with the two smallest frequencies are located, combined as leaves at a node, and then that node is treated as if it were a letter having the combined frequencies of its two leaves.  The process is repeated until all letters have been put into the tree.

This algorithm requires one to find the smallest frequency in a potentially long list, to do that repeatedly, and to insert each combined frequency back into the list.  An efficient data structure for accomplishing this is a priority queue, which is easily implemented with arrays or (as I had to do in ArcView) lists.  I used the short elegant algorithm described by Jon Bentley in Programming Pearls, 1987, Addision-Wesley, Reading, Mass.  Using a priority queue, the entire algorithm for constructing the Huffman code is fifteen lines: pretty simple.  Moreover, just for fun I implemented the more complex algorithm for generating a Huffman code in any alphabet, not just binary codes.

The biggest problem in implementing a Huffman code concerns the frequencies.  When you are encoding a single message over and over again, such as a copyright, that's simple.  But what if you are putting a long message into a bunch of features?  Or embedding data into the features themselves (an interesting prospect!).  And what do you do about the frequencies when decoding: you don't have the characters available to count!  In that case, a good solution is to use frequencies typical of the messages you expect to write.  To this end, I implemented routines for analyzing a set of text files for their character frequencies.  The results are recorded in a special frequency file that the Huffman code creator can use both for reading and writing hidden messages.  This frequency file has to be referenced by the user whenever a Huffman code is to be used.

You might have noticed that I distinguish upper from lower case characters.  This is important in some applications.  Interestingly, although ArcView's programming language (Avenue) has been extensively used by thousands of people for almost a decade, my attempt to count character frequencies accurately was stymied by one bug after another in the ArcView's string-handling routines.  Days of careful testing turned up a dozen documented problems in Avenue.  It would be unfair to leap to the conclusion that ArcView is buggy.  Far from it: it is one of the most stable GIS programming environments around.  Evidently, people do not do much string processing in their GIS applications--or maybe they just don't notice the occasional error made by the computer, or perhaps they do but are not able to track it down.  It sure took a lot of effort to get it right.

--William A. Huber, Spring 2002

 

Google
ColorRamp, Memorized Calculations, Rotate, Sample, XSect, and Tissot  are  trademarks of Quantitative Decisions.  All other products mentioned are registered trademarks or trademarks of their respective companies.
Questions or problems regarding this web site should be directed to webmaster@quantdec.com.
Copyright © 2000-2002 Quantitative Decisions.  All rights reserved.
Last modified: Saturday June 08, 2002.