Adaptive Huffman Coding Help Menu
Adaptive Huffman Coding Algorithm
Statistical methods use variable-size codes, with shorter codes assigned to symbols or groups of symbols that appear more often in the data (that is, have a higher probability of occurrence). Regular Huffman coding, which assigns longer codewords to less probable source symbols, requires knowledge of the probabilities of the source symbols. If the probability distribution of the source symbols is not known in advance, we must either estimate it in a preliminary pass over the data (a two-pass procedure) or build the code adaptively as the data is transmitted, which is the approach described here.
In the adaptive Huffman coding procedure, neither the transmitter nor the receiver knows anything about the statistics of the source sequence at the start of transmission. The tree at both ends is a one-leaf Huffman tree labeled NYT (not yet transmitted). It grows via an update procedure and is updated with the transmission of every single source symbol. The update procedure consists of adding new symbols to the tree and adjusting the weights and numbers of the nodes accordingly. Its purpose is to preserve the sibling property while updating the tree to reflect the latest estimates of the frequencies of occurrence of the symbols. The procedure is identical at the transmitter and the receiver, so the two trees remain synchronized throughout.
Note that the update procedure may switch two subtrees. It consults the block to which the current node belongs; a block is the collection of nodes that have the same weight. The current node is switched with the highest-numbered node in its block, provided that node is not the parent of the node being updated. A sketch of the update procedure is given below.
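As an illustration, here is a minimal Python sketch of a tree with the update procedure described above. It is not the code used in the package; the class and method names (AdaptiveHuffmanTree, update, and so on) are made up for this example, and node numbers are represented implicitly by position in a list ordered from the highest-numbered node (the root) downward.

    class Node:
        def __init__(self, weight=0, symbol=None, parent=None):
            self.weight = weight
            self.symbol = symbol      # None for internal nodes and for NYT
            self.parent = parent
            self.left = None
            self.right = None

    class AdaptiveHuffmanTree:
        def __init__(self):
            self.nyt = Node()         # the initial one-leaf tree labeled NYT
            self.root = self.nyt
            self.order = [self.root]  # nodes listed from highest node number down
            self.leaves = {}          # symbol -> leaf node

        def _block_leader(self, node):
            # Highest-numbered node in the block of 'node' (same weight).
            i = self.order.index(node)
            leader = node
            for j in range(i - 1, -1, -1):
                if self.order[j].weight != node.weight:
                    break
                leader = self.order[j]
            return leader

        def _swap(self, a, b):
            # Switch the two subtrees and exchange their node numbers.
            ia, ib = self.order.index(a), self.order.index(b)
            self.order[ia], self.order[ib] = b, a
            pa, pb = a.parent, b.parent
            a_was_left = (pa.left is a)
            b_was_left = (pb.left is b)
            if a_was_left: pa.left = b
            else:          pa.right = b
            if b_was_left: pb.left = a
            else:          pb.right = a
            a.parent, b.parent = pb, pa

        def update(self, symbol):
            if symbol not in self.leaves:
                # First occurrence: NYT gives birth to a new NYT node and a
                # leaf for the symbol; the leaf gets the higher node number.
                old_nyt = self.nyt
                old_nyt.left = Node(parent=old_nyt)                           # new NYT
                old_nyt.right = Node(weight=1, symbol=symbol, parent=old_nyt) # new leaf
                self.nyt = old_nyt.left
                self.leaves[symbol] = old_nyt.right
                self.order.extend([old_nyt.right, old_nyt.left])
                old_nyt.weight += 1
                node = old_nyt.parent
            else:
                node = self.leaves[symbol]
            # Walk up to the root, switching with the block leader when needed.
            while node is not None:
                leader = self._block_leader(node)
                if leader is not node and leader is not node.parent:
                    self._swap(node, leader)
                node.weight += 1
                node = node.parent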
The Sibling Property
A binary tree has the sibling property if every node except the root has a sibling, and the nodes can be listed in order of nondecreasing weight with each node adjacent to its sibling and every parent appearing in the list after its children. Huffman trees are exactly the trees with this property, so by preserving it the update procedure keeps the tree a valid Huffman tree; the node numbers mentioned above provide this ordering.
Before transmission, a fixed encoding (also known as the uncompressed code) of the symbols is agreed upon between the transmitter and receiver. The uncompressed code of the symbols can be their ASCII code or some other code. It often consists of assigning codes of two different lengths.
In the package, the following simple code is used:
If the source has an alphabet (a_1, a_2, ..., a_m) of size m, then pick e and R such that m = 2^e + R and 0 <= R < 2^e. The letter a_k is encoded as the (e+1)-bit binary representation of k-1 if 1 <= k <= 2R; otherwise, a_k is encoded as the e-bit binary representation of k-R-1. In the package, the source symbols are the 26 lowercase letters of the alphabet and the blank character. Using this model to form the uncompressed code: m = 27, so e = 4 and R = 11, since 27 = 2^4 + 11. To encode a symbol, we check whether its index k is less than or equal to 2R = 22. The (4+1)-bit encoding of a (i.e., a_1) is 00000. The symbol b (i.e., a_2) is encoded as 00001. q = a_17, so the (4+1)-bit encoding of q is 10000 (i.e., the 5-bit representation of 17-1). On the other hand, w = a_23 and k > 22, so the 4-bit encoding of w is 1011, which is the 4-bit representation of 23-11-1 = 11.
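For concreteness, here is a small Python sketch of this fixed code. The function name fixed_code and the alphabet string are illustrative choices for this example, not part of the package.

    ALPHABET = "abcdefghijklmnopqrstuvwxyz "      # a_1 .. a_27 (blank is a_27)

    def fixed_code(symbol, alphabet=ALPHABET):
        m = len(alphabet)                 # m = 27
        e = m.bit_length() - 1            # largest e with 2^e <= m, so e = 4
        R = m - (1 << e)                  # 27 = 2^4 + 11, so R = 11
        k = alphabet.index(symbol) + 1    # 1-based index of the symbol
        if k <= 2 * R:
            return format(k - 1, "0%db" % (e + 1))   # (e+1)-bit code of k-1
        return format(k - R - 1, "0%db" % e)         # e-bit code of k-R-1

    # Reproduces the examples above:
    # fixed_code('a') == '00000', fixed_code('b') == '00001',
    # fixed_code('q') == '10000', fixed_code('w') == '1011'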
To transmit character a_k: if a_k is being sent for the first time, the code for the NYT node is transmitted, followed by the fixed (uncompressed) code of a_k; otherwise, the codeword for a_k is read off the tree by traversing it from the root to the leaf corresponding to a_k. In both cases, transmitter and receiver then run the update procedure so that their trees remain identical. A sketch of this step is given below.
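Continuing the illustrative sketches above (they assume the hypothetical AdaptiveHuffmanTree and fixed_code defined earlier, not the package's actual code), transmitting one symbol might look like this:

    def path_code(node):
        # Read the codeword off the tree: 0 for a left branch, 1 for a right branch.
        bits = ""
        while node.parent is not None:
            bits = ("0" if node.parent.left is node else "1") + bits
            node = node.parent
        return bits

    def transmit(tree, symbol):
        if symbol in tree.leaves:
            code = path_code(tree.leaves[symbol])            # code for a_k from the tree
        else:
            code = path_code(tree.nyt) + fixed_code(symbol)  # NYT code + uncompressed code
        tree.update(symbol)   # both transmitter and receiver perform the same update
        return code

    # Example: encoding a string symbol by symbol
    # tree = AdaptiveHuffmanTree()
    # bits = "".join(transmit(tree, c) for c in "add a dad")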
Please send comments and suggestions about AHuffman to khuri@mathcs.sjsu.edu or hsu0832@sundance.sjsu.edu