Chapter 4.8: Huffman Codes
The problem in this section has to do with bits. Bits are sequences of 0s and 1s that computers can understand. We need an encoding scheme that takes human language and converts it into bits. The easiest way to do this is to use a fixed number of bits for each symbol in the alphabet, and then just concatenate the bit strings for each symbol to form text. However, at the same time, we want to minimize the number of bits needed, check to make sure no bit is a prefix for another letter, and give the most used letters the smallest number of bits.
An optimal prefix code is one that minimizes the average number of bits per letter. This is the frequency of the letter times the length of the sequence of bits.
The Algorithm
Suppose we take a rooted tree T in which each node that is not a leaf has at most two children. The number of leaves is equal to the size of the alphabet S and we label each leaf with a distinct letter in S. For each letter x in S, we follow a path from the root to the leaf labeled x. Each time the path goes from a node to its left child, we write down a 0, and each time the path goes from a node to its right child, we write down a 1. The resulting string of bits is the encoding of x. The binary tree corresponding to the optimal prefix code is full, so each internal node has exactly 2 children.
The algorithm is as follows:
- If S has two letters then encode one letter using 0 and the other using 1
- Else:
- Let y* and z* be the two lowest frequency letters
- form a new alphabet S' by deleting y* and z* and replacing them with a new letter ω of frequency f_y*+f_z*
- recursively construct a prefix code γ' for S', with tree T'
- Define a prefix code for S as follows:
- Start with T'
- Take a leaf labeled ω and add two children below it labeled y* and z*
Note: The algorithm from class involving metaletters probably makes more sense (even though its the same as this algorithm)
The proof that this algorithm is optimal is crazy long so just trust that it's true!
Using a priority queue, each insertion and extraction of the minimum frequencies can be done in O(logn) and since we do this for each letter in the alphabet (n total), the total running time is O(nlogn).
I would rate this section a 7. I understood it, but it's still a pretty tricky topic.