Chapter 4.8: Huffman Codes and Data Compression
Data compression has a fundamental problem in the issue of reducing the average number of bits per letter. When large files need to be shipped across communication networks, or stored on hard disks, it;'s important to represent them as compactly as possible.
Prefix Codes
A prefix code for a set S of letters is a function lambda that maps each letter x that is an element of S to some sequence of zeroes and ones, in such a way that for distinct x, y is an element of S, the sequence lambda(x) is not a prefix of the sequence lambda(y).
In order to find optimal prefix codes, we need some sort of measure - the average number of bits required per letter. We denote this quantity by ABL(l). A prefix code that minimizes the ABL(l) is optimal.
Designing the Algorithm
A greedy method will be used to construct an optimal prefix code very efficiently.
We will use a binary tree to describe a prefix code as follows: each time the path goes from a node to its left child, we write down a 0, and each time the path goes from a node to its right child, we write down a 1. We take the resulting string of bits as the encoding of x.
Huffman's Algorithm is as follows
If S has two letters then Encode one letter using 0 and the other letter using 1 Else Let y* and z* be the two lowest-frequency letters Form a new alphabet Sprime by deleting y* and z* and replacing them with a new letter w of frequency f(y*) + f(z*) Recursively construct a prefix code yprime for Sprime, with tree Tprime. Define a prefix code for S as follows: Start with Tprime Take the leaf labeled w and add two children below it labeled y* and z*.
This algorithm follows a bottom-up approach: it focuses on the leaves representing the two lowest-frequency letters, and then continues by recursion. For any given alphabet it achieves the minimum average number of bits per letter of any prefix code. And it runs in O(klogk) using priority queues to insert and extract in O(logk), k times.
There are quite a few proofs and maths in this section. Study this. 9/10.