4.8 Huffman Codes and Data Compression
The Problem
- Encoding Symbols Using Bits:
- We need schemes that encode text written in richer alphabets and convert it into long strings of bits.
- A first thought would be to represent every symbol with the same number of bits
- However, this is not efficient, since some symbols occur frequently while others are rare.
- Thus symbols should be encoded with lengths that depend on their frequency.
- Data compression algorithms are devoted to representing large files as compactly as possible; that is, they take files as input and reduce the space they occupy through efficient encoding schemes
Variable-Length Encoding Schemes:
- Basically, more frequent letters get shorter encodings than the less frequent ones.
- The goal is to represent symbols with as few bits as possible
Prefix codes:
- It's the encoding of symbols in such a way that no encoding is a prefix of any other
- A Prefix code for a set S of letters is a function γ that maps each letter x∈ S to some sequence of zeros and ones such that:
- For distinct x,y ∈ S, the sequence γ(x) is not a prefix of the sequence γ(y).
Optimal Prefix Codes:
- For each letter x in S, there is a frequency fx, that represents the fraction of letters in the text that are equal to x
- Assuming the text has n letters, roughly n*fx of these letters are equal to x
- The frequencies sum to 1: ∑x ∈ S fx = 1
- |γ(x)|: Length of encoding of x
- ABL = ∑x ∈ S fx*|γ(x)|
- ABL is the Average number of Bits per Letter
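A quick sketch of the ABL computation, with invented frequencies and an invented prefix code:

```python
# Hypothetical frequencies (must sum to 1) and prefix code.
freq = {"a": 0.5, "b": 0.3, "c": 0.2}
gamma = {"a": "0", "b": "10", "c": "11"}

# ABL = sum over x in S of f_x * |gamma(x)|
abl = sum(freq[x] * len(gamma[x]) for x in freq)
print(abl)  # 0.5*1 + 0.3*2 + 0.2*2 = 1.5 bits per letter
```

For comparison, a fixed-length encoding of three letters needs 2 bits per letter, so this variable-length code saves half a bit per letter on average.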
Designing the Algorithm
- The search space of the problem includes all of the possible ways of mapping letters to bit strings
- We need to develop a tree-based means of representing prefix codes that exposes their structure more clearly
Representing Prefix Codes using Binary Trees
- We label each leaf of the binary tree with a distinct letter in the alphabet S
- For each letter x in S, we follow the path from the root to the leaf labeled x:
- Each time the path goes from a node to its left child, we write down a 0
- Each time the path goes from a node to its right child, we write down a 1
- The Encoding of S constructed from T is a prefix code
- The search for an optimal prefix code reduces to the search for a binary tree T, together with a labeling of the leaves of T, that minimizes the ABL.
- Thus the length of the encoding of a letter x in S is simply the length of the path from the root to the leaf labeled x: this length is termed the depth of the leaf, and the depth of a leaf v in T will be denoted by depthT(v).
- So now our goal is to find the labeled tree T that minimizes the weighted average of the depths of all leaves, where the average is weighted by the frequencies of the letters that label the leaves: ABL = ∑x ∈ S fx*depthT(x)
- The binary tree corresponding to the optimal prefix code is full, i.e., every internal node has exactly two children
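The leaf-labeling rule above can be sketched as follows (the tree shape is a hypothetical example, encoded as nested tuples for internal nodes and strings for leaves):

```python
# Hypothetical full binary tree: ((a, b), c).
tree = (("a", "b"), "c")

def codes_from_tree(node, path=""):
    """Walk root-to-leaf paths, writing 0 for a left child, 1 for a right child."""
    if isinstance(node, str):          # leaf: the path so far is its codeword
        return {node: path}
    left, right = node
    out = codes_from_tree(left, path + "0")
    out.update(codes_from_tree(right, path + "1"))
    return out

print(codes_from_tree(tree))  # {'a': '00', 'b': '01', 'c': '1'}
```

The codeword lengths equal the leaf depths, which is why minimizing ABL over trees is the same as minimizing the frequency-weighted average depth.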
Algorithm to Construct an Optimal Prefix Code
To construct a prefix code for an alphabet S, with given frequencies:
If S has 2 letters:
Encode one letter using 0 and the other using 1
Else:
Let y* and z* be the 2 lowest-frequency letters
Form a new alphabet S' by deleting y* and z* and replacing them with
a new letter ω of frequency fy*+ fz*
Recursively construct a prefix code γ' for S', with tree T'
Define a prefix code for S as follows:
Start with T'
Take the leaf labeled ω and add two children below it labeled y* and z*
End if
- The algorithm above is referred to as the Huffman Algorithm and the prefix code it produces is termed the Huffman code
- This algorithm produces an optimal prefix code
- This algorithm also follows a bottom-up approach as seen: It focuses on the leaves representing the two lowest-frequency letters, and then continues by recursion.
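The recursive construction above can be sketched directly in Python (the frequencies below are invented; a tree is represented as a letter or a pair of subtrees, so the new letter ω can simply be the pair (y*, z*) itself, and expanding the leaf labeled ω into two children happens automatically):

```python
def huffman_recursive(freq):
    """Mirror of the recursive construction: freq maps letters to frequencies."""
    if len(freq) == 2:
        x, y = freq                      # base case: one letter gets 0, the other 1
        return (x, y)
    # y*, z*: the two lowest-frequency letters
    y_star, z_star = sorted(freq, key=freq.get)[:2]
    omega = (y_star, z_star)             # the new letter IS the merged subtree
    reduced = {x: f for x, f in freq.items() if x not in (y_star, z_star)}
    reduced[omega] = freq[y_star] + freq[z_star]
    # The recursion places omega as a leaf; since omega is the pair (y*, z*),
    # the leaf already has y* and z* as its two children.
    return huffman_recursive(reduced)

# Hypothetical frequencies for illustration.
tree = huffman_recursive({"a": 0.6, "b": 0.25, "c": 0.1, "d": 0.05})
print(tree)
```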
Analyzing the Algorithm
- The algorithm is optimal
- ABL(T') = ABL(T) - fω
- The Huffman code for a given alphabet achieves the minimum ABL of any prefix code
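The identity ABL(T') = ABL(T) - fω can be checked leaf by leaf: in T, the leaves y* and z* hang one level below the position where ω sits as a leaf in T', and all other leaves have the same depth in both trees, so

```latex
\mathrm{ABL}(T)
  = \sum_{x \neq y^*,\, z^*} f_x \,\mathrm{depth}_T(x)
    + (f_{y^*} + f_{z^*})\bigl(\mathrm{depth}_{T'}(\omega) + 1\bigr)
  = \mathrm{ABL}(T') + f_\omega
```

This identity drives the induction: an optimal tree for S' plus the two-child expansion of ω yields an optimal tree for S.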
Implementation and Running Time
Create a leaf node for each symbol, labeled by its frequency, and add it to a
priority queue (PQ). Takes O(n log n)
While there is more than one node in the queue: Runs n - 1 times
Remove the two nodes of lowest frequency. Takes O(log n)
Create a new internal node with these two nodes as children and
with frequency equal to the sum of the two nodes' frequencies. Takes O(1) time.
Add the new node to the queue. Takes O(log n)
The remaining node is the tree’s root node.
So overall it takes O(n log n).
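The priority-queue version described above can be sketched with Python's heapq (the frequencies are invented; the tie-breaking counter in each heap entry is an implementation detail so that trees are never compared directly):

```python
import heapq

def huffman(freq):
    """Build the Huffman tree bottom-up with a min-heap keyed on frequency."""
    heap = [(f, i, x) for i, (x, f) in enumerate(freq.items())]
    heapq.heapify(heap)                      # leaf node per symbol
    counter = len(heap)
    while len(heap) > 1:                     # n - 1 merges
        f1, _, t1 = heapq.heappop(heap)      # two lowest-frequency nodes
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, counter, (t1, t2)))
        counter += 1
    return heap[0][2]                        # the remaining node is the root

def codewords(node, path=""):
    """Read off the prefix code: left edge = 0, right edge = 1."""
    if isinstance(node, str):
        return {node: path}
    left, right = node
    return {**codewords(left, path + "0"), **codewords(right, path + "1")}

# Hypothetical frequencies for illustration.
code = codewords(huffman({"a": 0.32, "b": 0.25, "c": 0.20, "d": 0.18, "e": 0.05}))
print(code)
```

Each heappop/heappush is O(log n) and the loop runs n - 1 times, matching the O(n log n) bound above.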
Very intriguing section although its proofs were long. I give it an 8/10.