4.8 Huffman Codes and Data Compression
The Problem
- Encoding Symbols Using Bits:
- We need schemes that encode text written in richer alphabets and convert it into long strings of bits.
- A first thought would be to represent every symbol with the same number of bits
- However, this is not efficient, since some symbols occur frequently while others are rare.
- Thus symbols should be encoded with lengths that depend on their frequency.
- Data compression algorithms are devoted to representing large files as compactly as possible; that is, they take files as input and reduce the space they occupy through efficient encoding schemes
Variable-Length Encoding Schemes:
- Basically, more frequent letters get shorter encodings than the less frequent ones.
- The goal is to represent symbols with as few bits as possible
Prefix codes:
- It's the encoding of symbols in such a way that no encoding is a prefix of any other
- A Prefix code for a set S of letters is a function γ that maps each letter x∈ S to some sequence of zeros and ones such that:
- For distinct x,y ∈ S, the sequence γ(x) is not a prefix of the sequence γ(y).
Optimal Prefix Codes:
- For each letter x in S, there is a frequency fx, that represents the fraction of letters in the text that are equal to x
- Assuming the text has n letters, roughly n*fx of these letters are equal to x
- The frequencies sum to 1: ∑x ∈ S fx = 1
- |γ(x)|: Length of encoding of x
- ABL = ∑x ∈ S fx*|γ(x)|
- ABL is the Average number of Bits per Letter
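A quick sketch of the ABL computation, with invented frequencies and an invented prefix code:

```python
# Hypothetical frequencies (must sum to 1) and prefix code.
freq = {"a": 0.5, "b": 0.3, "c": 0.2}
gamma = {"a": "0", "b": "10", "c": "11"}

# ABL = sum over x in S of f_x * |gamma(x)|
abl = sum(freq[x] * len(gamma[x]) for x in freq)
print(abl)  # 0.5*1 + 0.3*2 + 0.2*2 = 1.5 bits per letter
```

For comparison, a fixed-length encoding of three letters needs 2 bits per letter, so this variable-length code saves half a bit per letter on average.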
Designing the Algorithm
- The search space of the problem includes all of the possible ways of mapping letters to bit strings
- We need to develop a tree-based means of representing prefix codes that exposes their structure more clearly
Representing Prefix Codes using Binary Trees
- We label each leaf of the binary tree with a distinct letter in the alphabet S
- For each letter x in S, we follow the path from the root to the leaf labeled x:
- Each time the path goes from a node to its left child, we write down a 0
- Each time the path goes from a node to its right child, we write down a 1
- The Encoding of S constructed from T is a prefix code
- The search for an optimal prefix code reduces to the search for a binary tree T, together with a labeling of the leaves of T, that minimizes the ABL.
- Thus the length of the encoding of a letter x in S is simply the length of the path from the root to the leaf labeled x: this length is termed the depth of the leaf, and the depth of a leaf v in T will be denoted by depthT(v).
- So now our goal is to find the labeled tree T that minimizes the weighted average of the depths of all leaves, where the average is weighted by the frequencies of the letters that label the leaves: ABL = ∑x ∈ S fx*depthT(x)
- The binary tree corresponding to the optimal prefix code is full, i.e., every internal node has exactly two children
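The leaf-labeling rule above can be sketched as follows (the tree shape is a hypothetical example, encoded as nested tuples for internal nodes and strings for leaves):

```python
# Hypothetical full binary tree: ((a, b), c).
tree = (("a", "b"), "c")

def codes_from_tree(node, path=""):
    """Walk root-to-leaf paths, writing 0 for a left child, 1 for a right child."""
    if isinstance(node, str):          # leaf: the path so far is its codeword
        return {node: path}
    left, right = node
    out = codes_from_tree(left, path + "0")
    out.update(codes_from_tree(right, path + "1"))
    return out

print(codes_from_tree(tree))  # {'a': '00', 'b': '01', 'c': '1'}
```

The codeword lengths equal the leaf depths, which is why minimizing ABL over trees is the same as minimizing the frequency-weighted average depth.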
Algorithm to Construct an Optimal Prefix Code
To construct a prefix code for an alphabet S, with given frequencies:
If S has 2 letters:
Encode one letter using 0 and the other using 1
Else:
Let y* and z* be the 2 lowest-frequency letters
Form a new alphabet S' by deleting y* and z* and replacing them with
a new letter ω of frequency fy*+ fz*
Recursively construct a prefix code γ' for S', with tree T'
Define a prefix code for S as follows:
Start with T'
Take the leaf labeled ω and add two children below it labeled y* and z*
End if
- The algorithm above is referred to as the Huffman Algorithm and the prefix code it produces is termed the Huffman code
- This algorithm produces an optimal prefix code
- This algorithm also follows a bottom-up approach as seen: It focuses on the leaves representing the two lowest-frequency letters, and then continues by recursion.
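The recursive construction above can be sketched directly in Python (the frequencies below are invented; a tree is represented as a letter or a pair of subtrees, so the new letter ω can simply be the pair (y*, z*) itself, and expanding the leaf labeled ω into two children happens automatically):

```python
def huffman_recursive(freq):
    """Mirror of the recursive construction: freq maps letters to frequencies."""
    if len(freq) == 2:
        x, y = freq                      # base case: one letter gets 0, the other 1
        return (x, y)
    # y*, z*: the two lowest-frequency letters
    y_star, z_star = sorted(freq, key=freq.get)[:2]
    omega = (y_star, z_star)             # the new letter IS the merged subtree
    reduced = {x: f for x, f in freq.items() if x not in (y_star, z_star)}
    reduced[omega] = freq[y_star] + freq[z_star]
    # The recursion places omega as a leaf; since omega is the pair (y*, z*),
    # the leaf already has y* and z* as its two children.
    return huffman_recursive(reduced)

# Hypothetical frequencies for illustration.
tree = huffman_recursive({"a": 0.6, "b": 0.25, "c": 0.1, "d": 0.05})
print(tree)
```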
Analyzing the Algorithm
- The algorithm is optimal
- ABL(T') = ABL(T) - fω
- The Huffman code for a given alphabet achieves the minimum ABL of any prefix code
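The identity ABL(T') = ABL(T) - fω can be checked leaf by leaf: in T, the leaves y* and z* hang one level below the position where ω sits as a leaf in T', and all other leaves have the same depth in both trees, so

```latex
\mathrm{ABL}(T)
  = \sum_{x \neq y^*,\, z^*} f_x \,\mathrm{depth}_T(x)
    + (f_{y^*} + f_{z^*})\bigl(\mathrm{depth}_{T'}(\omega) + 1\bigr)
  = \mathrm{ABL}(T') + f_\omega
```

This identity drives the induction: an optimal tree for S' plus the two-child expansion of ω yields an optimal tree for S.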
Implementation and Running Time
Create a leaf node for each symbol, labeled by its frequency, and add it to a
priority queue (PQ). Takes O(n log n)
While there is more than one node in the queue: Runs n - 1 times
Remove the two nodes of lowest frequency. Takes O(log n)
Create a new internal node with these two nodes as children and
with frequency equal to the sum of the two nodes' frequencies. Takes O(1) time.
Add the new node to the queue. Takes O(log n)
The remaining node is the tree’s root node.
So overall it takes O(n log n).
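The priority-queue version described above can be sketched with Python's heapq (the frequencies are invented; the tie-breaking counter in each heap entry is an implementation detail so that trees are never compared directly):

```python
import heapq

def huffman(freq):
    """Build the Huffman tree bottom-up with a min-heap keyed on frequency."""
    heap = [(f, i, x) for i, (x, f) in enumerate(freq.items())]
    heapq.heapify(heap)                      # leaf node per symbol
    counter = len(heap)
    while len(heap) > 1:                     # n - 1 merges
        f1, _, t1 = heapq.heappop(heap)      # two lowest-frequency nodes
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, counter, (t1, t2)))
        counter += 1
    return heap[0][2]                        # the remaining node is the root

def codewords(node, path=""):
    """Read off the prefix code: left edge = 0, right edge = 1."""
    if isinstance(node, str):
        return {node: path}
    left, right = node
    return {**codewords(left, path + "0"), **codewords(right, path + "1")}

# Hypothetical frequencies for illustration.
code = codewords(huffman({"a": 0.32, "b": 0.25, "c": 0.20, "d": 0.18, "e": 0.05}))
print(code)
```

Each heappop/heappush is O(log n) and the loop runs n - 1 times, matching the O(n log n) bound above.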
Very intriguing section although its proofs were long. I give it an 8/10.