====== 4.8 Huffman Codes and Data Compression ======

===== The Problem =====

**Encoding Symbols Using Bits**:
  * Since computers ultimately operate on bits, we need encoding schemes that take text written in richer alphabets and convert it into long strings of bits.
  * A first thought would be to represent every symbol using the same number of bits (a fixed-length encoding).
  * However, this is not efficient, because some symbols occur much more frequently than others.
  * Thus symbols should be represented by bit strings of different lengths, depending on their frequency.
  * //Data compression algorithms// are devoted to representing large files as compactly as possible: they take files as input and reduce the space they occupy through efficient encoding schemes.

**Variable-Length Encoding Schemes**:
  * More frequent letters are given shorter bit strings than less frequent ones.
  * The goal is to encode the symbols using as few bits per letter as possible.

**Prefix Codes**:
  * A prefix code encodes symbols in such a way that no encoding is a prefix of any other, so an encoded string can be decoded unambiguously from left to right.
  * Formally, a //prefix code// for a set S of letters is a function γ that maps each letter x ∈ S to some sequence of zeros and ones, such that:
    * for distinct x, y ∈ S, the sequence γ(x) is not a prefix of the sequence γ(y).

**Optimal Prefix Codes**:
  * For each letter x in S there is a frequency f<sub>x</sub>, representing the fraction of letters in the text that are equal to x.
  * If the text has n letters in total, then n·f<sub>x</sub> of them are equal to x.
  * The frequencies sum to 1: ∑<sub>x ∈ S</sub> f<sub>x</sub> = 1.
  * |γ(x)| denotes the length of the encoding of x.
  * ABL(γ) = ∑<sub>x ∈ S</sub> f<sub>x</sub>·|γ(x)| is the **Average number of Bits per Letter**; an optimal prefix code is one that minimizes this quantity.

==== Designing the Algorithm ====

  * The search space of the problem consists of all possible ways of mapping letters to bit strings that form a prefix code.
  * We need to develop a tree-based means of representing prefix codes that exposes their structure more clearly.

**Representing Prefix Codes Using Binary Trees**:
  * We take a binary tree T whose leaves are in one-to-one correspondence with the alphabet S, and we label each leaf with a distinct letter of S.
  * For each letter x in S, we follow the path from the root to the leaf labeled x:
    * each time the path goes from a node to its left child, we write down a 0;
    * each time the path goes from a node to its right child, we write down a 1.
  * The encoding of S constructed from T in this way is a prefix code.
  * The search for an optimal prefix code therefore reduces to the search for a binary tree T, together with a labeling of the leaves of T, that minimizes the ABL.
  * The length of the encoding of a letter x in S is simply the length of the path from the root to the leaf labeled x; this length is termed the **depth** of the leaf, and the depth of a leaf v in T is denoted depth<sub>T</sub>(v).
  * So our goal is to find the labeled tree T that minimizes the weighted average of the depths of all leaves, where the average is weighted by the frequencies of the letters that label the leaves: ABL(T) = ∑<sub>x ∈ S</sub> f<sub>x</sub>·depth<sub>T</sub>(x) (a concrete ABL computation is sketched below).
  * The binary tree corresponding to an optimal prefix code is **full**: every internal node has exactly two children.
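As a concrete illustration of these definitions, here is a minimal Python sketch (my own example, not from the textbook; the frequencies and the code γ below are made up) that checks the prefix-code property and computes the ABL of a given encoding:

<code python>
# A minimal sketch: a prefix code as a mapping from letters to bit strings,
# a check of the prefix property, and ABL = sum over x in S of f_x * |gamma(x)|.

def is_prefix_code(gamma):
    """True if no codeword is a prefix of another (distinct) codeword."""
    words = list(gamma.values())
    return not any(
        i != j and words[j].startswith(words[i])
        for i in range(len(words)) for j in range(len(words))
    )

def abl(freq, gamma):
    """Average number of bits per letter: sum_x f_x * |gamma(x)|."""
    return sum(freq[x] * len(gamma[x]) for x in freq)

# Hypothetical example: five letters with frequencies summing to 1,
# and one possible prefix code for them.
freq  = {'a': 0.32, 'b': 0.25, 'c': 0.20, 'd': 0.18, 'e': 0.05}
gamma = {'a': '11', 'b': '01', 'c': '001', 'd': '10', 'e': '000'}

assert is_prefix_code(gamma)
print(abl(freq, gamma))   # ≈ 2.25 bits per letter for this particular code
</code>

Changing γ changes the ABL; finding the prefix code that minimizes it for the given frequencies is exactly the problem the algorithm below solves.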
==== Algorithm to Construct an Optimal Prefix Code ====

<code>
To construct a prefix code for an alphabet S, with given frequencies:
    If S has two letters:
        Encode one letter using 0 and the other letter using 1
    Else:
        Let y* and z* be the two lowest-frequency letters
        Form a new alphabet S' by deleting y* and z* and replacing them
            with a new letter ω of frequency f_y* + f_z*
        Recursively construct a prefix code γ' for S', with tree T'
        Define a prefix code for S as follows:
            Start with T'
            Take the leaf labeled ω and add two children below it, labeled y* and z*
    End if
</code>

  * The algorithm above is referred to as //Huffman's Algorithm//, and the prefix code it produces is termed the //Huffman code//.
  * This algorithm produces an optimal prefix code.
  * It follows a bottom-up approach: it focuses on the leaves representing the two lowest-frequency letters and then continues by recursion.

==== Analyzing the Algorithm ====

  * The algorithm is optimal.
  * ABL(T') = ABL(T) - f<sub>ω</sub>: every letter other than y* and z* has the same depth in T and in T', while y* and z* sit one level deeper in T than ω does in T'. This identity drives the inductive proof of optimality.
  * The Huffman code for a given alphabet achieves the minimum ABL of any prefix code.

==== Implementation and Running Time ====

<code>
Create a leaf node for each symbol, labeled by its frequency,
    and add it to a priority queue PQ                         -- O(n log n) total
While there is more than one node in the queue:               -- O(n) iterations
    Remove the two nodes of lowest frequency                  -- O(log n)
    Create a new internal node with these two nodes as children,
        with frequency equal to the sum of the two
        nodes' frequencies                                    -- O(1)
    Add the new node to the queue                             -- O(log n)
The remaining node is the tree's root node
</code>

So the overall running time is O(n log n). A Python sketch of this priority-queue implementation appears at the end of these notes.

Very intriguing section, although its proofs were long. I give it an 8/10.
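For reference, here is a minimal Python sketch of the priority-queue implementation described above. This is my own illustration rather than code from the textbook; the function name ''huffman_code'' and the example frequencies are hypothetical.

<code python>
import heapq
from itertools import count

def huffman_code(freq):
    """Build a Huffman code from a dict mapping letters to frequencies.

    A sketch of the priority-queue implementation sketched in the notes,
    not an optimized version. Returns a dict: letter -> bit string.
    """
    tiebreak = count()  # tie-breaker so heapq never compares subtree tuples
    # Each heap entry is (frequency, tiebreaker, tree); a tree is either a
    # letter (leaf) or a (left, right) pair (internal node).
    heap = [(f, next(tiebreak), letter) for letter, f in freq.items()]
    heapq.heapify(heap)

    if len(heap) == 1:                      # degenerate one-letter alphabet
        return {heap[0][2]: '0'}

    while len(heap) > 1:                    # O(n) iterations
        f1, _, t1 = heapq.heappop(heap)     # two lowest-frequency nodes,
        f2, _, t2 = heapq.heappop(heap)     # each pop is O(log n)
        heapq.heappush(heap, (f1 + f2, next(tiebreak), (t1, t2)))  # O(log n)

    _, _, root = heap[0]                    # the remaining node is the root

    # Walk the tree: 0 for a left child, 1 for a right child.
    code = {}
    def assign(tree, bits):
        if isinstance(tree, tuple):
            assign(tree[0], bits + '0')
            assign(tree[1], bits + '1')
        else:
            code[tree] = bits
    assign(root, '')
    return code

# Hypothetical example frequencies (same as in the earlier sketch).
freq = {'a': 0.32, 'b': 0.25, 'c': 0.20, 'd': 0.18, 'e': 0.05}
print(huffman_code(freq))
# e.g. {'c': '00', 'e': '010', 'd': '011', 'b': '10', 'a': '11'}
</code>

On these example frequencies the resulting code has ABL about 2.23, slightly better than the 2.25 achieved by the hand-picked code in the earlier sketch.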