courses:cs211:winter2018:journals:patelk:chapter4 [2018/03/10 20:29] (current) – patelk
Readability:
| - | ===== 4.8 Huffman Codes and Data COmpression | + | ===== 4.8 Huffman Codes and Data Compression |
* Greedy rule to shrink the size of the problem instance so that an equivalent smaller problem can then be solved by recursion
**The Problem**
* __Encoding Symbols Using Bits__
* Use a fixed number of bits to represent each symbol in the alphabet
* Concatenate bit strings to form text
* Total length of encoding = sum, over all letters x in S, of the number of times x occurs times the length of the bit string γ(x) used to encode x.
* Average number of bits required per letter = ABL(γ) = Σ over letters x in S of f_x · |γ(x)|, where f_x is the frequency of x and |γ(x)| is the length of its codeword
| - | | + | |
**Designing the Algorithm**
* __Representing Prefix Codes Using Binary Trees__
* Rooted binary tree T: number of leaves is equal to the size of the alphabet S; each leaf has a distinct letter in S
* For each letter x in S, we follow the path from the root to the leaf labeled x; each time the path goes from a node to its left child, we write down a 0, and each time the path goes from a node to its right child, we write down a 1.
* //The encoding of S constructed from T is a prefix code.//
* For the encoding of x to be a prefix of the encoding of y: the path from the root to x would have to be a prefix of the path from the root to y. Same as saying that x would lie on the path from the root to y, which isn't possible if x is a leaf.
* Given a prefix code γ, we can build a binary tree recursively
* Start with the root; all letters x in S whose encodings begin with a 0 become leaves in the left subtree of the root, and likewise for encodings beginning with a 1; use recursion
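The recursive construction just described can be sketched directly; the three-letter code below is a made-up example:

```python
def build_tree(code):
    # Recursively build the binary tree for a prefix code: letters whose
    # codewords begin with 0 form the left subtree, those beginning with 1
    # form the right subtree; recurse after stripping the leading bit.
    if len(code) == 1:
        (letter, bits), = code.items()
        assert bits == "", "not a prefix code"
        return letter                                   # leaf: a single letter
    left = {x: b[1:] for x, b in code.items() if b.startswith("0")}
    right = {x: b[1:] for x, b in code.items() if b.startswith("1")}
    return (build_tree(left), build_tree(right))        # internal node

def decode(bits, tree):
    # Follow the path from the root: 0 goes to the left child, 1 to the
    # right; emit a letter whenever a leaf is reached, then restart.
    out, node = [], tree
    for b in bits:
        node = node[int(b)]
        if isinstance(node, str):                       # reached a leaf
            out.append(node)
            node = tree
    return "".join(out)

tree = build_tree({"a": "0", "b": "10", "c": "11"})
print(decode("01011", tree))  # -> abc
```

The decoder never needs lookahead precisely because the code is prefix-free: the first leaf reached is always the intended letter.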
* A binary tree is full if each node that is not a leaf has two children: there are no nodes with exactly one child.
* //The binary tree corresponding to the optimal prefix code is full//
* Proof using an exchange argument
* T = binary tree corresponding to the optimal prefix code
* Suppose it contains a node u with exactly one child v
* Convert T into T' by replacing node u with v
* If u = root, delete u and use v as root
* Otherwise, let w be the parent of u in T, delete u, and make v a child of w in place of u
* __A First Attempt: The Top-Down Approach__
* Goal: produce a labeled binary tree in which the leaves are as close to the root as possible
* Pack leaves as tightly as possible: want a "balanced" split of the alphabet into two halves of nearly equal total frequency at each step
* Shannon-Fano codes: no version of this top-down splitting strategy is guaranteed to always produce an optimal prefix code
* //What If We Knew The Tree Structure of the Optimal Prefix Code?//
* Assume we have the binary tree T* corresponding to an optimal prefix code; we still need to figure out the labeling of the leaves
* Suppose that u and v are leaves of T* such that depth(u) < depth(v). Further suppose that in a labeling of T* corresponding to an optimal prefix code, leaf u is labeled with y in S and leaf v is labeled with z in S. Then fy >= fz.
* Proof using an exchange argument
* If fy < fz, the effect of exchanging the labels y and z is as follows: the multiplier on fy increases [from depth(u) to depth(v)] and the multiplier on fz decreases by the same amount [from depth(v) to depth(u)].
* The change to the overall sum is therefore negative, contradicting the supposed optimality of the prefix code.
* Having a lower-frequency letter at a strictly smaller depth than some higher-frequency letter is ruled out for an optimal solution
* Take all leaves of depth 1 and label them with the highest-frequency letters; do the same with each level of increasing depth -> leads to an optimal labeling of T*
* Let v be a leaf of maximum depth in T* and w its sibling; then w is also a leaf of T*
* If w were not a leaf, there would be some leaf w' in the subtree below it. But then w' would have a depth greater than that of v, contradicting our assumption that v is a leaf of maximum depth in T*.
* There is an optimal prefix code, with corresponding tree T*, in which the two lowest-frequency letters are assigned to leaves that are siblings in T*.
* //An Algorithm to Construct an Optimal Prefix Code//
* Suppose y* and z* are the two lowest-frequency letters in S.
* Can lock them together because we know they end up as sibling leaves below a common parent
* Replace y* and z* with a single meta-letter ω whose frequency is the sum fy* + fz*, giving a smaller alphabet; solve the smaller instance recursively, then expand ω back into the sibling leaves y* and z*
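The merge rule above can be sketched with Python's `heapq`; the frequencies are made up, and a counter is used as a tie-breaker so the heap never has to compare subtrees:

```python
import heapq
from itertools import count

def huffman(freqs):
    # Repeatedly lock the two lowest-frequency (meta-)letters together as
    # sibling leaves under a new meta-letter whose frequency is their sum.
    tie = count()                      # tie-breaker so heapq never compares trees
    heap = [(f, next(tie), x) for x, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)        # lowest frequency
        f2, _, t2 = heapq.heappop(heap)        # second lowest
        heapq.heappush(heap, (f1 + f2, next(tie), (t1, t2)))
    (_, _, tree), = heap

    code = {}
    def walk(node, bits):
        # Read off codewords along root-to-leaf paths: 0 left, 1 right.
        if isinstance(node, tuple):
            walk(node[0], bits + "0")
            walk(node[1], bits + "1")
        else:
            code[node] = bits
    walk(tree, "")
    return code

code = huffman({"a": 0.5, "b": 0.25, "c": 0.15, "d": 0.1})
print({x: len(b) for x, b in sorted(code.items())})  # {'a': 1, 'b': 2, 'c': 3, 'd': 3}
```

The two rarest letters c and d end up as sibling leaves at the greatest depth, exactly as the sibling lemma predicts.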
| + | // | ||
| + | |||
| + | {{: | ||
| + | |||
| + | {{: | ||
| + | |||
* Bottom-up approach: focuses on the leaves representing the two lowest-frequency letters, and then continues by recursion.

**Analyzing the Algorithm**
* //The Optimality of the Algorithm//
* Recursive algorithm: proof by induction
* Base case: optimal for all two-letter alphabets (one bit per letter)
* Inductive hypothesis: optimal for all alphabets of size k-1
* Inductive step: consider an alphabet S of size k
* The Huffman code for a given alphabet achieves the minimum average number of bits per letter of any prefix code
| + | * // | ||
* The algorithm runs in time polynomial in k, the number of letters in the alphabet
* Recursive calls: a sequence of k-1 iterations
* Each iteration consists simply of identifying the two lowest-frequency letters -> O(k)
* Combined: O(k^2)
* Use a priority queue to maintain a set of k elements
* Allows for insertion and extraction of the minimum
* Each iteration: extract the minimum twice; insert a new letter whose key is the sum of these two minimum frequencies
* Using priority queues via heaps: each insertion and extraction takes O(log k) -> over k iterations, O(k log k)
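A small sketch of this accounting, with made-up integer frequencies: each of the k-1 merge iterations performs exactly two extract-mins and one insertion on the heap, each costing O(log k) with `heapq`:

```python
import heapq

def count_heap_ops(freqs):
    # Run the k-1 merge iterations on a heap of frequencies, counting the
    # extract-min and insert operations performed in each iteration.
    heap = list(freqs)
    heapq.heapify(heap)                 # building the heap is O(k)
    iterations = extracts = inserts = 0
    while len(heap) > 1:
        a = heapq.heappop(heap)         # extract-min, O(log k)
        b = heapq.heappop(heap)         # extract-min, O(log k)
        heapq.heappush(heap, a + b)     # insert the meta-letter, O(log k)
        iterations += 1
        extracts += 2
        inserts += 1
    return iterations, extracts, inserts

# k = 6 letters with hypothetical frequencies
print(count_heap_ops([45, 13, 12, 16, 9, 5]))  # (5, 10, 5): k-1 iterations
```

So the total work is 3(k-1) heap operations of O(log k) each, giving the O(k log k) bound stated above.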
| + | |||
| + | **Extensions** | ||
* Data Compression:
* Need to be highly compressed: use a "
* Cannot adapt to changes in the text: e.g., the frequencies are stable for half the text, but then change radically
* Halfway into the sequence, insert some kind of instruction to completely change the encoding
* This will pay off in the long run: such schemes are called adaptive compression schemes
| + | |||
| + | ==== Personal Thoughts ==== | ||
| + | |||
While arguably one of the more interesting topics of this textbook, this section was long and very dense. There was a lot of information that was hard to digest all at once. Our discussion in class was helpful, but not long enough to make this section very clear. I will most likely need to review this chapter section again.
Readability:
