===== Section 8: Huffman Codes and Data Compression =====
  
   * **Summary:**
This problem concerns data compression. To represent text in a computer, you have to encode each symbol as bits. One way to do this is to use a fixed number of bits per letter (e.g., if you've got 32 symbols to represent, you'd use 5 bits for each). But we want to reduce the average number of bits it takes to represent letters, and we can take advantage of the nonuniform frequencies with which letters come up (e.g., the vowels come up a lot more often than the letter 'z', so we really want the representations of the vowels to be short, while 'z' can take more than 5 bits if it needs to). Morse code is one good example of this idea, but it uses dots, dashes, and pauses (so that's three symbols, not just two like binary). If you converted Morse code into 0s and 1s without the pauses, you'd run into ambiguity caused by prefixes: e.g., if 'a' is encoded as 00, 'b' as 001, and 'c' as 1, then 001 could mean 'ac' or just 'b'. The book introduces some notation for making the problem precise. Mostly what's important is that ABL is the average bit length: the sum over all letters x of f_x · |γ(x)|, where f_x is x's frequency and |γ(x)| is the length of its encoding (p. 166 if I need to review).

We can use binary trees to encode symbols with different bit lengths without having that prefix issue. All of the leaves are symbols, and none of the internal nodes are symbols; the path from the root to a leaf is that symbol's encoding, where taking a "left" edge is a 0 and taking a "right" edge is a 1. Pictures are helpful for this (see slides). The book proves that the optimal tree is full. They then go into the "intuitive" way to try to build this tree, which is top-down, but that doesn't work; the actual solution builds the tree bottom-up. There will always be at least two leaves at the lowest level of the tree (because it's definitely full), so the two lowest-frequency symbols should be siblings. You merge them into a mini-tree whose frequency is the sum of theirs, then take the two smallest frequencies among the remaining symbols plus this new combined node, merge those into a tree together, and keep doing this recursively. The book states this more formally, proves that it has to be optimal (p. 174), and explains why the overall running time is O(k log k) if k is the number of symbols you're encoding. Sometimes this isn't actually the best way to compress data; they go into some scenarios where it wouldn't achieve optimal compression.
  
   * **Insights and Questions:**
Huffman codes may not always be the best for every data set, but they are still used for compressing text, aren't they? Because the frequencies of letters in the English language are pretty consistent across different texts.
  
   * **Readability:**
This was a long section, but it was pretty easy to read, except that the math they introduced with summations was a little hard to figure out.
===== Section 9: Minimum-Cost Arborescences: A Multi-Phase Greedy Algorithm =====
  