courses:cs211:winter2011:journals:camille:chapter4 [2011/02/21 23:00] (current) – cobbc
  * **Summary:**
Oh hey, the answer to other times a minimum spanning tree can be useful! (We did this in class ... why did I still have to ask that question earlier, haha?) Clustering comes up when you've got a collection of objects and you're trying to separate them into coherent groups. How do you tell how similar objects are? A distance function: the farther apart two objects are, the less similar they are (and the closer two objects are, the more likely they are to belong in the same cluster). If you're trying to find a clustering of maximum spacing, you want the smallest distance between any two clusters to be as big as possible. They explain the "long way" of doing this, but basically, if you want k clusters, you delete the k-1 biggest edges from the minimum spanning tree. The resulting components are the clusters, and the smallest deleted edge is the spacing between them, which is as large as possible.
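A minimal sketch of that idea (my own function names and a toy edge list; the book phrases this as running Kruskal's algorithm and stopping before the last k-1 merges, which is equivalent to deleting the k-1 heaviest MST edges):

```python
def find(parent, x):
    """Union-find root lookup with path compression."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def cluster(n, edges, k):
    """Partition points 0..n-1 into k maximum-spacing clusters.

    edges is a list of (distance, u, v) pairs. We run Kruskal's
    algorithm but stop as soon as only k components remain.
    """
    parent = list(range(n))
    components = n
    for dist, u, v in sorted(edges):
        ru, rv = find(parent, u), find(parent, v)
        if ru != rv:
            if components == k:
                break  # the next merge would drop below k clusters
            parent[ru] = rv
            components -= 1
    groups = {}
    for x in range(n):
        groups.setdefault(find(parent, x), []).append(x)
    return list(groups.values())
```

For example, `cluster(4, [(1, 0, 1), (2, 2, 3), (10, 1, 2), (12, 0, 3)], 2)` groups the two close pairs together, giving clusters `{0, 1}` and `{2, 3}` with spacing 10.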
  * **Insights and Questions:**
They talk about the distance function as one basis for clustering, but there must be other things to base it on. Also, in the context of pictures that they mentioned, the pixel-based distance they talked about is probably not the best way to cluster pictures into similar groups the way you'd really want to in real life. Interesting stuff.
  * **Readability:**
Pretty easy to read. They "went through with" the explanation of the "long way" (i.e., the non-minimum-spanning-tree approach) to the solution more than we did before they started talking about the minimum spanning tree. I found that interesting.
===== Section 8: Huffman Codes and Data Compression =====
+ | |||
  * **Summary:**
This problem concerns data compression. In order to represent text in symbols, you have to change the letters into bits. One way to do this would be to use a fixed number of bits to represent each letter (ex. if you've got 32 symbols to represent, you'd use 5 bits for each letter). But we want to reduce the average number of bits it takes to represent letters. You can take advantage of the nonuniform frequencies with which letters come up (ex. the vowels come up a lot more than the rarest letters) by giving the frequent letters shorter bit strings and the rare letters longer ones; Huffman's algorithm builds an optimal prefix code this way by repeatedly merging the two lowest-frequency symbols.
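A minimal sketch of Huffman's greedy construction (function and variable names are my own; frequencies are assumed to be given as a dict):

```python
import heapq
from itertools import count

def huffman_codes(freq):
    """Build an optimal prefix code from {symbol: frequency}.

    Repeatedly merge the two lowest-frequency trees; the counter
    breaks frequency ties so the heap never compares trees directly.
    """
    tick = count()
    heap = [(f, next(tick), sym) for sym, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)   # two cheapest trees
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tick), (left, right)))

    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):         # internal node: recurse
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                               # leaf: record the code
            codes[node] = prefix or "0"
    walk(heap[0][2], "")
    return codes
```

On the classic frequency table `{'a': 45, 'b': 13, 'c': 12, 'd': 16, 'e': 9, 'f': 5}`, the most frequent symbol `'a'` ends up with a one-bit code while the rare `'e'` and `'f'` get four-bit codes, and no code is a prefix of another, so decoding is unambiguous.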
+ | |||
  * **Insights and Questions:**
Huffman codes may not always be the best for every set of data, but they're still used for compressing text, aren't they? Because the frequencies of letters in the English language are pretty consistent across different texts.
+ | |||
  * **Readability:**
This was a long section, but it was pretty easy to read, except that the math they introduced with summations was a little hard to figure out.
===== Section 9: Minimum-Cost Arborescences =====