===== Section 5: The Minimum Spanning Tree Problem =====
  * **Summary:**
How do you build a "communication network" on top of a connected graph so that there's a path between every pair of nodes, but as cheaply as possible? The solution is going to be a tree, because if there were any cycles, you could get rid of any one edge of the cycle, still reach every node, and decrease the overall cost (proof p. 142) - hence, the minimum spanning TREE. This problem has a bunch of intuitive greedy algorithms that actually work. (1) Start without any edges and insert edges from the graph in order of increasing cost, as long as adding an edge doesn't create a cycle (Kruskal's Algorithm). (2) Start with some node s and grow the tree greedily by attaching the node that can be added as cheaply as possible (Prim's Algorithm ... a modified Dijkstra's). (3) Start with the full graph and delete edges in order of decreasing cost, unless deleting an edge would disconnect the graph (Reverse-Delete Algorithm). The Cut Property: for any subset S of nodes that is neither empty nor all of V, the cheapest edge with one end in S and the other end in V-S definitely belongs to the minimum spanning tree (proof p. 145). That lets you easily prove Kruskal's and Prim's Algorithms correct. The Cycle Property: the most expensive edge in any cycle of the graph definitely does not belong to the minimum spanning tree (proof p. 147). That lets you prove the Reverse-Delete Algorithm correct. With a heap-based priority queue, you can get Prim's Algorithm to run in O(mlogn) time (Kruskal's also runs in that time - see the next section, and the little sketch after this entry). Note: this problem was introduced in the context of a communication network, but it's probably not the right kind of solution to a network problem, because cutting any one edge disconnects the network, and with so few edges there are likely to be congestion issues.

  * **Insights and Questions:**
If this kind of solution isn't good for networks, what is it actually useful for? There has to be something, right?

  * **Readability:**
Super easy to read! The only thing is that we emphasized the "naming" of the Cut Property and Cycle Property more than the book did. It called them the cut/cycle property but didn't bold those words or make them section titles or anything (I missed them on my first read-through).
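Just to make the heap-based version concrete for myself, here's a minimal Python sketch of Prim's Algorithm. This isn't from the book; the adjacency-list format, the "lazy deletion" of stale heap entries, and names like ''prim_mst'' are my own assumptions.

<code python>
import heapq

def prim_mst(graph, start):
    """Grow a spanning tree from start, always attaching the node that can be
    reached by the cheapest edge (heap-based Prim's Algorithm)."""
    in_tree = {start}
    tree_edges = []
    # Heap entries are (cost, node already in the tree, node outside the tree).
    heap = [(cost, start, v) for v, cost in graph[start]]
    heapq.heapify(heap)
    while heap and len(in_tree) < len(graph):
        cost, u, v = heapq.heappop(heap)
        if v in in_tree:
            continue  # stale entry: v was already attached by a cheaper edge
        in_tree.add(v)
        tree_edges.append((u, v, cost))
        for w, c in graph[v]:
            if w not in in_tree:
                heapq.heappush(heap, (c, v, w))
    return tree_edges

# Tiny undirected example graph in adjacency-list form.
graph = {
    'a': [('b', 1), ('c', 4)],
    'b': [('a', 1), ('c', 2), ('d', 6)],
    'c': [('a', 4), ('b', 2), ('d', 3)],
    'd': [('b', 6), ('c', 3)],
}
print(prim_mst(graph, 'a'))   # [('a', 'b', 1), ('b', 'c', 2), ('c', 'd', 3)]
</code>

Skipping stale entries instead of doing a real decrease-key means the heap can hold up to m entries, so this is O(mlogm), which is still O(mlogn) since logm is at most 2logn.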
===== Section 6: Implementing Kruskal's Algorithm: The Union-Find Data Structure =====

  * **Summary:**
The Union-Find data structure stores a representation of connected components in a way that supports rapid searching and updating (i.e. you start with a graph that has only nodes and no edges and then add edges, keeping track of the connected components the whole time). This is what you need to implement Kruskal's Algorithm efficiently. The Union-Find data structure maintains disjoint sets such that Find(u) gives the "name" of the connected component that u is in. If Find(u) = Find(v), then u and v are in the same connected component, so in Kruskal's Algorithm we would NOT add an edge between them. Operations: MakeUnionFind(S) - for a set S, returns a Union-Find where every element is its own separate set (i.e. no edges yet), in O(n) time; Find(u) - already explained, in O(logn) time; Union(A,B) - merges two sets A and B into a single set, in O(logn) time. A simple implementation uses an array that keeps track of which component each element is in and, when two components are combined, relabels all of the elements of the smaller component. With this, Find takes O(1) time and a single Union can take as long as O(n), but for Kruskal's all of the Unions together take only O(nlogn) time. You can also implement the structure using pointers, which is a little more complicated (it's all explained around p. 154); that lets you do a Union in O(1) time, but makes Find take O(logn) time. The book discusses a way to make Find run a little faster in certain cases (path compression), but it won't really help in any of the applications they/we use the data structure for (because other operations determine the running time). The book then goes into exactly how to implement Kruskal's Algorithm with the Union-Find data structure so that Kruskal's takes O(mlogn) time; there's a sketch of the idea after this entry.

  * **Insights and Questions:**
It's really cool that this specific data structure can make this run faster! This is something I had trouble seeing earlier, and this section definitely helped.

  * **Readability:**
The stuff about the pointers was really hard to read. That's probably partly because we didn't get into that in class and because pointers bring back nightmares from 209 (haha just kidding). Anyway, it just took more concentration to understand that part.
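Here's a rough Python sketch of the kind of structure the section describes - a pointer-style Union-Find with union-by-size, plus Kruskal's Algorithm on top of it. The ''(cost, u, v)'' edge format and names like ''UnionFind'' and ''kruskal_mst'' are mine, not the book's.

<code python>
class UnionFind:
    """Disjoint sets: Union is O(1) beyond the two Finds, and Find is O(log n)
    because the smaller tree always gets pointed at the larger one."""
    def __init__(self, elements):           # MakeUnionFind(S): every element alone
        self.parent = {x: x for x in elements}
        self.size = {x: 1 for x in elements}

    def find(self, u):                      # the "name" of u's component
        while self.parent[u] != u:
            u = self.parent[u]
        return u

    def union(self, a, b):                  # merge the components of a and b
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        if self.size[ra] < self.size[rb]:   # keep ra the larger of the two
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]

def kruskal_mst(nodes, edges):
    """edges is a list of (cost, u, v); returns the MST edges in O(m log n)."""
    uf = UnionFind(nodes)
    tree = []
    for cost, u, v in sorted(edges):        # edges in order of increasing cost
        if uf.find(u) != uf.find(v):        # adding (u, v) creates no cycle
            uf.union(u, v)
            tree.append((u, v, cost))
    return tree

edges = [(1, 'a', 'b'), (2, 'b', 'c'), (4, 'a', 'c'), (3, 'c', 'd'), (6, 'b', 'd')]
print(kruskal_mst(['a', 'b', 'c', 'd'], edges))   # [('a','b',1), ('b','c',2), ('c','d',3)]
</code>

The sort is the O(mlogm) = O(mlogn) bottleneck; all of the Find and Union calls together cost less than that.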
===== Section 7: Clustering =====

  * **Summary:**
Oh hey, the answer to my earlier question about when a minimum spanning tree is actually useful! (We did this in class ... why did I still have to ask that question earlier, haha?) Clustering comes up when you've got a collection of objects and you're trying to separate them into coherent groups. How do you tell how similar two objects are? A distance function - the farther apart two objects are, the less similar they are (and the closer they are, the more likely they are to belong to the same cluster). If you're trying to find a clustering of maximum spacing, you want the smallest distance between any two clusters to be as big as possible. The book explains the "long way" of doing this, but basically if you want k clusters, you should delete the k-1 most expensive edges from the minimum spanning tree. The remaining connected components are the clusters, and the cheapest deleted edge is the spacing, which is as large as possible. They also prove that it works. There's a little sketch of this after the entry.

  * **Insights and Questions:**
They talk about the distance function being one way to cluster, but there must be other things to base it on. Also, in the context of pictures that they mentioned, the pixel stuff they talked about is probably not the best way to cluster pictures into similar groups the way you'd really want to in real life. Interesting stuff.

  * **Readability:**
Pretty easy to read. They sort of "went through with" the explanation of the "long way" (i.e. the non-minimum-spanning-tree explanation) to the solution more than we did before they started talking about the minimum spanning tree. I found that interesting.
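And a tiny sketch of the clustering version, assuming the distances come in as ''(distance, u, v)'' tuples; the inline union-find and the name ''max_spacing_clusters'' are just made up for illustration. Stopping Kruskal's Algorithm once only k components remain is the same thing as deleting the k-1 most expensive minimum spanning tree edges.

<code python>
def max_spacing_clusters(nodes, dists, k):
    """Single-linkage clustering of maximum spacing: run Kruskal's Algorithm
    over the (distance, u, v) list but stop once only k components remain."""
    parent = {x: x for x in nodes}          # tiny inline union-find

    def find(u):
        while parent[u] != u:
            u = parent[u]
        return u

    components = len(nodes)
    spacing = None
    for dist, u, v in sorted(dists):
        ru, rv = find(u), find(v)
        if ru == rv:
            continue                        # u and v are already clustered together
        if components == k:
            spacing = dist                  # cheapest distance between two clusters
            break
        parent[rv] = ru                     # merge the two clusters
        components -= 1

    clusters = {}
    for x in nodes:                         # group the points by component name
        clusters.setdefault(find(x), []).append(x)
    return list(clusters.values()), spacing

points = ['a', 'b', 'c', 'd']
dists = [(1, 'a', 'b'), (2, 'b', 'c'), (3, 'c', 'd'), (4, 'a', 'c'), (6, 'b', 'd')]
print(max_spacing_clusters(points, dists, 2))   # ([['a', 'b', 'c'], ['d']], 3)
</code>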
===== Section 8: Huffman Codes and Data Compression =====

  * **Summary:**
This problem concerns data compression. In order to represent text symbols, you have to encode them as bits. One way to do this is to use a fixed number of bits for each letter (e.g. if you've got 32 symbols to represent, you'd use 5 bits per letter). But we want to reduce the average number of bits it takes to represent a letter. You can take advantage of the nonuniform frequencies with which letters come up (e.g. the vowels come up a lot more often than the letter 'z', so we really want the representations of the vowels to be short, while 'z' can take more than 5 bits if it needs to). Morse code is one good example of this, but it uses dots, dashes and pauses (so that's three symbols, not just two like binary). If you converted Morse code into 0s and 1s without the pauses, you'd start to run into prefix issues and ambiguity. E.g. if 'a' is encoded as 00, 'b' as 001 and 'c' as 1, then 001 could mean 'ac' or just 'b'. The book introduces some notation for making the problem precise; mostly what's important is that ABL is the average bit length, the sum over all letters of the letter's frequency times the length of its encoding (p. 166 if I need to review). We can use a binary tree to encode symbols with different bit lengths without having that prefix issue: all of the leaves are symbols, none of the internal nodes are symbols, and the path to a leaf is the encoding of that symbol, where taking a "left" edge is a 0 and taking a "right" edge is a 1. Pictures are helpful for this (see slides). The book proves that the optimal tree is full. It then goes into the "intuitive" way to try to create this tree, which is top-down, but that doesn't work. The actual solution builds the tree bottom-up. There will always be at least two leaves at the bottom level of the tree (because it's definitely full), so the two lowest-frequency symbols should be siblings. Then you combine their frequencies, so the mini tree that's just those two acts like one symbol with the combined frequency; you take the two smallest frequencies out of all of the remaining frequencies plus this new one, put those two into a tree together, and basically keep doing this recursively. The book says this more formally, proves that it has to be optimal (p. 174), and explains why the overall running time is O(klogk) if k is the number of symbols you're encoding. There's a sketch of the construction after this entry. Sometimes this isn't actually the best way to encode data; they go into some scenarios where it wouldn't give optimal compression.

  * **Insights and Questions:**
Huffman codes may not always be the best for every set of data, but they are still used for compressing text, aren't they? Because the frequencies of letters in the English language are pretty consistent across different texts.

  * **Readability:**
This was a long section, but it was pretty easy to read, except that the math they introduced with the summations was a little hard to figure out.
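To see the bottom-up construction in code, here's a small Python sketch using a heap keyed on frequency. The example frequencies and names like ''huffman_codes'' are made up; this is just my take on the algorithm, not the book's pseudocode.

<code python>
import heapq

def huffman_codes(freqs):
    """Build an optimal prefix code bottom-up: repeatedly make the two
    lowest-frequency subtrees siblings, so rare symbols end up deepest.
    freqs maps symbol -> frequency; returns symbol -> bit string.
    With a heap this takes O(k log k) time for k symbols."""
    # Heap entries are (frequency, tie_breaker, tree); a tree is either a
    # symbol (a leaf) or a (left, right) pair of subtrees.
    heap = [(f, i, s) for i, (s, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)     # the two smallest frequencies ...
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, counter, (t1, t2)))   # ... become siblings
        counter += 1
    root = heap[0][2]

    codes = {}
    def walk(tree, path):                   # "left" edge is a 0, "right" edge is a 1
        if isinstance(tree, tuple):         # internal node: recurse into both children
            walk(tree[0], path + '0')
            walk(tree[1], path + '1')
        else:                               # leaf: the path down to it is the codeword
            codes[tree] = path
    walk(root, '')
    return codes

freqs = {'a': 0.32, 'e': 0.25, 't': 0.20, 's': 0.18, 'q': 0.03, 'z': 0.02}
codes = huffman_codes(freqs)
abl = sum(freqs[s] * len(c) for s, c in codes.items())   # ABL = sum of freq x length
print(codes, abl)   # rare letters like 'q' and 'z' get the longest codewords
</code>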
===== Section 9: Minimum-Cost Arborescences: A Multi-Phase Greedy Algorithm =====