===== 4.6 Union-Find Data Structure ===== | ===== 4.6 Union-Find Data Structure ===== |
| |
  * Kruskal's algorithm requires the ability to tell efficiently whether adding an edge would create a cycle. One way to do this is to track the connected subgraphs created by the edges already added; if both ends of an edge are in the same subgraph, adding it would create a cycle. At its simplest, a UFDS could be an array giving the name of the set to which each node belongs, but then the Union operation would be linear. By always keeping the name of the larger set when two sets merge, and maintaining a list of the nodes belonging to each set, Find takes O(1) and any sequence of k Union operations takes O(k log k) time. But this can be improved by using a reverse tree, where each node has a pointer to its parent. By again always keeping the name of the larger set, these inverse trees stay shallow, so that Union is O(1) and Find is O(log n). A further non-asymptotic speedup comes from path compression: on each completed Find, update the traversed nodes' pointers to point directly at the root, quickening future operations (see the sketch after this list).
  * Nice to see this finally explained.
  * No questions.
  * No complaints.
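
To make the reverse-tree idea concrete, here is a minimal Python sketch with union by size and path compression; the class and method names are my own illustration, not the book's.

<code python>
class UnionFind:
    def __init__(self, n):
        self.parent = list(range(n))  # each node starts as its own root
        self.size = [1] * n           # number of nodes in each root's tree

    def find(self, x):
        # Walk up to the root, then point every node on the path
        # directly at the root (path compression) to speed later finds.
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, x, y):
        # Hang the smaller tree under the larger root (union by size)
        # so the trees stay shallow; returns False if already joined.
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return False
        if self.size[rx] < self.size[ry]:
            rx, ry = ry, rx
        self.parent[ry] = rx
        self.size[rx] += self.size[ry]
        return True
</code>

Kruskal's algorithm can then call union(u, v) for each edge in increasing order of length; a return value of False means both ends already lie in the same component, so the edge would close a cycle.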

===== 4.7 Clustering =====

  * A problem related to finding MSTs is clustering: grouping a graph's nodes into a given number of clusters so as to maximize the spacing, the minimum distance between nodes in different clusters. This can be achieved with only a slight adaptation of Kruskal's algorithm for finding MSTs: rather than stopping when all nodes are connected, stop when only as many connected sets remain in the UFDS as clusters desired (see the sketch after this list). This produces a maximum-spacing clustering: its spacing equals the length of the first edge that Kruskal's algorithm would have added after it was stopped, and any other clustering must separate two nodes that this algorithm had already joined, so some edge on the path joining them crosses between that clustering's clusters; since Kruskal's algorithm adds edges by increasing length, that edge is no longer than the stopping edge, and the other clustering's spacing can be no greater.
  * I am somewhat surprised to see no mention of runtime, even if it is fairly obviously the same as Kruskal's.
  * No questions.
  * No complaints.
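
Here is a sketch of the stopped-early Kruskal's algorithm described above, reusing the UnionFind sketch from 4.6; the function name and the (length, u, v) edge format are my assumptions.

<code python>
def k_clusters(n, edges, k):
    # Maximum-spacing k-clustering: run Kruskal's algorithm on edges
    # given as (length, u, v) tuples and stop once k components remain.
    uf = UnionFind(n)
    components = n
    for length, u, v in sorted(edges):  # increasing edge length
        if components == k:
            break                       # k clusters remain: stop early
        if uf.union(u, v):              # skip edges inside one cluster
            components -= 1
    # Group the nodes by the root of their set to report the clusters.
    clusters = {}
    for node in range(n):
        clusters.setdefault(uf.find(node), []).append(node)
    return list(clusters.values())
</code>

As for the runtime question above: sorting the edges dominates, so this is O(m log m), the same bound as Kruskal's algorithm itself.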

===== 4.8 Huffman Codes and Data Compression =====

  * Computers encode characters as bits, and so some easy compression can come from finding an encoding that minimizes average bit-length. Since character frequencies are fairly predictable over sufficiently large samples of text in a given language, some compression can come from adopting an encoding tailored to certain uses, as Morse code was. But Morse code was not optimized for the same constraints as computer encodings are; humans are accustomed to the clear dead-air letter and word spacing, but that forces Morse code to be ternary, with dits, dahs, and spaces; since some letter codes are prefixes of others, omitting the spaces would lead to ambiguity. Therefore, computer encodings should be prefix codes, where no code is a prefix of another. One can represent prefix codes as binary trees, where each edge represents a single bit and only leaves carry characters (a character's code then being the sequence of bits on the path from the root to its leaf); an optimal prefix code will be represented by a full tree, one in which every internal node has two children. Given a tree, characters should obviously be assigned to leaves on increasing levels by decreasing frequency, so that the most common characters correspond to the leaves closest to the root. One can find the tree for the optimal code by proceeding through the characters by increasing frequency: if there are only two characters to be encoded, encode one as 0 and the other as 1 arbitrarily, and if there are more, group the least frequent two into a meta-character with frequency the sum of their frequencies and recurse. Lastly, expand the meta-characters by assigning one child to 0 and the other to 1, recursively expanding any meta-characters among those children. Optimality can be established by induction on the alphabet size: some optimal tree has the two least frequent characters as sibling leaves at the lowest level, so an optimal tree for the merged alphabet yields an optimal tree for the original. Using a priority queue, the algorithm runs in O(k log k), as k iterations of an O(log k) operation (see the sketch after this list).
  * No insights.
  * No questions.
  * I dislike presenting the wrong solution first, except when giving a history of the algorithm's development; I am wary of making someone's first association with a problem be a faulty algorithm. But the section is otherwise well written.
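
To see the priority-queue version concretely, here is a Python sketch of Huffman's algorithm; the function name and the tuple representation of the trees are my own choices, not the book's.

<code python>
import heapq
from itertools import count

def huffman_code(freqs):
    # Build an optimal prefix code from a dict mapping characters to
    # frequencies: k heap operations of O(log k) each, so O(k log k).
    tiebreak = count()  # breaks frequency ties so trees are never compared
    heap = [(f, next(tiebreak), char) for char, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two least frequent (meta-)characters into one
        # meta-character whose frequency is the sum of theirs.
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tiebreak), (left, right)))
    _, _, tree = heap[0]

    # Expand the tree: one child of each meta-character gets 0, the
    # other 1; leaves are characters, internal nodes are (left, right).
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:
            codes[node] = prefix or "0"  # a lone character gets "0"
    walk(tree, "")
    return codes

# For example, huffman_code({'a': 0.45, 'b': 0.25, 'c': 0.15,
# 'd': 0.10, 'e': 0.05}) gives 'a' a 1-bit code and 'e' a 4-bit code.
</code>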