  * No questions.
  * No complaints.

===== 4.7 Clustering =====

  * A problem related to finding MSTs is clustering: partitioning the nodes of a graph into a given number of groups so as to maximize the spacing, the minimum distance between nodes in different groups. This can be achieved with only a slight adaptation of Kruskal's algorithm for finding MSTs: rather than stopping when all nodes are connected, stop when the number of connected sets in the UFDS has fallen to the desired number of clusters (a rough code sketch follows this list). This produces a maximum-spacing clustering. Its spacing equals the length of the first edge Kruskal's algorithm would have added had it continued; any other clustering must separate two nodes that this algorithm placed in the same group, the path joining those nodes uses only edges added before stopping, and since Kruskal's algorithm considers edges in increasing order of length, some edge on that path crosses between the other clustering's groups and is no longer than the spacing, so the other clustering's spacing can be no greater.
  * I am somewhat surprised to see no mention of runtime, even if it is fairly obviously the same as Kruskal's.
  * No questions.
  * No complaints.
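
To make the stopping rule concrete, here is a rough Python sketch of the idea (mine, not from the book): Kruskal's algorithm with a simple union-find, halted once only k connected components remain. The function name ''max_spacing_clusters'' and the (length, u, v) edge format are assumptions made only for illustration.

<code python>
def max_spacing_clusters(nodes, edges, k):
    """Partition nodes into k clusters; also return the resulting spacing."""
    parent = {v: v for v in nodes}

    def find(v):                         # union-find "find" with path halving
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    components = len(nodes)
    spacing = None
    for length, u, v in sorted(edges):   # consider edges by increasing length
        ru, rv = find(u), find(v)
        if ru == rv:
            continue                     # u and v are already in one cluster
        if components == k:
            spacing = length             # first edge Kruskal would add next
            break
        parent[rv] = ru                  # merge the two clusters
        components -= 1

    clusters = {}
    for v in nodes:
        clusters.setdefault(find(v), []).append(v)
    return list(clusters.values()), spacing


# Tiny example: two tight pairs separated by a long edge.
print(max_spacing_clusters(["a", "b", "c", "d"],
                           [(1, "a", "b"), (2, "c", "d"), (5, "b", "c")], 2))
# -> ([['a', 'b'], ['c', 'd']], 5)
</code>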

===== 4.8 Huffman Codes and Data Compression =====

  * Computers encode characters as bits, so some easy compression can come from finding an encoding that minimizes the average bit-length per character. Since character frequencies in a given language are fairly predictable over sufficiently large samples of text, compression can come from tailoring the encoding to its expected use, as Morse code was. But Morse code was not optimized for the same constraints as computer encodings: humans are accustomed to clear dead-air spacing between letters and words, which effectively makes Morse code ternary, with dits, dahs, and spaces; since some letter codes are prefixes of others, omitting the spaces would lead to ambiguity. Computer encodings should therefore be prefix codes, in which no code is a prefix of another. One can represent a prefix code as a binary tree in which only leaves carry characters, the code of a character being the sequence of bits labeling the path from the root down to its leaf; an optimal prefix code corresponds to a full tree, one in which every internal node has two children. Given such a tree, characters should be assigned to leaves in order of increasing depth by decreasing frequency, so that the most common characters sit closest to the root. One can build the tree for an optimal code by working through the characters by increasing frequency: if only two characters remain, encode one as 0 and the other as 1 arbitrarily; otherwise merge the two least frequent into a meta-character whose frequency is the sum of theirs and recurse on the smaller alphabet. Finally, expand the meta-characters, assigning 0 to one child and 1 to the other and recursively expanding any meta-characters among the children (a code sketch follows this list). Optimality follows by induction on the number of characters: the two least frequent characters may be assumed to be siblings at the greatest depth of some optimal tree, so an optimal tree for the alphabet with them merged expands into an optimal tree for the full alphabet. Using a priority queue, the algorithm runs in O(k log k), as it performs k-1 merges, each requiring O(log k) heap operations.
  * No insights.
  * No questions.
  * I dislike presentation of the wrong solution first, except when giving a history of the development of the algorithm; I am wary of making someone's first association be a faulty algorithm. But otherwise well written.
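
Again as a rough sketch (mine, not from the book), here is the construction in Python using the standard heapq module as the priority queue; meta-characters are represented as nested tuples, and the function name ''huffman_code'' and the {character: frequency} input format are assumptions made only for illustration.

<code python>
import heapq
from itertools import count

def huffman_code(freqs):
    """Build a prefix code (character -> bit string) from a frequency dict."""
    tie = count()        # unique tie-breaker so the heap never compares trees
    heap = [(f, next(tie), ch) for ch, f in freqs.items()]
    heapq.heapify(heap)

    if len(heap) == 1:                  # degenerate one-character alphabet
        return {heap[0][2]: "0"}

    while len(heap) > 1:                # merge the two least frequent items
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tie), (left, right)))

    codes = {}

    def assign(node, prefix):           # expand meta-characters: 0 left, 1 right
        if isinstance(node, tuple):
            assign(node[0], prefix + "0")
            assign(node[1], prefix + "1")
        else:
            codes[node] = prefix

    assign(heap[0][2], "")
    return codes


print(huffman_code({"a": 0.32, "b": 0.25, "c": 0.20, "d": 0.18, "e": 0.05}))
# a, b, c get 2-bit codes and d, e get 3-bit codes; the exact bits depend on
# how ties between children are broken.
</code>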