Differences

This shows you the differences between two versions of the page.

--- courses:cs211:winter2018:journals:holmesr:section_4.1 [2018/03/05 03:56] – holmesr
+++ courses:cs211:winter2018:journals:holmesr:section_4.1 [2018/04/19 20:08] (current) – admin
@@ Line 60: / Line 60: @@
 This section was quite dense to cut through due to the lengthy and notation-dependent proofs. The pictures helped somewhat but I did have to resort to class notes to demystify some of what was trying to be conveyed in the proofs.
+===== Section 4.6 Implementing Kruskal's Algorithm: The Union-Find Data Structure =====
+Kruskal's Algorithm requires a way to rapidly identify the connected components to which two nodes on either end of an edge belong. This requirement is met by a data structure called the Union-Find Data Structure. This data structure supports three operations: MakeUnionFind(S) which returns a Union-Find data structure on the set S where each element is in its own set, Find(u) which returns the name of the set containing node u, and Union(A, B) which merges sets A and B into a single set.
+The chapter discusses two different implementations for the Union-Find data structure, but the first is largely introduced as a comparative tool for the operation of the second so I will omit it in these notes since many components are similar.
+The Union-Find data structure favored in the section makes use of pointers. Initially, the MakeUnionFind(S) operation makes a record for each element in the set and sets each elements pointer to point to the element itself. When two disjoint sets are united, the pointer of one of them will be reset to point to the other node, which now is the name of the set. The rest of the pointers in the joined sets do not need to be reset. This means that now to find the name of the set which contains a node u, pointers must be followed until there are no more remaining, at which point we have reached the eponymous node of the set. For this implementation, Union will require constant time since it is merely updating a pointer, MakeUnionFind(S) will require linear time on the nodes of the set and Find will require O(log n) time. There is a way to improve the running time of the Find function, however. This process is called //compression// of the path traversed by a call of the Find function and it works by resetting all the pointers along that does not change the O(log n) worst-case running time. The real gain comes from subsequent calls of Find along that path, since they now run in O(k) time!
+To implement Kruskal's algorithm using a Union-Find data structure, the edges must first be sorted which takes O(m log n) time. Then, the ends of each edge are checked for their connected component using a Find call for each. If they are not in the same connected component then the connected components are united with a Union call on the Finds for each node. There will be at most 2 Find calls for each edge and n-1 Union calls and using either the array-based Union-Find or the pointer-based implementation of the data structure will yield a O(m log n) running time.
+I didn't find this chapter to be too dense and the diagrams filled in the gaps fairly well so I would give it a reasonably high readability.
+===== Section 4.7 Clustering =====
+Clustering problems revolves around classifying objects into groups based on their distance from one another. For some collections of items, this distance can be physical but for many others, the distance is a measure of similarity for which a distance function must be defined. The idea of a clustering is that items more like each other will be in the same cluster, while items less like each other will be in disparate clusters. The trouble is finding an efficient way to partition these objects into a specified number k clusters.
+It turns out that deleting the k-1 costliest edges from Kruskal's algorithm's minimum spanning tree will return a k-clustering of maximum spacing, with spacing being defined as the minimum distance between any pair of points lying in different clusters. The last k-1 edges are deleted from the tree returned by the algorithm because they are the most expensive edges and therefore their distance will be the longest in a weighted graph. We do not want these edges for a k-clustering because they would create a single connected component where k different connected components are desired.
+Since a clustering is yielded by an execution of Kruskal's algorithm that simply terminates before adding the last k-1 edges, it will use the same O(m log n) time that normal Kruskal's algorithm uses.
+I found this chapter surprisingly intuitive and stimulating and thus it was easily readable.
+===== Section 4.8 Huffman Codes and Data Compression =====
+This section begins by discussing different ways which one may encode data. One example discussed is the scheme using 1's and 0's which allows one to encode 2<sup>b</sup> symbols using b bits. Another scheme is one such as Morse Code, which uses different lengths of strings to encode different characters, based on frequency. Letters that occur more frequently, such as e, t, and a, use one- or two-bit strings while less frequent letters use four or five bit strings. This type of encoding is advantageous because the more frequent letters are quicker to encode, which saves time in the long run.
+The problem with such an encoding is that it is sometimes ambiguous. Since e is 0, t is 1, and 01 is a, a string 01 may either be et or a. The solution to this problem is adding a slight pause in between letters, but if the pause isn't long enough it may be missed and additionally the pause becomes its own type of bit.
+Such an ambiguity can be overcome by a construction called a prefix code. A prefix code is one in which the encoding for one letter in the alphabet can not be a prefix for another letter. In the Morse code example, e's encoding - 0, was a prefix for the encoding of a - 01. An example of a prefix code for the alphabet {a, b, c} could be a = 11, b = 01, c = 00. Using variable length encoding, it is possible to make an optimal prefix code by setting the most used letters to represent the shortest characters.
+A metric known as the ABL, or Average Bits per Letter, can be used to compare different variable length encodings and the lowest ABL is the optimal encoding. ABL is found by multiplying the length of each letter's encoding by its frequency in the document being encoded and then summing them together. There are two ways to accomplish this: Shannon-Fano codes and Huffman codes. Both these strategies involve building trees and assigning the a 0 to the left child and a 1 to the right child, but the way they build the trees differs vastly.
+The Shannon-Fano codes split the set of letters into two subsets of equal frequency and then recall themselves recursively on the subsets until each letter gets its own encoding. This top-down approach was a good start but a precursor to the superior Huffman codes. The important difference between Huffman codes and Shannon-Fano codes is that Huffman codes build up from the bottom, assigning the two least frequent letters to be children of a meta-letter and re-adding the meta-letter as a component of the alphabet for the algorithm to recursively operate on.
+This algorithm lends itself to the use of a priority queue because letters can be added into the priority queue with their frequencies being their keys. Now all that remains is insertion and deletion, both of which occur in O(log n) time. Since the algorithm is recursive and adds meta-letters back into the alphabet, it will carry out n loops which each use the aforementioned O(log n) time, thus giving O(n log n) total time.