Table of Contents
Entry Six
Chapter 4.7
Clustering
How do minimum spanning trees play a role in clustering???
Clustering is when you organize a collection of objects into coherent groups. How might we organize these groups? By how similar they are. We use a distance function on the objects with the assumption that those closer together are more similar to each other.
The Problem: Given a set of objects and a distance function on the objects, divide the objects into groups so that objects in the same group are close and ones in different groups are far apart.
Clusterings of Maximum Spacing- Given a set of objects each pair has a numeric distance and that distance is greater than 0 for distinct objects and the distances are symmetric. Given the number of groups k, we want to divide the set of objects into k groups. A k-clustering is a partition of the set into k non-empty sets. The spacing of a clustering is the minimum distance between nay pair of points in different clusters. Our goals is to maximize the minimum spacing. We want points in different clusters to be as far apart as possible from each other.
The Algorithm:
Grow a graph on the vertex set U edge by edge with connected components corresponding to clusters.
- draw an edge between the closest pair of points
- add edge between next closest pair of points
- continue in order of increasing distance
- if two points pi and pj are already in the same cluster, don't add an edge between them (b/c it does not change the set of components and so we never create a cycle)
The graph is a union of trees. The algorithm is just like Kruskal's but instead of continuing until we have one connected component, we stop at k connected components. We stop before it adds k-1 edges. We are creating a minimum spanning tree and deleting the longest edges.
(4.26) The components C1, C2,…,Ck formed by deleting the k-1 most expensive edges of the minimum spanning tree T constitute a k-clustering of maximum spacing.
Proof:
Let C be the clustering we produce from our algorithm. The spacing of our solution is the length of the k-1th most expensive edge in the MST. Consider another solution C' we will show that the spacing of C' is at most the spacing of C. Since the clusterings are not the same there are points that belong to different clusters in C'. Because they both belong in a different cluster in C' it means that each edge has a length of at most the spacing of solution C. ~*This proof in the book is really wordy and confusing*~
This section was a really short and easy read, other than the proof I thought it was decent. Readability 9/10.
Chapter 4.8
Huffman Codes and Data Compression
In this problem we use greedy rules to “shrink the size of the problem instance, so that an equivalent smaller problem can then be solved by recursion.”
The Problem:
Encoding Symbols Using Bits
- we want to use a fixed number of bits to encode the letters of the alphabet
- we want to make this more efficient by using a smaller number of bits for frequent letters and larger number of bits for less frequently used letters
- always want to represent files as compactly as possible
- compression algorithms reduce the space of input files through encoding schemes
Variable-Length Encoding Schemes
- Morse code- translating letters into sequence of dots and dashes
- such short strings makes the encoding ambiguous
Prefix Codes
- ambiguity happens when a pair of letters where the bit string that encodes one letter is the prefix of the bit string that encodes another letter
- we have to map letters to bit strings so no encoding is a prefix to any other
- a prefix code is a function that maps each letter to a sequence of 0's and 1's so that for distinct letters the sequence of either of those letters is not a prefix to the other
Optimal Prefix Codes
- we want to take advantage of some letters being more frequent than other when encoding our letters
- each letter has its own frequency, for n letters in the set, the total frequencies of all n letters equals 1
- To fine the total length of our encoding we take the sum of the number of times a letter occurs times the length of the bit string used to encode that letter
- a prefix code that minimizes the average number of bits per letter is an optimal prefix code
The Algorithm:
We use a greedy method to construct an optimal prefix code by developing tree representation of the codes.
Representing Prefix Codes Using Binary Trees
- binary tree- rooted tree in which each node that is not a leaf has at most two children
- the number of leaves is = to the size of the alphabet
- follow the path from the root to the leaf– if the path goes left record a 0, if the path goes right record a 1
(4.27) The encoding of S constructed from T is a prefix code
Proof: The letters can only be leaves, so an encoding of x cannot be a prefix to the encoding of y because x is not a node on the path to y.
We can also build a tree from a prefix code recursively!! So the search for a binary tree is the search for a prefix code. We want the tree that minimizes the weighted average (frequencies of the letters) of the depths of all leaves.
A binary tree is full if each node that is not a leaf has two children.
(4.28) The binary tree corresponding to the optimal prefix code is full. (p.168)
We prove this using an exchange argument.
The Top-Down Approach
- we want to produce a tree with the smallest average leaf depth
- try and split the alphabet into as nearly balanced split of the total frequencies as possible
- recursively construct prefix codes for the split sets independently
- this does not guarantee always producing an optimal prefix code
If we knew the binary tree corresponding to an optimal prefix code but didn't know which letters the leaves were we can complete the tree with an easy fact. (4.31) There is an optimal prefix code, with corresponding tree T*, in which the two lowest-frequency letters are assigned to leaves that are siblings in T*.
Lock the two lowest frequencies together into a metaletter…recursive strategy:
Huffman's Algorithm:
To construct a prefix code fo an alphabet S, with given frequencies: If S has two letters then Encode one letter using 0 and the other letter using 1 Else Let y* and z* be the two lowest-frequency letters Form a new alphabet S' by deleting y* and z* and replacing them with a new letter w of frequency f(y)+f(z) Recursively construct a prefix code A' for S', with tree T' Define a prefix code for S as follows: Start with T' Take the leaf labeled w and add two children below it labeled y* and z* Endif
The prefix code this algorithm produces is called Huffman code. This is a bottom-up approach by focusing on leaves with two lowest frequencies and moving up to leaves with highest frequencies.
Analysis:
We prove the optimality of Huffman's by induction. The proof is really long and complicated but follows to prove by contradiction that the tree Huffman's produces is optimal. The total runtime of Huffman's algorithm is O(k log k). If we use a PQ by a heap we can inset and extract the minimum in O(log k) time which we do a total of k times.
This section was realllllly long to read which made it more boring, but the presentation of it in class was much more interesting. The proof was definitely a part of the chapter that I will have to come back to to understand. Four pages of a lot of variables was intimidating. I would rate this readability: 7/10
Chapter 5.1
A First Recurrence: The Mergesort Algorithm
In this section we are beginning divide-and-conquer algorithms!!
Mergesort
- sorts given list of numbers
- divides list in two equal parts
- sort each half recursively
- combine the results spending linear time for the initial division and final recombining
Let T(n) be the worst-case running time of the algorithm on a list of size n. To divide the list into two even parts (n/2) takes O(n) time. For each division it takes T(n/2) to solve the sorting. It then spends O(n) to combine the lists back together. SO….
(5.1) For some constant c, T(n) =< 2T(n/2) + cn when n > 2 and T(2) =< c.
T(n) is bounded by an inequality. There is a base case that says T(n) is equal to a constant when n is a constant. We have to solve the recurrence so that T is only on the LHS of the inequality.
How to solve recurrences:
- “unroll” the recursion by accounting for the run time in the first few levels then specify a pattern for as the recursion expands. Sum the runtimes over all levels
- start with a guess and check if it works by substituting it into the recursion. Justify by induction!
Unrolling the Mergesort Recurrence
- level 1: problem size n, takes cn time
- level 2: two problems size n/2 each takes cn/2 time for a total of at most cn
- level 3: four problems size n/4 each take cn/4 time for total of at most cn
- pattern: at level j of the recursion the number of subproblems is 2j with size n/2j so takes at most cn/2j time
- sum over all levels: number of times the input is halved is log n so the sum of the cn work on log n levels is O(n log n)
Substituting a Solution into the Mergesort Recurrence
I would rate this chapter a readability of 8/10. It was fairly short and easy to follow but if I hadn't read this chapter after it was presented in class I would have been a little lost.