Entry Six
Chapter 4.7
Clustering
How do minimum spanning trees play a role in clustering?
Clustering is when you organize a collection of objects into coherent groups. How might we organize these groups? By how similar they are. We use a distance function on the objects with the assumption that those closer together are more similar to each other.
The Problem: Given a set of objects and a distance function on the objects, divide the objects into groups so that objects in the same group are close and ones in different groups are far apart.
Clusterings of Maximum Spacing- Given a set of objects, each pair has a numeric distance; that distance is greater than 0 for distinct objects, and distances are symmetric. Given the number of groups k, we want to divide the set of objects into k groups. A k-clustering is a partition of the set into k non-empty sets. The spacing of a clustering is the minimum distance between any pair of points in different clusters. Our goal is to maximize this minimum spacing: we want points in different clusters to be as far apart from each other as possible.
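As a concrete check on the definition, here is a minimal sketch (the function name and the toy points on a line are my own, not from the book) that computes the spacing of a given k-clustering:

```python
from itertools import combinations

def spacing(clusters, dist):
    """Spacing of a clustering: the minimum distance between any
    pair of points that lie in DIFFERENT clusters."""
    return min(
        dist[p][q]
        for a, b in combinations(range(len(clusters)), 2)
        for p in clusters[a]
        for q in clusters[b]
    )

# Toy example: points on a line, distance = absolute difference
points = [0, 1, 2, 10, 11, 12]
dist = {p: {q: abs(p - q) for q in points} for p in points}
print(spacing([[0, 1, 2], [10, 11, 12]], dist))  # 8 (between 2 and 10)
```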
The Algorithm:
Grow a graph on the vertex set U edge by edge with connected components corresponding to clusters.
- draw an edge between the closest pair of points
- add an edge between the next closest pair of points
- continue in order of increasing distance
- if two points pi and pj are already in the same cluster, don't add an edge between them (b/c it does not change the set of components and so we never create a cycle)
The graph is a union of trees. The algorithm is just like Kruskal's, but instead of continuing until we have one connected component, we stop once there are k connected components — that is, we stop before adding the last k-1 edges Kruskal's would add. Equivalently, we are building a minimum spanning tree and deleting its k-1 most expensive edges.
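The steps above can be sketched as a Kruskal-style pass with union-find (a sketch; the function names and toy input are my own assumptions):

```python
from itertools import combinations

def k_clustering(points, dist, k):
    """Run Kruskal's algorithm on all pairwise edges in increasing
    order of distance, but stop once only k components remain."""
    parent = list(range(len(points)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    edges = sorted(combinations(range(len(points)), 2),
                   key=lambda e: dist(points[e[0]], points[e[1]]))
    components = len(points)
    for i, j in edges:
        if components == k:                # stop at k clusters
            break
        ri, rj = find(i), find(j)
        if ri != rj:                       # skip edges inside a cluster
            parent[ri] = rj
            components -= 1

    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(points[i])
    return list(groups.values())

pts = [0, 1, 2, 10, 11, 12]
print(k_clustering(pts, lambda a, b: abs(a - b), 2))
# two groups: [0, 1, 2] and [10, 11, 12]
```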
(4.26) The components C1, C2,…,Ck formed by deleting the k-1 most expensive edges of the minimum spanning tree T constitute a k-clustering of maximum spacing.
Proof:
Let C be the clustering our algorithm produces. The spacing of C is the length d* of the (k-1)st most expensive edge of the MST — the first edge we did not add. Consider any other k-clustering C'; we will show that the spacing of C' is at most the spacing of C. Since the clusterings are not the same, some pair of points belongs to the same cluster in C but to different clusters in C'. Because our algorithm merged those two points, the path between them in the MST uses only edges of length at most d*. Somewhere along that path an edge must cross between two clusters of C', so the spacing of C' is at most d*, the spacing of C. ~*This proof in the book is really wordy and confusing*~
This section was a really short and easy read, other than the proof I thought it was decent. Readability 9/10.
Chapter 4.8
Huffman Codes and Data Compression
In this problem we use greedy rules to “shrink the size of the problem instance, so that an equivalent smaller problem can then be solved by recursion.”
The Problem:
Encoding Symbols Using Bits
- we want to use a fixed number of bits to encode the letters of the alphabet
- we want to make this more efficient by using a smaller number of bits for frequent letters and a larger number of bits for less frequently used letters
- always want to represent files as compactly as possible
- compression algorithms reduce the space of input files through encoding schemes
Variable-Length Encoding Schemes
- Morse code- translating letters into sequence of dots and dashes
- such short strings make the encoding ambiguous
Prefix Codes
- ambiguity happens when there is a pair of letters such that the bit string that encodes one letter is a prefix of the bit string that encodes the other
- we have to map letters to bit strings so no encoding is a prefix to any other
- a prefix code is a function that maps each letter to a sequence of 0's and 1's so that for any two distinct letters, neither sequence is a prefix of the other
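The prefix-free condition is easy to test mechanically. A minimal sketch (function name and example codes are my own): after sorting the codewords, any prefix pair becomes adjacent, so one pass suffices.

```python
def is_prefix_code(code):
    """Check that no codeword is a prefix of another codeword."""
    words = sorted(code.values())  # a prefix sorts right before its extension
    return all(not b.startswith(a) for a, b in zip(words, words[1:]))

print(is_prefix_code({'a': '0', 'b': '10', 'c': '11'}))  # True
print(is_prefix_code({'a': '0', 'b': '01', 'c': '11'}))  # False: '0' prefixes '01'
```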
Optimal Prefix Codes
- we want to take advantage of some letters being more frequent than others when encoding our letters
- each letter has its own frequency; over the n letters in the set, the frequencies sum to 1
- To find the average length of our encoding, we take the sum, over all letters, of the frequency of the letter times the length of the bit string used to encode that letter
- a prefix code that minimizes the average number of bits per letter is an optimal prefix code
The Algorithm:
We use a greedy method to construct an optimal prefix code by developing tree representation of the codes.
Representing Prefix Codes Using Binary Trees
- binary tree- rooted tree in which each node that is not a leaf has at most two children
- the number of leaves is equal to the size of the alphabet
- follow the path from the root to the leaf: if the path goes left, record a 0; if the path goes right, record a 1
(4.27) The encoding of S constructed from T is a prefix code
Proof
The letters can only be leaves, so the encoding of x cannot be a prefix of the encoding of y: that would require the leaf for x to lie on the path from the root to the leaf for y.
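The root-to-leaf reading can be sketched with a tiny recursion (representing an internal node as a pair and a leaf as its letter is my own convention, not the book's):

```python
def codes_from_tree(tree, path=''):
    """Read off a prefix code from a binary tree: left edge = '0',
    right edge = '1'; letters appear only at the leaves."""
    if isinstance(tree, str):          # a leaf holding a letter
        return {tree: path}
    left, right = tree                 # an internal node as a pair
    out = codes_from_tree(left, path + '0')
    out.update(codes_from_tree(right, path + '1'))
    return out

# Root with leaf 'a' on the left and the subtree ('b', 'c') on the right
print(codes_from_tree(('a', ('b', 'c'))))  # {'a': '0', 'b': '10', 'c': '11'}
```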
We can also build a tree from a prefix code recursively!! So the search for an optimal prefix code is really a search over binary trees. We want the tree that minimizes the weighted average of the depths of all leaves, weighted by the letters' frequencies.
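The greedy construction this chapter builds toward (Huffman's algorithm) finds that tree by repeatedly merging the two lowest-frequency subtrees. A sketch under my own conventions (a leaf is a letter, an internal node is a pair; the counter is only a tie-breaker so the heap tuples stay comparable):

```python
import heapq
from itertools import count

def huffman_tree(freq):
    """Greedily merge the two lowest-frequency trees until one remains."""
    tick = count()  # tie-breaker: keeps tuple comparison away from the trees
    heap = [(f, next(tick), letter) for letter, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)    # two cheapest trees...
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tick), (t1, t2)))  # ...merged
    return heap[0][2]

print(huffman_tree({'a': 0.5, 'b': 0.25, 'c': 0.25}))
# ('a', ('b', 'c')) — 'a' sits at depth 1, 'b' and 'c' at depth 2
```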