Entry Six
Chapter 4.7
Clustering
How do minimum spanning trees play a role in clustering?
Clustering is when you organize a collection of objects into coherent groups. How might we organize these groups? By how similar they are. We use a distance function on the objects with the assumption that those closer together are more similar to each other.
The Problem: Given a set of objects and a distance function on the objects, divide the objects into groups so that objects in the same group are close and ones in different groups are far apart.
Clusterings of Maximum Spacing- Given a set of objects, each pair has a numeric distance; that distance is greater than 0 for distinct objects, and distances are symmetric. Given the number of groups k, we want to divide the set of objects into k groups. A k-clustering is a partition of the set into k non-empty sets. The spacing of a clustering is the minimum distance between any pair of points in different clusters. Our goal is to maximize this minimum spacing: we want points in different clusters to be as far apart from each other as possible.
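As a concrete check on the definition, here is a minimal sketch (the function name and the toy points on a line are my own, not from the book) that computes the spacing of a given k-clustering:

```python
from itertools import combinations

def spacing(clusters, dist):
    """Spacing of a clustering: the minimum distance between any
    pair of points that lie in DIFFERENT clusters."""
    return min(
        dist[p][q]
        for a, b in combinations(range(len(clusters)), 2)
        for p in clusters[a]
        for q in clusters[b]
    )

# Toy example: points on a line, distance = absolute difference
points = [0, 1, 2, 10, 11, 12]
dist = {p: {q: abs(p - q) for q in points} for p in points}
print(spacing([[0, 1, 2], [10, 11, 12]], dist))  # 8 (between 2 and 10)
```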
The Algorithm:
Grow a graph on the vertex set U edge by edge with connected components corresponding to clusters.
- draw an edge between the closest pair of points
- add an edge between the next closest pair of points
- continue in order of increasing distance
- if two points pi and pj are already in the same cluster, don't add an edge between them (b/c it does not change the set of components and so we never create a cycle)
The graph is a union of trees. The algorithm is just like Kruskal's, but instead of continuing until we have one connected component, we stop once there are k connected components — that is, we stop before adding the last k-1 edges Kruskal's would add. Equivalently, we are building a minimum spanning tree and deleting its k-1 most expensive edges.
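The steps above can be sketched as a Kruskal-style pass with union-find (a sketch; the function names and toy input are my own assumptions):

```python
from itertools import combinations

def k_clustering(points, dist, k):
    """Run Kruskal's algorithm on all pairwise edges in increasing
    order of distance, but stop once only k components remain."""
    parent = list(range(len(points)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    edges = sorted(combinations(range(len(points)), 2),
                   key=lambda e: dist(points[e[0]], points[e[1]]))
    components = len(points)
    for i, j in edges:
        if components == k:                # stop at k clusters
            break
        ri, rj = find(i), find(j)
        if ri != rj:                       # skip edges inside a cluster
            parent[ri] = rj
            components -= 1

    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(points[i])
    return list(groups.values())

pts = [0, 1, 2, 10, 11, 12]
print(k_clustering(pts, lambda a, b: abs(a - b), 2))
# two groups: [0, 1, 2] and [10, 11, 12]
```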
(4.26) The components C1, C2,…,Ck formed by deleting the k-1 most expensive edges of the minimum spanning tree T constitute a k-clustering of maximum spacing.
Proof:
Let C be the clustering our algorithm produces. The spacing of C is the length d* of the (k-1)st most expensive edge of the MST — the first edge we did not add. Consider any other k-clustering C'; we will show that the spacing of C' is at most the spacing of C. Since the clusterings are not the same, some pair of points belongs to the same cluster in C but to different clusters in C'. Because our algorithm merged those two points, the path between them in the MST uses only edges of length at most d*. Somewhere along that path an edge must cross between two clusters of C', so the spacing of C' is at most d*, the spacing of C. ~*This proof in the book is really wordy and confusing*~
This section was a really short and easy read, other than the proof I thought it was decent. Readability 9/10.
Chapter 4.8
Huffman Codes and Data Compression
In this problem we use greedy rules to “shrink the size of the problem instance, so that an equivalent smaller problem can then be solved by recursion.”
The Problem:
Encoding Symbols Using Bits
- we want to use a fixed number of bits to encode the letters of the alphabet
- we want to make this more efficient by using a smaller number of bits for frequent letters and a larger number of bits for less frequently used letters
- always want to represent files as compactly as possible
- compression algorithms reduce the space of input files through encoding schemes
Variable-Length Encoding Schemes
- Morse code- translating letters into sequence of dots and dashes
- such short strings make the encoding ambiguous
Prefix Codes
- ambiguity happens when there is a pair of letters such that the bit string that encodes one letter is a prefix of the bit string that encodes the other
- we have to map letters to bit strings so no encoding is a prefix to any other
- a prefix code is a function that maps each letter to a sequence of 0's and 1's so that for any two distinct letters, neither sequence is a prefix of the other
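The prefix-free condition is easy to test mechanically. A minimal sketch (function name and example codes are my own): after sorting the codewords, any prefix pair becomes adjacent, so one pass suffices.

```python
def is_prefix_code(code):
    """Check that no codeword is a prefix of another codeword."""
    words = sorted(code.values())  # a prefix sorts right before its extension
    return all(not b.startswith(a) for a, b in zip(words, words[1:]))

print(is_prefix_code({'a': '0', 'b': '10', 'c': '11'}))  # True
print(is_prefix_code({'a': '0', 'b': '01', 'c': '11'}))  # False: '0' prefixes '01'
```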
Optimal Prefix Codes
- we want to take advantage of some letters being more frequent than others when encoding our letters
- each letter has its own frequency; over the n letters in the set, the frequencies sum to 1
- To find the average length of our encoding, we take the sum, over all letters, of the frequency of the letter times the length of the bit string used to encode that letter
- a prefix code that minimizes the average number of bits per letter is an optimal prefix code
The Algorithm:
We use a greedy method to construct an optimal prefix code by developing tree representation of the codes.
Representing Prefix Codes Using Binary Trees
- binary tree- rooted tree in which each node that is not a leaf has at most two children
- the number of leaves is equal to the size of the alphabet
- follow the path from the root to the leaf: if the path goes left, record a 0; if the path goes right, record a 1
(4.27) The encoding of S constructed from T is a prefix code
Proof
The letters can only be leaves, so the encoding of x cannot be a prefix of the encoding of y: that would require the leaf for x to lie on the path from the root to the leaf for y.
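The root-to-leaf reading can be sketched with a tiny recursion (representing an internal node as a pair and a leaf as its letter is my own convention, not the book's):

```python
def codes_from_tree(tree, path=''):
    """Read off a prefix code from a binary tree: left edge = '0',
    right edge = '1'; letters appear only at the leaves."""
    if isinstance(tree, str):          # a leaf holding a letter
        return {tree: path}
    left, right = tree                 # an internal node as a pair
    out = codes_from_tree(left, path + '0')
    out.update(codes_from_tree(right, path + '1'))
    return out

# Root with leaf 'a' on the left and the subtree ('b', 'c') on the right
print(codes_from_tree(('a', ('b', 'c'))))  # {'a': '0', 'b': '10', 'c': '11'}
```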
We can also build a tree from a prefix code recursively!! So the search for an optimal prefix code is really a search over binary trees. We want the tree that minimizes the weighted average of the depths of all leaves, weighted by the letters' frequencies.
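The greedy construction this chapter builds toward (Huffman's algorithm) finds that tree by repeatedly merging the two lowest-frequency subtrees. A sketch under my own conventions (a leaf is a letter, an internal node is a pair; the counter is only a tie-breaker so the heap tuples stay comparable):

```python
import heapq
from itertools import count

def huffman_tree(freq):
    """Greedily merge the two lowest-frequency trees until one remains."""
    tick = count()  # tie-breaker: keeps tuple comparison away from the trees
    heap = [(f, next(tick), letter) for letter, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)    # two cheapest trees...
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tick), (t1, t2)))  # ...merged
    return heap[0][2]

print(huffman_tree({'a': 0.5, 'b': 0.25, 'c': 0.25}))
# ('a', ('b', 'c')) — 'a' sits at depth 1, 'b' and 'c' at depth 2
```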