Chapter 4.7: Clustering

The problem of clustering arises whenever there is a collection of objects that you want to classify or organize into coherent groups. A natural first step is to measure how similar or dissimilar each pair of objects is. Such a measure is commonly known as a distance function: objects at a larger distance from one another are less similar to each other. Given a distance function on the objects, the clustering problem seeks to divide them into groups so that objects within the same group are "close" and objects in different groups are "far apart." Our goal is to find a clustering of maximum spacing, a notion made precise below.

Suppose we are given a set U of n objects, labeled p_1, p_2, …, p_n. For each pair p_i, p_j we have a distance d(p_i, p_j), where d(p_i, p_i) = 0, d(p_i, p_j) > 0 for distinct p_i and p_j, and distances are symmetric: d(p_i, p_j) = d(p_j, p_i). We want to divide the objects in U into k groups, for a given parameter k. A k-clustering of U is a partition of U into k nonempty sets C_1, C_2, …, C_k. The spacing of a k-clustering is the minimum distance between any pair of points lying in different clusters. Among all k-clusterings, we want one that maximizes this minimum distance.
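
To make the definition concrete, here is a minimal sketch in Python of computing the spacing of a given clustering. The function name spacing and the toy example are my own illustrations, not from the text; d is assumed to be any symmetric distance function on the points.

    from itertools import combinations

    def spacing(clusters, d):
        # Spacing of a clustering: the minimum distance d(p, q) over all
        # pairs p, q that lie in different clusters.
        return min(d(p, q)
                   for C1, C2 in combinations(clusters, 2)
                   for p in C1 for q in C2)

    # Toy example: points on a line, with distance = absolute difference.
    d = lambda p, q: abs(p - q)
    print(spacing([{1, 2}, {7, 8}], d))  # 5, from the cross-cluster pair (2, 7)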

The Algorithm

To find a clustering of maximum spacing, we grow a graph on the vertex set U; the connected components will be the clusters. We start by drawing an edge between the closest pair of points, then an edge between the next closest pair, and we continue in this way, adding edges in order of increasing distance. However, we only add an edge between points p_i and p_j if they do not already belong to the same cluster, so the graph-growing process never creates a cycle (each cluster is a tree). We stop once we have k clusters/trees.

This is precisely Kruskal's Algorithm, except that we stop when we have k components rather than continuing down to 1 component. In other words, we stop the algorithm just before it adds its last k-1 edges. Equivalently, we could compute the full minimum spanning tree and delete its k-1 most expensive edges. This algorithm produces a k-clustering of maximum spacing.
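
The process above translates almost line for line into code. Below is a minimal sketch, assuming the points are hashable and d is a symmetric distance function; max_spacing_clusters and the union-find helper are hypothetical names of my own, and a production version would add the usual union-by-rank optimization.

    from itertools import combinations

    def max_spacing_clusters(points, d, k):
        # Kruskal-style clustering: add edges in order of increasing
        # distance, skipping edges inside an existing cluster, and stop
        # once exactly k components (clusters/trees) remain.
        parent = {p: p for p in points}

        def find(p):
            # Representative of p's component, with path compression.
            while parent[p] != p:
                parent[p] = parent[parent[p]]
                p = parent[p]
            return p

        # All pairs, sorted by increasing distance -- the order in which
        # Kruskal's Algorithm considers the edges of the complete graph on U.
        edges = sorted(combinations(points, 2), key=lambda e: d(*e))

        components = len(points)
        for p, q in edges:
            if components == k:
                break                  # stop before the last k-1 merges
            rp, rq = find(p), find(q)
            if rp != rq:               # skip edges within a cluster (no cycles)
                parent[rp] = rq        # merge the two clusters
                components -= 1

        # Group the points by the representative of their component.
        clusters = {}
        for p in points:
            clusters.setdefault(find(p), set()).add(p)
        return list(clusters.values())

    # Same toy metric as before: points on a line.
    print(max_spacing_clusters([1, 2, 7, 8, 20], lambda p, q: abs(p - q), 3))
    # -> [{1, 2}, {7, 8}, {20}]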

I would rate this section a 10. It was very easy to read and understand, especially after the lecture (which was kind of tricky). This definitely solidified my understanding of clustering.