====== Chapter 4.7: Clustering ======

The problem of clustering arises whenever there is a collection of objects that you want to classify or organize into coherent groups. Naturally, the first thing to do is check how similar or dissimilar each pair of objects is. This is commonly formalized as a **distance function**: objects at a larger distance from one another are less similar to each other. Given a distance function on the objects, the clustering problem seeks to divide them into groups so that objects within the same group are "close" and objects in different groups are "far apart." Our goal is to find a clustering of maximum spacing.

Suppose we are given a set //U// of //n// objects, labeled //p_1, p_2, ..., p_n//. For each pair //p_i, p_j// we have a distance //d(p_i, p_j)//, where //d(p_i, p_i) = 0//, //d(p_i, p_j) > 0// for distinct //p_i// and //p_j//, and distances are symmetric: //d(p_i, p_j) = d(p_j, p_i)//. We want to divide the objects in //U// into //k// groups, for a given parameter //k//. A **k-clustering** of //U// is a partition of //U// into //k// nonempty sets //C_1, C_2, ..., C_k//. The **spacing** of a //k//-clustering is the minimum distance between any pair of points lying in different clusters. We want to maximize this minimum distance.

===== The Algorithm =====

To find a clustering of maximum spacing, we grow a graph on the vertex set //U//; its connected components will be the clusters. We start by drawing an edge between the closest pair of points, then an edge between the next closest pair, and we continue in this way, adding edges in order of increasing distance. However, we only add an edge between points //p_i// and //p_j// if they do not already belong to the same cluster, so the graph-growing process never creates a cycle (each cluster is a tree). We stop once we have //k// clusters/trees.

This is basically Kruskal's Algorithm, except that we stop when we have //k// components rather than a single component. In other words, we stop the algorithm before it adds the last //k-1// edges, which is the same as building the entire minimum spanning tree and then deleting its //k-1// most expensive edges.

This algorithm produces a //k//-clustering of maximum spacing. To see why, note that the spacing of this clustering is exactly the length //d*// of the next edge the algorithm would have added (the //(k-1)//st most expensive edge of the minimum spanning tree). Any other //k//-clustering must separate two points //p// and //p'// that lie in the same greedy cluster; the tree path between them uses only edges of length at most //d*//, and some edge on that path crosses between two clusters of the other solution, so the other clustering's spacing is also at most //d*//. A rough sketch of the procedure in code appears at the bottom of this page.

I would rate this section a 10. It was very easy to read and understand, especially after the lecture (which was kind of tricky). This definitely solidified my understanding of clustering.
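===== A Sketch in Code =====

Below is a minimal sketch of the procedure described above, assuming the input is a symmetric //n x n// distance matrix ''d'' with ''d[i][i] == 0''. The function names ''max_spacing_clusters'' and ''spacing'' are my own, not from the text; the union-find structure is the standard way to track components in Kruskal's Algorithm.

<code python>
def max_spacing_clusters(d, k):
    """Partition points 0..n-1 into k clusters by running Kruskal's
    Algorithm on all pairs and stopping once only k components remain."""
    n = len(d)

    # Union-find: parent pointers with path compression.
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    # Every pair of points is a candidate edge; sort by increasing distance.
    edges = sorted((d[i][j], i, j)
                   for i in range(n) for j in range(i + 1, n))

    components = n
    for dist, i, j in edges:
        if components == k:   # stop before adding the last k-1 edges
            break
        ri, rj = find(i), find(j)
        if ri != rj:          # skip pairs already in the same cluster
            parent[ri] = rj   # merge the two clusters
            components -= 1

    # Group the points by their cluster representative.
    clusters = {}
    for p in range(n):
        clusters.setdefault(find(p), []).append(p)
    return list(clusters.values())


def spacing(d, clusters):
    """Spacing of a clustering: the minimum distance between any
    pair of points lying in different clusters."""
    label = {}
    for c, members in enumerate(clusters):
        for p in members:
            label[p] = c
    n = len(d)
    return min(d[i][j]
               for i in range(n) for j in range(i + 1, n)
               if label[i] != label[j])
</code>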
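As a quick check of the definition, consider four points on a line at positions 0, 1, 5, and 6, with //d// the absolute difference in position. With //k = 2// the algorithm should separate {0, 1} from {5, 6}, giving a spacing of 4:

<code python>
pos = [0, 1, 5, 6]
d = [[abs(a - b) for b in pos] for a in pos]

clusters = max_spacing_clusters(d, 2)
print(clusters)              # [[0, 1], [2, 3]]
print(spacing(d, clusters))  # 4
</code>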