Section 4.6 Implementing Kruskal's Algorithm: The Union-Find Data Structure

Kruskal's Algorithm requires a way to rapidly identify the connected components to which two nodes on either end of an edge belong. This requirement is met by a data structure called the Union-Find Data Structure. This data structure supports three operations: MakeUnionFind(S) which returns a Union-Find data structure on the set S where each element is in its own set, Find(u) which returns the name of the set containing node u, and Union(A, B) which merges sets A and B into a single set.

The chapter discusses two different implementations for the Union-Find data structure, but the first is largely introduced as a comparative tool for the operation of the second so I will omit it in these notes since many components are similar.

The Union-Find data structure favored in the section makes use of pointers. Initially, the MakeUnionFind(S) operation makes a record for each element in the set and sets each elements pointer to point to the element itself. When two disjoint sets are united, the pointer of one of them will be reset to point to the other node, which now is the name of the set. The rest of the pointers in the joined sets do not need to be reset. This means that now to find the name of the set which contains a node u, pointers must be followed until there are no more remaining, at which point we have reached the eponymous node of the set. For this implementation, Union will require constant time since it is merely updating a pointer, MakeUnionFind(S) will require linear time on the nodes of the set and Find will require O(log n) time. There is a way to improve the running time of the Find function, however. This process is called compression of the path traversed by a call of the Find function and it works by resetting all the pointers along that does not change the O(log n) worst-case running time. The real gain comes from subsequent calls of Find along that path, since they now run in O(k) time!

To implement Kruskal's algorithm using a Union-Find data structure, the edges must first be sorted which takes O(m log n) time. Then, the ends of each edge are checked for their connected component using a Find call for each. If they are not in the same connected component then the connected components are united with a Union call on the Finds for each node. There will be at most 2 Find calls for each edge and n-1 Union calls and using either the array-based Union-Find or the pointer-based implementation of the data structure will yield a O(m log n) running time.

I didn't find this chapter to be too dense and the diagrams filled in the gaps fairly well so I would give it a reasonably high readability.