Minimum spanning trees and clusters

The Euclidean minimum spanning tree for a set of points is the shortest tree connecting all the points. It can be used for clustering by dropping the longest edges. There are three related greedy algorithms for the minimum spanning tree, all of which come very close to the lower bound for asymptotic complexity. Kruskal’s algorithm (also known as ‘single linkage clustering’) repeatedly adds the shortest edge that does not form a cycle, and ends up with a tree only at the last step. Prim’s algorithm maintains a tree at all times, adding the shortest edge that connects a point in the tree to a point outside it. Boruvka’s algorithm maintains a forest that is reduced at each step by connecting each tree to its closest neighbour.

Kruskal’s algorithm is efficient for sparse graphs, but for a Euclidean spanning tree on n points it requires O(n²) space to hold all the pairwise edges. Boruvka’s algorithm is attractive largely because it can be run in parallel very efficiently. Prim’s algorithm is the easiest to implement for large Euclidean minimum spanning trees. For general graphs, or for very high-dimensional spaces, an efficient way to implement Prim’s algorithm is for each point outside the tree to keep track of its nearest neighbour in the tree; these potential links are then placed in a priority queue. The algorithms are described in detail in Sedgewick’s Algorithms.

For moderate-dimensional Euclidean space there is a more efficient strategy, relying on the ability to find nearest neighbours quickly using a k-d tree. Each point in the tree keeps track of its nearest neighbour outside the tree. These potential links are placed in a priority queue ordered by length, and the shortest link is removed and used at each step. Updating is lazy: when a link reaches the front of the queue, its target point may already have been added to the tree, in which case the link is updated with the new nearest neighbour outside the tree and returned to the queue.
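The priority-queue variant of Prim’s algorithm with lazy updating can be sketched as follows. This is an illustrative implementation, not code from the text: a brute-force scan stands in for the k-d tree nearest-neighbour search (in practice one would substitute a spatial index such as scipy.spatial.cKDTree), and all names are my own.

```python
import heapq
import math

def prim_mst(points):
    """Prim's algorithm with lazy updating (sketch).

    Each tree point keeps one candidate link to its nearest neighbour
    outside the tree, held in a priority queue ordered by length.
    A brute-force scan stands in for the k-d tree search here.
    """
    n = len(points)
    in_tree = [False] * n
    in_tree[0] = True
    outside = set(range(1, n))

    def nearest_outside(i):
        # Nearest point to points[i] that is not yet in the tree.
        best, best_d = None, math.inf
        for j in outside:
            d = math.dist(points[i], points[j])
            if d < best_d:
                best, best_d = j, d
        return best, best_d

    pq = []                      # entries: (length, tree point, target)
    j, d = nearest_outside(0)
    heapq.heappush(pq, (d, 0, j))

    edges = []
    while len(edges) < n - 1:
        d, i, j = heapq.heappop(pq)
        if in_tree[j]:
            # Lazy update: the target was added meanwhile, so refresh
            # point i's candidate link and return it to the queue.
            j2, d2 = nearest_outside(i)
            heapq.heappush(pq, (d2, i, j2))
            continue
        edges.append((i, j, d))
        in_tree[j] = True
        outside.discard(j)
        if outside:
            # Both endpoints now need fresh links to the outside.
            for k in (i, j):
                t, dt = nearest_outside(k)
                heapq.heappush(pq, (dt, k, t))
    return edges
```

Lazy updating is sound here because removing points from the outside set can only lengthen a point’s nearest-outside distance, so a queued link whose target is still outside remains that point’s true nearest link.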
If the data are in the form of tight, well-separated clusters, the nearest-neighbour finder will be highly efficient towards the centre of a cluster, but slower when dealing with outliers and debris. Combining features of Boruvka’s and Prim’s algorithms, we can run Prim’s algorithm until the next link to be added is longer than a threshold, then restart the tree at a randomly chosen point. The result is likely to be a small forest of large trees representing the distinct clusters, together with an additional small tree representing the isolated points. Approaches similar to this were first developed by Bentley & Friedman (1978).
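The threshold-and-restart hybrid described above might be sketched like this. The function name, the fixed random seed, and the simple O(n²) nearest-link scan are all illustrative assumptions of mine; a real implementation would reuse the priority-queue machinery and a k-d tree.

```python
import math
import random

def threshold_forest(points, threshold, seed=0):
    """Grow trees with Prim's algorithm, restarting whenever the next
    link would exceed `threshold` (sketch of the Boruvka/Prim hybrid).

    Returns a list of trees, each a list of point indices; well-separated
    clusters end up in separate trees.
    """
    rng = random.Random(seed)
    remaining = set(range(len(points)))
    trees = []
    while remaining:
        # Restart from a randomly chosen unvisited point.
        start = rng.choice(sorted(remaining))
        remaining.discard(start)
        tree = [start]
        while remaining:
            # Shortest link from the current tree to any remaining point.
            best_j, best_d = None, math.inf
            for i in tree:
                for j in remaining:
                    d = math.dist(points[i], points[j])
                    if d < best_d:
                        best_j, best_d = j, d
            if best_d > threshold:
                break  # next link too long: close this tree and restart
            tree.append(best_j)
            remaining.discard(best_j)
        trees.append(tree)
    return trees
```

With two point groups separated by more than the threshold, each run of Prim’s algorithm exhausts one group and then stops, so the returned forest has one tree per cluster.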