Randomized algorithms for fast Bayesian hierarchical clustering

We present two new algorithms for fast Bayesian Hierarchical Clustering on large data sets. Bayesian Hierarchical Clustering (BHC) [3] is a method for agglomerative hierarchical clustering based on evaluating marginal likelihoods of a probabilistic model. BHC has several advantages over traditional distance-based agglomerative clustering algorithms. It defines a probabilistic model of the data and uses Bayesian hypothesis testing to decide which merges are advantageous and to output the recommended depth of the tree. Moreover, the algorithm can be interpreted as a novel fast bottom-up approximate inference method for a Dirichlet process (i.e. countably infinite) mixture model (DPM). While the original BHC algorithm has O(n^2) computational complexity, the two new randomized algorithms are O(n log n) and O(n).
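The Bayesian hypothesis test that decides whether a merge is advantageous can be sketched as follows. This is a minimal illustration of the merge test from the original BHC formulation, not of the two randomized algorithms. The Beta-Bernoulli data model, the concentration parameter alpha = 1, and all names (Node, make_leaf, merge, log_marginal_bernoulli) are assumptions chosen for illustration only.

```python
import math
from dataclasses import dataclass

def log_marginal_bernoulli(rows, a=1.0, b=1.0):
    """Log marginal likelihood of binary rows in ONE cluster, assuming
    independent Beta(a, b)-Bernoulli features (an illustrative model)."""
    n = len(rows)
    total = 0.0
    for col in zip(*rows):
        ones = sum(col)
        total += (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                  + math.lgamma(a + ones) + math.lgamma(b + n - ones)
                  - math.lgamma(a + b + n))
    return total

def logaddexp(x, y):
    """Numerically stable log(exp(x) + exp(y))."""
    m = max(x, y)
    return m + math.log(math.exp(x - m) + math.exp(y - m))

@dataclass
class Node:
    rows: list        # data points under this subtree
    log_d: float      # log of the DPM-derived prior bookkeeping term d
    log_p: float      # log p(data | tree) for this subtree

def make_leaf(row, alpha=1.0):
    # A leaf holds one point: d = alpha, p(data | tree) = p(data | one cluster).
    return Node([row], math.log(alpha), log_marginal_bernoulli([row]))

def merge(left, right, alpha=1.0):
    """Return the merged node and r, the posterior probability that all
    points under the merged node belong to a single cluster."""
    rows = left.rows + right.rows
    n = len(rows)
    log_ag = math.log(alpha) + math.lgamma(n)             # log(alpha * Gamma(n))
    log_d = logaddexp(log_ag, left.log_d + right.log_d)   # d = alpha*Gamma(n) + d_l*d_r
    log_pi = log_ag - log_d                               # prior of the merge hypothesis
    log_not_pi = left.log_d + right.log_d - log_d         # log(1 - pi)
    log_h1 = log_pi + log_marginal_bernoulli(rows)        # one cluster
    log_h2 = log_not_pi + left.log_p + right.log_p        # keep subtrees separate
    log_p = logaddexp(log_h1, log_h2)
    return Node(rows, log_d, log_p), math.exp(log_h1 - log_p)

# Two identical points versus two opposite points.
_, r_same = merge(make_leaf([1, 1, 1, 1]), make_leaf([1, 1, 1, 1]))
_, r_diff = merge(make_leaf([1, 1, 1, 1]), make_leaf([0, 0, 0, 0]))
print(r_same > 0.5, r_diff < 0.5)
```

In this sketch a merge is considered advantageous when r exceeds 0.5, and the tree can be cut at nodes where r falls below 0.5. Evaluating this test for every pair of nodes is what makes the original algorithm quadratic in the number of data points.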

[1] Richard O. Duda, et al. Pattern classification and scene analysis, 1974, A Wiley-Interscience publication.

[2] Russ Bubley, et al. Randomized algorithms, 1995, CSUR.

[3] Katherine A. Heller, et al. Bayesian hierarchical clustering, 2005, ICML.