Lambda means clustering: Automatic parameter search and distributed computing implementation

Recent advances in clustering have shown that enforcing a minimum separation between cluster centroids yields higher-quality clusters than methods that fix the number of clusters in advance, such as k-means. One such algorithm is DP-means, which uses a distance parameter λ as the minimum separation. However, without knowing either the true number of clusters or the underlying distribution, λ is difficult to set, and a poor choice of λ degrades cluster quality. As a general solution for finding λ, in this paper we present λ-means, a clustering algorithm capable of deriving an optimal value for λ automatically. We contribute both a theoretically motivated cluster-based version of λ-means and a faster conflict-based version. We demonstrate that λ-means asymptotically discovers the true underlying value of λ when run on datasets generated by a Dirichlet Process, and that it achieves competitive performance on a real-world test dataset. Further, we demonstrate that on both parallel multicore computers and distributed cluster computers in the cloud, cluster-based λ-means achieves near-perfect speedup, while conflict-based λ-means, although the more efficient algorithm, achieves speedups within a factor of two of the maximum possible.
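
To make the role of λ concrete, the following is a minimal sketch of a DP-means-style assignment loop, not the λ-means procedure itself, which the abstract does not specify. The function name `dp_means`, the `max_iter` cap, the choice of the global mean as the initial centroid, and the convention of comparing `lam` against squared Euclidean distance are all assumptions made for illustration.

```python
import numpy as np

def dp_means(X, lam, max_iter=100):
    """Illustrative DP-means-style clustering (hypothetical helper).

    A point whose squared distance to every current centroid exceeds
    `lam` opens a new cluster; otherwise it joins the nearest centroid.
    """
    X = np.asarray(X, dtype=float)
    centroids = [X.mean(axis=0)]          # assumption: start from the global mean
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        changed = False
        for i, x in enumerate(X):
            d2 = [float(np.sum((x - c) ** 2)) for c in centroids]
            j = int(np.argmin(d2))
            if d2[j] > lam:               # too far from every centroid: new cluster
                centroids.append(x.copy())
                j = len(centroids) - 1
            if labels[i] != j:
                labels[i] = j
                changed = True
        # Drop empty clusters, relabel compactly, and recompute the means.
        used = sorted(set(labels.tolist()))
        remap = {old: new for new, old in enumerate(used)}
        labels = np.array([remap[l] for l in labels.tolist()], dtype=int)
        centroids = [X[labels == k].mean(axis=0) for k in range(len(used))]
        if not changed:
            break
    return np.array(centroids), labels

# Two well-separated groups of points (made-up toy data).
X = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
centroids_small, labels_small = dp_means(X, lam=4.0)
centroids_large, labels_large = dp_means(X, lam=1000.0)
print(len(centroids_small), len(centroids_large))
```

In this sketch the number of clusters is never given directly: a small `lam` splits the toy data into more clusters than a very large `lam` does, which illustrates why a poorly chosen λ changes the clustering and why an automatic search for it is useful.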
