Paper information - Lambda means clustering: Automatic parameter search and distributed computing implementation

Recent advances in clustering have shown that enforcing a minimum separation between cluster centroids yields higher-quality clusters than methods that fix the number of clusters in advance, such as k-means. One such algorithm is DP-means, which uses a distance parameter λ to set the minimum separation. However, when neither the true number of clusters nor the underlying distribution is known, choosing λ can be difficult, and a poor choice of λ degrades cluster quality. As a general solution for finding λ, in this paper we present λ-means, a clustering algorithm capable of deriving an optimal value for λ automatically. We contribute both a theoretically motivated cluster-based version of λ-means and a faster conflict-based version. We demonstrate that λ-means asymptotically discovers the true underlying value of λ when run on datasets generated by a Dirichlet Process, and that it achieves competitive performance on a real-world test dataset. Further, we demonstrate that on both parallel multicore computers and distributed cluster computers in the cloud, cluster-based λ-means achieves near-perfect speedup, while conflict-based λ-means, although the more efficient algorithm, achieves speedups within a factor of two of the maximum possible.
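To make the role of λ concrete, the following is a minimal sketch of the DP-means baseline that the abstract refers to, not of λ-means itself. It assumes the commonly described form of DP-means: points are compared to centroids by squared Euclidean distance, and a point farther than λ from every centroid opens a new cluster. The function name `dp_means`, its parameters, and the handling of empty clusters are illustrative choices, not taken from the paper.

```python
def dp_means(points, lam, max_iter=100):
    """Illustrative DP-means sketch (assumption: squared Euclidean distance vs. lam)."""
    dim = len(points[0])

    def mean(group):
        # Component-wise mean of a non-empty list of points.
        return tuple(sum(p[d] for p in group) / len(group) for d in range(dim))

    # Start with a single cluster centred on the global mean.
    centroids = [mean(points)]
    labels = [0] * len(points)

    for _ in range(max_iter):
        changed = False
        for i, p in enumerate(points):
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            best = min(range(len(centroids)), key=dists.__getitem__)
            if dists[best] > lam:
                # Too far from every centroid: open a new cluster at this point.
                centroids.append(tuple(p))
                best = len(centroids) - 1
            if labels[i] != best:
                labels[i] = best
                changed = True

        # Recompute centroids, dropping clusters that lost all their points.
        members = {}
        for label, p in zip(labels, points):
            members.setdefault(label, []).append(p)
        order = sorted(members)
        remap = {old: new for new, old in enumerate(order)}
        centroids = [mean(members[old]) for old in order]
        labels = [remap[label] for label in labels]

        if not changed:
            break
    return centroids, labels
```

In this sketch a small λ splits the data into many clusters and a large λ merges everything into one, which is the sensitivity that motivates searching for λ automatically.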
