Paper information - Web-scale k-means clustering

We present two modifications to the popular k-means clustering algorithm to address the extreme requirements for latency, scalability, and sparsity encountered in user-facing web applications. First, we propose the use of mini-batch optimization for k-means clustering. This reduces computation cost by orders of magnitude compared to the classic batch algorithm while yielding significantly better solutions than online stochastic gradient descent. Second, we achieve sparsity with projected gradient descent, and give a fast ε-accurate projection onto the L1-ball. Source code is freely available: http://code.google.com/p/sofia-ml
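The two ideas named in the abstract can be illustrated with a short sketch. This is a minimal pure-Python illustration, not the sofia-ml implementation: the function names, the default batch size and iteration count, the per-center learning rate of 1/count, and the plain bisection search for the shrinkage threshold are all assumptions made for this example.

```python
import math
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_center(centers, x):
    """Index of the center closest to x."""
    return min(range(len(centers)), key=lambda j: squared_distance(centers[j], x))


def mini_batch_kmeans(points, k, batch_size=10, iterations=100, seed=0):
    """Mini-batch k-means sketch (hypothetical helper, parameters are assumptions).

    Each iteration samples a small batch, assigns every batch point to its
    nearest center, then moves each center toward its assigned points with a
    per-center learning rate of 1 / (number of points seen by that center).
    """
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]  # initialize from the data
    counts = [0] * k
    for _ in range(iterations):
        batch = [rng.choice(points) for _ in range(batch_size)]
        # Assign the whole batch first, using the centers as they stand now.
        assignments = [nearest_center(centers, x) for x in batch]
        for x, j in zip(batch, assignments):
            counts[j] += 1
            eta = 1.0 / counts[j]
            centers[j] = [(1.0 - eta) * c + eta * xi for c, xi in zip(centers[j], x)]
    return centers


def project_l1_ball(v, radius, eps=1e-9):
    """Approximate projection of v onto the L1-ball of the given radius.

    Uses bisection on the soft-threshold value theta (an assumed, simple
    search strategy). The result always has L1 norm <= radius.
    """
    if sum(abs(x) for x in v) <= radius:
        return list(v)  # already inside the ball

    def shrink(theta):
        return [math.copysign(max(abs(x) - theta, 0.0), x) for x in v]

    lo, hi = 0.0, max(abs(x) for x in v)
    while hi - lo > eps:
        mid = (lo + hi) / 2.0
        if sum(abs(x) for x in shrink(mid)) > radius:
            lo = mid  # not enough shrinkage yet
        else:
            hi = mid  # feasible, try shrinking less
    return shrink(hi)
```

In a sparse variant, a projection such as project_l1_ball would be applied to each center after its update so that small coordinates are driven to exactly zero; how often to project and what radius to use are tuning choices that this sketch leaves open.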

D. Sculley

[1] Yoshua Bengio, et al. Convergence Properties of the K-Means Algorithms, 1994, NIPS.

[2] Charles Elkan, et al. Using the Triangle Inequality to Accelerate k-Means, 2003, ICML.

[3] Yiming Yang, et al. RCV1: A New Benchmark Collection for Text Categorization Research, 2004, J. Mach. Learn. Res.

[4] Yoram Singer, et al. Efficient projections onto the l1-ball for learning in high dimensions, 2008, ICML '08.

[5] Xindong Wu, et al. The Top Ten Algorithms in Data Mining, 2009.

[6] Robert Tibshirani, et al. A Framework for Feature Selection in Clustering, 2010, Journal of the American Statistical Association.