Paper information - Embed and Conquer: Scalable Embeddings for Kernel k-Means on MapReduce

Embed and Conquer: Scalable Embeddings for Kernel k-Means on MapReduce

Kernel $k$-means is an effective method for data clustering that extends the commonly used $k$-means algorithm to work on a similarity matrix over complex data structures. It is, however, computationally expensive, because it requires the complete kernel matrix to be computed and stored. Furthermore, the kernelized nature of the algorithm hinders the parallelization of its computations on modern distributed computing infrastructures. In this paper, we define a family of kernel-based low-dimensional embeddings that allows kernel $k$-means to be scaled on MapReduce via an efficient and unified parallelization strategy. We then propose two low-dimensional embedding methods that adhere to our definition of the embedding family. Exploiting the proposed parallelization strategy, we present two scalable MapReduce algorithms for kernel $k$-means. We demonstrate the effectiveness and efficiency of the proposed algorithms through an empirical evaluation on benchmark data sets.
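The general idea of clustering through a low-dimensional kernel embedding can be illustrated with a minimal sketch. The abstract does not name the two proposed embedding methods, so this example swaps in a standard Nyström-style embedding as a stand-in; it is not the paper's algorithm. All function names and parameters (`rbf_kernel`, `nystrom_embedding`, `map_block`, `reduce_partials`, `embedded_kernel_kmeans`, `init_idx`) are hypothetical, the Gaussian kernel is an assumed choice, and the map and reduce phases are simulated in-process with NumPy rather than run on a real MapReduce cluster.

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def nystrom_embedding(X, landmarks, gamma=0.5, eps=1e-10):
    """Embed rows of X so that inner products approximate kernel values.

    Only kernel values against the landmarks are needed, so each data
    block can be embedded independently of all other blocks.
    """
    W = rbf_kernel(landmarks, landmarks, gamma)
    vals, vecs = np.linalg.eigh(W)
    keep = vals > eps                        # drop numerically null directions
    T = vecs[:, keep] / np.sqrt(vals[keep])  # W^(-1/2) on the kept subspace
    return rbf_kernel(X, landmarks, gamma) @ T

def map_block(Y_block, centroids):
    """Map step (simulated): assign points, emit per-cluster sums and counts."""
    d = ((Y_block[:, None, :] - centroids[None, :, :])**2).sum(-1)
    labels = d.argmin(1)
    sums = np.zeros_like(centroids)
    counts = np.zeros(len(centroids))
    np.add.at(sums, labels, Y_block)
    np.add.at(counts, labels, 1)
    return sums, counts, labels

def reduce_partials(partials, old_centroids):
    """Reduce step (simulated): merge partial sums into new centroids."""
    sums = sum(p[0] for p in partials)
    counts = sum(p[1] for p in partials)
    new = old_centroids.copy()
    nonempty = counts > 0                    # empty clusters keep old centroid
    new[nonempty] = sums[nonempty] / counts[nonempty][:, None]
    return new

def embedded_kernel_kmeans(X, k, m=20, n_blocks=4, n_iter=20,
                           gamma=0.5, seed=0, init_idx=None):
    """Approximate kernel k-means: embed each block, then run Lloyd iterations."""
    rng = np.random.default_rng(seed)
    landmarks = X[rng.choice(len(X), size=m, replace=False)]
    blocks = [nystrom_embedding(B, landmarks, gamma)
              for B in np.array_split(X, n_blocks)]
    Y = np.vstack(blocks)
    if init_idx is None:
        init_idx = rng.choice(len(Y), size=k, replace=False)
    centroids = Y[np.asarray(init_idx)]
    for _ in range(n_iter):
        partials = [map_block(B, centroids) for B in blocks]
        centroids = reduce_partials(partials, centroids)
    return np.concatenate([map_block(B, centroids)[2] for B in blocks])
```

A call such as `embedded_kernel_kmeans(X, k=2)` returns one cluster label per row of `X`. The point of the design is that, once the data are embedded, each iteration only exchanges $k$ partial sums and counts per block, so the full kernel matrix never has to be computed or stored.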
