Embed and Conquer: Scalable Embeddings for Kernel k-Means on MapReduce

The kernel $k$-means is an effective method for data clustering which extends the commonly-used $k$-means algorithm to work on a similarity matrix over complex data structures. The kernel $k$-means algorithm is however computationally very complex as it requires the complete data matrix to be calculated and stored. Further, the kernelized nature of the kernel $k$-means algorithm hinders the parallelization of its computations on modern infrastructures for distributed computing. In this paper, we are defining a family of kernel-based low-dimensional embeddings that allows for scaling kernel $k$-means on MapReduce via an efficient and unified parallelization strategy. Afterwards, we propose two methods for low-dimensional embedding that adhere to our definition of the embedding family. Exploiting the proposed parallelization strategy, we present two scalable MapReduce algorithms for kernel $k$-means. We demonstrate the effectiveness and efficiency of the proposed algorithms through an empirical evaluation on benchmark data sets.

[1]  Christos Faloutsos,et al.  Clustering very large multi-dimensional datasets with MapReduce , 2011, KDD.

[2]  Jimeng Sun,et al.  DisCo: Distributed Co-clustering with Map-Reduce: A Case Study towards Petabyte-Scale End-to-End Mining , 2008, 2008 Eighth IEEE International Conference on Data Mining.

[3]  W. Marsden I and J , 2012 .

[4]  Alexander J. Smola,et al.  Learning with Kernels: support vector machines, regularization, optimization, and beyond , 2001, Adaptive computation and machine learning series.

[5]  Rong Jin,et al.  Approximate kernel k-means: solution to large scale kernel clustering , 2011, KDD.

[6]  Ameet Talwalkar,et al.  Ensemble Nystrom Method , 2009, NIPS.

[7]  Michael W. Mahoney,et al.  Robust Regression on MapReduce , 2013, ICML.

[8]  S. Janson Stable distributions , 2011, 1112.0220.

[9]  S DhillonInderjit,et al.  Weighted Graph Cuts without Eigenvectors A Multilevel Approach , 2007 .

[10]  Inderjit S. Dhillon,et al.  Kernel k-means: spectral clustering and normalized cuts , 2004, KDD.

[11]  Dong Xin,et al.  Fast personalized PageRank on MapReduce , 2011, SIGMOD '11.

[12]  Chih-Jen Lin,et al.  LIBSVM: A library for support vector machines , 2011, TIST.

[13]  Jimmy J. Lin,et al.  Book Reviews: Data-Intensive Text Processing with MapReduce by Jimmy Lin and Chris Dyer , 2010, CL.

[14]  Piotr Indyk,et al.  Stable distributions, pseudorandom generators, embeddings, and data stream computation , 2006, JACM.

[15]  Inderjit S. Dhillon,et al.  Weighted Graph Cuts without Eigenvectors A Multilevel Approach , 2007, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[16]  Sergei Vassilvitskii,et al.  A model of computation for MapReduce , 2010, SODA '10.

[17]  Benjamin Moseley,et al.  Fast clustering using MapReduce , 2011, KDD.

[18]  S. P. Lloyd,et al.  Least squares quantization in PCM , 1982, IEEE Trans. Inf. Theory.

[19]  Ashraf Aboulnaga,et al.  Scalable maximum clique computation using MapReduce , 2013, 2013 IEEE 29th International Conference on Data Engineering (ICDE).

[20]  Matthias W. Seeger,et al.  Using the Nyström Method to Speed Up Kernel Machines , 2000, NIPS.

[21]  Edward Y. Chang,et al.  Parallel Spectral Clustering in Distributed Systems , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[22]  Rong Jin,et al.  Nyström Method vs Random Fourier Features: A Theoretical and Empirical Comparison , 2012, NIPS.

[23]  Chao Liu,et al.  Distributed nonnegative matrix factorization for web-scale dyadic data analysis on mapreduce , 2010, WWW '10.

[24]  Li Fei-Fei,et al.  ImageNet: A large-scale hierarchical image database , 2009, CVPR.

[25]  Jeremy Kubica,et al.  Parallel Large Scale Feature Selection for Logistic Regression , 2009, SDM.

[26]  Philipp Koehn,et al.  Synthesis Lectures on Human Language Technologies , 2016 .

[27]  Wael Abd-Almageed,et al.  Distributed approximate spectral clustering for large-scale datasets , 2012, HPDC '12.

[28]  Ulrike von Luxburg,et al.  A tutorial on spectral clustering , 2007, Stat. Comput..

[29]  Kristen Grauman,et al.  Kernelized Locality-Sensitive Hashing , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[30]  Charalampos E. Tsourakakis,et al.  HADI : Fast Diameter Estimation and Mining in Massive Graphs with Hadoop , 2008 .

[31]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[32]  Xinlei Chen,et al.  Large Scale Spectral Clustering with Landmark-Based Representation , 2011, AAAI.

[33]  Terence Sim,et al.  The CMU Pose, Illumination, and Expression Database , 2003, IEEE Trans. Pattern Anal. Mach. Intell..

[34]  Mohamed S. Kamel,et al.  Distributed Column Subset Selection on MapReduce , 2013, 2013 IEEE 13th International Conference on Data Mining.

[35]  Anil K. Jain,et al.  Data clustering: a review , 1999, CSUR.

[36]  Tom White,et al.  Hadoop: The Definitive Guide , 2009 .

[37]  Benjamin Recht,et al.  Random Features for Large-Scale Kernel Machines , 2007, NIPS.

[38]  Rong Jin,et al.  Efficient Kernel Clustering Using Random Fourier Features , 2012, 2012 IEEE 12th International Conference on Data Mining.

[39]  Randy H. Katz,et al.  A view of cloud computing , 2010, CACM.

[40]  Joydeep Ghosh,et al.  Cluster Ensembles --- A Knowledge Reuse Framework for Combining Multiple Partitions , 2002, J. Mach. Learn. Res..