Distrim: Parallel GMM learning on multicore cluster

Learning a Gaussian mixture model (GMM) on extremely large data sets is challenging. We provide theoretical support for the feasibility of parallel EM-based GMM learning via distributed computing, and we design and implement Distrim, a distributed memory-sharing GMM learning system for multicore clusters. Distrim aims to maximize the use of available computational power while minimizing communication overhead. Experimental results show that Distrim is much more efficient than Hadoop and scales well with the number of computing nodes.
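One standard way to see why EM-based GMM learning parallelizes is that the E-step statistics are sums over data points, so each worker can compute them on its own partition and only the small per-component totals need to be communicated. The NumPy sketch below illustrates that property under stated assumptions; it is not Distrim's implementation, and the names `local_stats`, `m_step`, and `em_iteration`, as well as the `reg` covariance regularizer, are hypothetical. The partitions are processed in a plain loop here, standing in for workers on separate cores or nodes.

```python
import numpy as np

def local_stats(X, weights, means, covs):
    """E-step on one data partition: return additive sufficient statistics."""
    n, d = X.shape
    k = len(weights)
    log_p = np.empty((n, k))
    for j in range(k):
        diff = X - means[j]
        # Solve a linear system rather than inverting the covariance.
        sol = np.linalg.solve(covs[j], diff.T).T
        maha = np.einsum("nd,nd->n", diff, sol)
        _, logdet = np.linalg.slogdet(covs[j])
        log_p[:, j] = np.log(weights[j]) - 0.5 * (maha + logdet + d * np.log(2 * np.pi))
    # Normalize in log space to obtain responsibilities.
    log_norm = np.logaddexp.reduce(log_p, axis=1, keepdims=True)
    resp = np.exp(log_p - log_norm)
    N = resp.sum(axis=0)                          # soft counts, shape (k,)
    S1 = resp.T @ X                               # weighted sums, shape (k, d)
    S2 = np.einsum("nk,nd,ne->kde", resp, X, X)   # weighted outer products, (k, d, d)
    return N, S1, S2, log_norm.sum()

def m_step(N, S1, S2, reg=1e-6):
    """M-step from globally summed statistics (reg is an assumed stabilizer)."""
    weights = N / N.sum()
    means = S1 / N[:, None]
    covs = S2 / N[:, None, None] - np.einsum("kd,ke->kde", means, means)
    covs = covs + reg * np.eye(means.shape[1])
    return weights, means, covs

def em_iteration(partitions, weights, means, covs):
    """One EM iteration: per-partition E-step, global reduction, then M-step."""
    stats = [local_stats(X, weights, means, covs) for X in partitions]
    # The reduction is a plain sum; on a cluster this would be an all-reduce.
    N, S1, S2, loglik = (sum(s[i] for s in stats) for i in range(4))
    return m_step(N, S1, S2) + (loglik,)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(-2.0, 1.0, (500, 2)), rng.normal(3.0, 1.0, (500, 2))])
    params = (np.array([0.5, 0.5]), X[[0, -1]].copy(), np.stack([np.eye(2)] * 2))
    for _ in range(20):
        *params, loglik = em_iteration(np.array_split(X, 4), *params)
    print("weights:", params[0])
    print("log-likelihood:", loglik)
```

Because the statistics are additive, splitting the data into any number of partitions yields the same parameter update as a single pass over all the data, up to floating-point rounding. The communication cost per iteration in this sketch depends on the number of components and the dimensionality, not on the number of data points.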
