Accelerating distributed Expectation-Maximization algorithms with frequent updates

Abstract

Expectation-Maximization (EM) is a popular approach to parameter estimation in many applications, such as image understanding, document classification, and genome data analysis. Despite its popularity, EM is challenging to implement efficiently in a distributed environment for handling massive data sets. In particular, EM variants that update the parameters frequently have been shown to converge much faster than their concurrent (batch-update) counterparts. Accordingly, we propose two approaches for parallelizing such EM algorithms in a distributed environment so that they scale to massive data sets, and we prove that both approaches preserve the convergence properties of the EM algorithms. Based on these approaches, we design and implement a distributed framework, FreEM, that supports frequent updates for EM algorithms. We demonstrate its efficiency on two categories of EM applications, clustering and topic modeling: k-means clustering, fuzzy c-means clustering, parameter estimation for the Gaussian Mixture Model, and variational inference for Latent Dirichlet Allocation. We extensively evaluate our framework on both a cluster of local machines and the Amazon EC2 cloud. The evaluation shows that EM algorithms with frequent updates implemented on FreEM converge much faster than implementations based on traditional concurrent updates.
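The contrast between concurrent and frequent updates can be illustrated on k-means, the simplest of the EM-style algorithms the abstract lists. The sketch below is a minimal single-machine illustration, not the FreEM implementation: the concurrent version recomputes centroids only after a full pass over the data (Lloyd-style), while the frequent version moves a centroid immediately after each assignment (a MacQueen-style running mean), so later assignments in the same pass already see fresher parameters. All function names here are hypothetical.

```python
# Minimal sketch (not FreEM) contrasting concurrent vs. frequent updates
# for k-means. Frequent updates fold each point into its centroid at once.

def assign(point, centroids):
    """E-step for one point: index of the nearest centroid."""
    return min(range(len(centroids)),
               key=lambda j: sum((p - c) ** 2
                                 for p, c in zip(point, centroids[j])))

def kmeans_concurrent(points, centroids, iters=10):
    """Concurrent (batch) updates: centroids change only after a full pass."""
    for _ in range(iters):
        sums = [[0.0] * len(c) for c in centroids]
        counts = [0] * len(centroids)
        for p in points:                      # E-step over all points
            j = assign(p, centroids)
            counts[j] += 1
            sums[j] = [s + x for s, x in zip(sums[j], p)]
        centroids = [[s / n for s in sm] if n else list(c)  # M-step at the end
                     for sm, n, c in zip(sums, counts, centroids)]
    return centroids

def kmeans_frequent(points, centroids, iters=10):
    """Frequent (MacQueen-style) updates: each point immediately shifts its
    centroid by a shrinking running-mean step, so subsequent assignments
    within the same pass use the updated parameters."""
    centroids = [list(c) for c in centroids]
    counts = [1] * len(centroids)             # assume each centroid seeded once
    for _ in range(iters):
        for p in points:
            j = assign(p, centroids)
            counts[j] += 1
            eta = 1.0 / counts[j]             # running-mean step size
            centroids[j] = [c + eta * (x - c)
                            for c, x in zip(centroids[j], p)]
    return centroids
```

On well-separated data both variants reach essentially the same centroids; the point of the frequent version is that parameters improve mid-pass, which is the behavior the paper's two parallelization approaches preserve at distributed scale.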
