EM*: An EM Algorithm for Big Data

Existing data mining techniques, particularly iterative learning algorithms, become overwhelmed by big data. While parallelism is an obvious and usually necessary strategy, we observe that (1) continually revisiting data and (2) visiting all data are two of the most prominent problems, especially for iterative, unsupervised algorithms like the Expectation-Maximization algorithm for clustering (EM-T). Our strategy is to embed EM-T into a non-linear hierarchical data structure (a heap) that allows us to (1) separate data that needs to be revisited from data that does not and (2) narrow each iteration toward the data that is more difficult to cluster. We call this extended algorithm EM*. We show that EM* outperforms EM-T on large real-world and synthetic data sets. We conclude with theoretical underpinnings that explain why EM* is successful.

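The core mechanism is easy to sketch: score each point by how ambiguously it is assigned, keep the ambiguous points in a heap, and freeze confident points out of subsequent E-steps so iteration narrows toward the hard-to-cluster data. The following Python sketch illustrates that idea for a Gaussian mixture. It is a minimal illustration under stated assumptions, not the authors' implementation; the names (em_star, conf) and the difficulty score (1 minus the maximum responsibility) are choices made here for exposition.

    # A minimal sketch of the heap-driven strategy described above, assuming a
    # Gaussian mixture model. em_star, conf, and the 1 - max-responsibility
    # difficulty score are illustrative assumptions, not the authors' code.
    import heapq
    import numpy as np

    def e_step(X, means, covs, weights):
        # Responsibilities r[i, j] = P(cluster j | point i) for a Gaussian mixture.
        n, d = X.shape
        k = means.shape[0]
        log_r = np.empty((n, k))
        for j in range(k):
            diff = X - means[j]
            inv = np.linalg.inv(covs[j])
            _, logdet = np.linalg.slogdet(covs[j])
            maha = np.einsum('ni,ij,nj->n', diff, inv, diff)
            log_r[:, j] = np.log(weights[j]) - 0.5 * (maha + logdet + d * np.log(2 * np.pi))
        log_r -= log_r.max(axis=1, keepdims=True)   # stabilize before exponentiating
        r = np.exp(log_r)
        return r / r.sum(axis=1, keepdims=True)

    def em_star(X, k, n_iter=50, conf=0.95, seed=0):
        rng = np.random.default_rng(seed)
        n, d = X.shape
        means = X[rng.choice(n, size=k, replace=False)].copy()
        covs = np.stack([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(k)])
        weights = np.full(k, 1.0 / k)

        active = np.arange(n)           # indices of points still worth revisiting
        frozen_r = np.zeros((n, k))     # cached responsibilities of settled points

        for _ in range(n_iter):
            r_act = e_step(X[active], means, covs, weights)

            # Combine cached responsibilities of settled points with fresh ones.
            r_all = frozen_r.copy()
            r_all[active] = r_act

            # Min-heap keyed by difficulty = 1 - max responsibility. Popping
            # stops at the first ambiguous point, so confident points are
            # frozen out of future E-steps without rescanning the active set.
            heap = [(1.0 - r_act[i].max(), i) for i in range(len(active))]
            heapq.heapify(heap)
            while heap and heap[0][0] <= 1.0 - conf:
                _, i = heapq.heappop(heap)
                frozen_r[active[i]] = r_act[i]      # settle this point
            active = active[[i for _, i in heap]]   # only hard points remain

            # Standard M-step over all points (settled + active).
            nk = r_all.sum(axis=0) + 1e-12
            weights = nk / n
            means = (r_all.T @ X) / nk[:, None]
            for j in range(k):
                diff = X - means[j]
                covs[j] = (r_all[:, j, None] * diff).T @ diff / nk[j] + 1e-6 * np.eye(d)

            if active.size == 0:        # everything is settled; stop early
                break
        return means, covs, weights

On data where most points become confidently assigned after a few iterations, the active set shrinks quickly, so later E-steps touch only the hard points; this shrinking-workload behavior is the effect the abstract attributes to EM*.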