EM*: An EM Algorithm for Big Data

Existing data mining techniques, particularly iterative learning algorithms, become overwhelmed by big data. While parallelism is an obvious and usually necessary strategy, we observe that (1) continually revisiting data and (2) visiting all data are two of the most prominent problems, especially for iterative, unsupervised algorithms like the Expectation-Maximization algorithm for clustering (EM-T). Our strategy is to embed EM-T into a non-linear hierarchical data structure (a heap) that allows us to (1) separate data that needs to be revisited from data that does not and (2) narrow each iteration toward the data that is more difficult to cluster. We call this extended algorithm EM*. We show that EM* outperforms EM-T on large real-world and synthetic data sets. We conclude with theoretical underpinnings that explain why EM* is successful.

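The core mechanism is easy to sketch: score each point by how ambiguously it is assigned, keep the ambiguous points in a heap, and freeze confident points out of subsequent E-steps so iteration narrows toward the hard-to-cluster data. The following Python sketch illustrates that idea for a Gaussian mixture. It is a minimal illustration under stated assumptions, not the authors' implementation; the names (em_star, conf) and the difficulty score (1 minus the maximum responsibility) are choices made here for exposition.

    # A minimal sketch of the heap-driven strategy described above, assuming a
    # Gaussian mixture model. em_star, conf, and the 1 - max-responsibility
    # difficulty score are illustrative assumptions, not the authors' code.
    import heapq
    import numpy as np

    def e_step(X, means, covs, weights):
        # Responsibilities r[i, j] = P(cluster j | point i) for a Gaussian mixture.
        n, d = X.shape
        k = means.shape[0]
        log_r = np.empty((n, k))
        for j in range(k):
            diff = X - means[j]
            inv = np.linalg.inv(covs[j])
            _, logdet = np.linalg.slogdet(covs[j])
            maha = np.einsum('ni,ij,nj->n', diff, inv, diff)
            log_r[:, j] = np.log(weights[j]) - 0.5 * (maha + logdet + d * np.log(2 * np.pi))
        log_r -= log_r.max(axis=1, keepdims=True)   # stabilize before exponentiating
        r = np.exp(log_r)
        return r / r.sum(axis=1, keepdims=True)

    def em_star(X, k, n_iter=50, conf=0.95, seed=0):
        rng = np.random.default_rng(seed)
        n, d = X.shape
        means = X[rng.choice(n, size=k, replace=False)].copy()
        covs = np.stack([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(k)])
        weights = np.full(k, 1.0 / k)

        active = np.arange(n)           # indices of points still worth revisiting
        frozen_r = np.zeros((n, k))     # cached responsibilities of settled points

        for _ in range(n_iter):
            r_act = e_step(X[active], means, covs, weights)

            # Combine cached responsibilities of settled points with fresh ones.
            r_all = frozen_r.copy()
            r_all[active] = r_act

            # Min-heap keyed by difficulty = 1 - max responsibility. Popping
            # stops at the first ambiguous point, so confident points are
            # frozen out of future E-steps without rescanning the active set.
            heap = [(1.0 - r_act[i].max(), i) for i in range(len(active))]
            heapq.heapify(heap)
            while heap and heap[0][0] <= 1.0 - conf:
                _, i = heapq.heappop(heap)
                frozen_r[active[i]] = r_act[i]      # settle this point
            active = active[[i for _, i in heap]]   # only hard points remain

            # Standard M-step over all points (settled + active).
            nk = r_all.sum(axis=0) + 1e-12
            weights = nk / n
            means = (r_all.T @ X) / nk[:, None]
            for j in range(k):
                diff = X - means[j]
                covs[j] = (r_all[:, j, None] * diff).T @ diff / nk[j] + 1e-6 * np.eye(d)

            if active.size == 0:        # everything is settled; stop early
                break
        return means, covs, weights

On data where most points become confidently assigned after a few iterations, the active set shrinks quickly, so later E-steps touch only the hard points; this shrinking-workload behavior is the effect the abstract attributes to EM*.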