A Parallel EM Algorithm for Model-Based Clustering Applied to the Exploration of Large Spatio-Temporal Data

We develop a parallel expectation–maximization (EM) algorithm for multivariate Gaussian mixture models and use it to perform model-based clustering of a large climate dataset. We reformulate three variants of the EM algorithm for parallel execution and present a new, faster variant. All are implemented in the single program, multiple data (SPMD) programming model, which can use the combined memory of large distributed computer architectures to process larger datasets. Displaying the estimated mixture model rather than the data lets us explore multivariate relationships in a way that scales to data of arbitrary size. We study the performance of our methodology on simulated data and apply it to a high-resolution climate dataset produced by the Community Atmosphere Model (CAM5). This article has supplementary material online.
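The data-parallel idea behind an SPMD formulation of EM can be illustrated with a minimal sketch. It is not the implementation described in the article: it covers only the standard EM iteration (none of the variants), uses a univariate mixture for brevity, and simulates the workers as chunks of a Python list instead of real distributed processes. The names `local_stats`, `allreduce_sum`, and `em_spmd` are hypothetical. Each worker computes partial sufficient statistics on its own chunk, a sum across workers stands in for a collective reduction, and every worker then performs the same M-step.

```python
import math
import random


def local_stats(chunk, weights, means, variances):
    """E-step on one worker's chunk: return partial sufficient statistics."""
    k = len(weights)
    n_k = [0.0] * k  # sum of responsibilities per component
    s1 = [0.0] * k   # sum of responsibility * x
    s2 = [0.0] * k   # sum of responsibility * x^2
    loglik = 0.0
    for x in chunk:
        dens = [
            weights[j]
            * math.exp(-0.5 * (x - means[j]) ** 2 / variances[j])
            / math.sqrt(2.0 * math.pi * variances[j])
            for j in range(k)
        ]
        total = sum(dens)
        loglik += math.log(total)
        for j in range(k):
            r = dens[j] / total
            n_k[j] += r
            s1[j] += r * x
            s2[j] += r * x * x
    return n_k, s1, s2, loglik


def allreduce_sum(partials):
    """Element-wise sum over workers (stand-in for a collective reduction)."""
    k = len(partials[0][0])
    n_k = [sum(p[0][j] for p in partials) for j in range(k)]
    s1 = [sum(p[1][j] for p in partials) for j in range(k)]
    s2 = [sum(p[2][j] for p in partials) for j in range(k)]
    loglik = sum(p[3] for p in partials)
    return n_k, s1, s2, loglik


def em_spmd(chunks, weights, means, variances, n_iter=50):
    """Run EM where only small summary statistics cross worker boundaries."""
    n = sum(len(c) for c in chunks)
    loglik = float("-inf")
    for _ in range(n_iter):
        # Each "worker" touches only its own chunk of the data.
        partials = [local_stats(c, weights, means, variances) for c in chunks]
        n_k, s1, s2, loglik = allreduce_sum(partials)
        # M-step: identical on every worker once the sums are shared.
        weights = [nk / n for nk in n_k]
        means = [a / nk for a, nk in zip(s1, n_k)]
        variances = [
            max(b / nk - m * m, 1e-6) for b, nk, m in zip(s2, n_k, means)
        ]
    # loglik is evaluated at the parameters used in the last E-step.
    return weights, means, variances, loglik


# Simulated data: two well-separated univariate Gaussian clusters.
random.seed(0)
data = [random.gauss(-3.0, 1.0) for _ in range(2000)]
data += [random.gauss(4.0, 1.0) for _ in range(2000)]
random.shuffle(data)

chunks = [data[i::4] for i in range(4)]  # four simulated workers
weights, means, variances, loglik = em_spmd(
    chunks, [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]
)
print(weights, means, variances, loglik)
```

Because only the per-component sums are exchanged, the communication cost per iteration depends on the number of components and not on the number of observations, which is what allows the data to stay partitioned across the memory of many nodes. In a real distributed run the list comprehension over `chunks` would disappear: each process would hold one chunk and call a collective sum in place of `allreduce_sum`.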
