Gaussian Mixture Models Use-Case: In-Memory Analysis with Myria

In our work with scientists, we find that Gaussian mixture modeling is a common type of analysis that is applied to increasingly large datasets. We implement this algorithm in Myria, a shared-nothing relational data management system that performs the computation in memory. We study the resulting memory-utilization challenges and implement several optimizations that yield an efficient and scalable solution. Empirical evaluations on large astronomy and oceanography datasets confirm that our Myria approach scales well and runs up to an order of magnitude faster than Hadoop.
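For readers unfamiliar with the analysis itself, the following is a minimal single-machine sketch of fitting a one-dimensional Gaussian mixture with expectation-maximization (EM). It is not the Myria implementation: the abstract does not say which fitting procedure is used, so the choice of EM is an assumption, as are the function names (`normal_pdf`, `em_step`, `fit_gmm`), the synthetic data, and the initial means, which are all hypothetical and chosen only for illustration.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_step(data, weights, means, variances):
    """Run one EM iteration.

    Returns the updated (weights, means, variances) and the log-likelihood
    of the data under the parameters that were passed in.
    """
    k = len(weights)
    # E-step: responsibility of each component for each point.
    resp = []
    log_likelihood = 0.0
    for x in data:
        joint = [weights[j] * normal_pdf(x, means[j], variances[j]) for j in range(k)]
        total = sum(joint)
        log_likelihood += math.log(total)
        resp.append([p / total for p in joint])
    # M-step: re-estimate each component from its weighted points.
    n = len(data)
    new_w, new_m, new_v = [], [], []
    for j in range(k):
        nj = sum(r[j] for r in resp)
        mean = sum(r[j] * x for r, x in zip(resp, data)) / nj
        var = sum(r[j] * (x - mean) ** 2 for r, x in zip(resp, data)) / nj
        new_w.append(nj / n)
        new_m.append(mean)
        new_v.append(max(var, 1e-6))  # floor guards against a collapsing component
    return new_w, new_m, new_v, log_likelihood


def fit_gmm(data, means, iterations=50):
    """Fit a mixture starting from the given means, equal weights, unit variances."""
    k = len(means)
    weights = [1.0 / k] * k
    variances = [1.0] * k
    history = []
    for _ in range(iterations):
        weights, means, variances, ll = em_step(data, weights, means, variances)
        history.append(ll)
    return weights, means, variances, history


# Synthetic data: two well-separated clusters (illustrative only).
rng = random.Random(0)
data = [rng.gauss(-5.0, 1.0) for _ in range(300)] + [rng.gauss(5.0, 1.0) for _ in range(300)]
weights, means, variances, history = fit_gmm(data, means=[-1.0, 1.0])
```

Each iteration makes a full pass over the data and keeps one responsibility per point and component, which hints at why memory utilization becomes a concern once the whole computation is held in memory on large datasets.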
