Big data analytics with small footprint: squaring the cloud

This paper describes the BID Data Suite, a collection of hardware, software and design patterns that enable fast, large-scale data mining at very low cost. By co-designing all of these elements, we achieve single-machine performance that equals or exceeds reported cluster implementations on common benchmark problems. A key design criterion is rapid exploration of models; hence the system is interactive and primarily single-user. The elements of the suite are: (i) the data engine, a hardware design pattern that balances storage, CPU and GPU acceleration for typical data mining workloads; (ii) BIDMat, an interactive matrix library that integrates CPU and GPU acceleration and novel computational kernels; (iii) BIDMach, a machine learning system that includes very efficient model optimizers; (iv) butterfly mixing, a communication strategy that hides the latency of the frequent model updates these optimizers require; and (v) design patterns that improve the performance of iterative update algorithms. We present several benchmark problems to show how these elements combine to yield multiple orders-of-magnitude improvements on each.

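To make element (iv) concrete, the sketch below illustrates the general idea of butterfly mixing rather than the suite's actual implementation. It assumes a standard butterfly (hypercube) schedule: with n = 2^d nodes, at step t each node exchanges and averages its model with the node whose index differs in bit (t mod d), so a full all-reduce is spread over d cheap pairwise exchanges interleaved with local gradient steps. All names here (butterfly_partner, mix_step) are illustrative, and the in-process lists simply stand in for network sends between cluster nodes.

import numpy as np

def butterfly_partner(node: int, step: int, log_n: int) -> int:
    """Node exchanged with at a given step: flip one address bit per step,
    cycling through the log_n dimensions of the butterfly network."""
    return node ^ (1 << (step % log_n))

def mix_step(models, grads, step, lr, log_n):
    """One iteration per node: a local minibatch gradient step followed by a
    single pairwise exchange-and-average along the current butterfly dimension.
    Over log_n consecutive steps every node's update reaches every other node,
    approximating an all-reduce without paying its full latency each iteration."""
    updated = [m - lr * g for m, g in zip(models, grads)]
    return [0.5 * (updated[i] + updated[butterfly_partner(i, step, log_n)])
            for i in range(len(updated))]

# Toy simulation: 8 "nodes", each holding a scalar model, with noisy gradients
# pulling toward zero. After a few mixed steps the node models agree closely.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    log_n, n, lr = 3, 8, 0.1
    models = list(rng.normal(size=n))
    for t in range(30):
        grads = [m + rng.normal(scale=0.1) for m in models]  # grad of 0.5*m^2 plus noise
        models = mix_step(models, grads, t, lr, log_n)
    print("spread across nodes:", max(models) - min(models))

The design point being illustrated: each iteration communicates with only one partner, so model updates stay frequent (as fast optimizers demand) while the cost of a full synchronization is amortized over log n steps.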