Efficient sample generation for scalable meta learning

Meta learning techniques such as cross-validation and ensemble learning are crucial for applying machine learning to real-world use cases. These techniques first generate samples from the input data, and then train and evaluate machine learning models on these samples. For meta learning on large datasets, generating the samples efficiently becomes a problem, especially when the data is stored in a distributed, block-partitioned representation and processed on a shared-nothing cluster. We present a novel parallel algorithm for efficient sample generation from large, block-partitioned datasets in a shared-nothing architecture. The algorithm executes in a single pass over the data and minimizes inter-machine communication. It supports a wide variety of sample generation techniques through an embedded user-defined sampling function. We illustrate how to implement distributed sample generation with our algorithm for popular meta learning techniques such as hold-out tests, k-fold cross-validation, and bagging, and we present an experimental evaluation on datasets with billions of data points.
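
The idea of a single pass driven by a user-defined sampling function can be illustrated with a small, single-machine sketch. This is not the paper's algorithm: the names generate_samples, hold_out, k_fold, and bagging are hypothetical, the blocks are plain Python lists rather than partitions on a cluster, and the bagging function swaps in a Poisson(1) approximation of sampling with replacement, which is an assumption made here for brevity. Each block gets its own random generator, so in principle the blocks could be processed independently.

```python
import math
import random
from collections import defaultdict


def hold_out(train_fraction):
    """Sampling function: each row goes to either 'train' or 'test'."""
    def assign(rng):
        name = "train" if rng.random() < train_fraction else "test"
        return [(name, 1)]
    return assign


def k_fold(k):
    """Sampling function: a row is a test row in one fold, a training row in the rest."""
    def assign(rng):
        fold = rng.randrange(k)
        return [(("test" if i == fold else "train", i), 1) for i in range(k)]
    return assign


def bagging(num_bags):
    """Sampling function: Poisson(1) copies per bag (an approximation, assumed here)."""
    def poisson1(rng):
        # Knuth's method for a Poisson draw with mean 1.
        limit, count, product = math.exp(-1.0), 0, rng.random()
        while product > limit:
            count += 1
            product *= rng.random()
        return count

    def assign(rng):
        draws = [(b, poisson1(rng)) for b in range(num_bags)]
        return [(("bag", b), c) for b, c in draws if c > 0]
    return assign


def generate_samples(blocks, sampling_fn, seed=0):
    """Visit every row exactly once and route it to the samples chosen by sampling_fn."""
    samples = defaultdict(list)
    for block_index, block in enumerate(blocks):
        # One generator per block: blocks do not share random state.
        rng = random.Random(f"{seed}:{block_index}")
        for row in block:
            for sample_id, count in sampling_fn(rng):
                samples[sample_id].extend([row] * count)
    return dict(samples)


if __name__ == "__main__":
    blocks = [list(range(i * 5, (i + 1) * 5)) for i in range(4)]  # 4 blocks, 20 rows
    for sample_id, rows in sorted(generate_samples(blocks, k_fold(2)).items()):
        print(sample_id, rows)
```

Swapping hold_out, k_fold, or bagging into the same generate_samples call changes the sampling technique without changing the pass over the data, which is the role the abstract assigns to the user-defined sampling function.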
