TuPAQ: An Efficient Planner for Large-scale Predictive Analytic Queries

The proliferation of massive datasets combined with the development of sophisticated analytical techniques have enabled a wide variety of novel applications such as improved product recommendations, automatic image tagging, and improved speech-driven interfaces. These and many other applications can be supported by Predictive Analytic Queries (PAQs). A major obstacle to supporting PAQs is the challenging and expensive process of identifying and training an appropriate predictive model. Recent efforts aiming to automate this process have focused on single node implementations and have assumed that model training itself is a black box, thus limiting the effectiveness of such approaches on large-scale problems. In this work, we build upon these recent efforts and propose an integrated PAQ planning architecture that combines advanced model search techniques, bandit resource allocation via runtime algorithm introspection, and physical optimization via batching. The result is TuPAQ, a component of the MLbase system, which solves the PAQ planning problem with comparable quality to exhaustive strategies but an order of magnitude more efficiently than the standard baseline approach, and can scale to models trained on terabytes of data across hundreds of machines.

[1]  M. J. D. Powell,et al.  An efficient method for finding the minimum of a function of several variables without calculating derivatives , 1964, Comput. J..

[2]  John A. Nelder,et al.  A Simplex Method for Function Minimization , 1965, Comput. J..

[3]  Charles L. Lawson,et al.  Basic Linear Algebra Subprograms for Fortran Usage , 1979, TOMS.

[4]  Y. Benjamini,et al.  Controlling the false discovery rate: a practical and powerful approach to multiple testing , 1995 .

[5]  Jack J. Dongarra,et al.  Automatically Tuned Linear Algebra Software , 1998, Proceedings of the IEEE/ACM SC98 Conference.

[6]  Samuel Madden,et al.  MauveDB: supporting model-based user views in database systems , 2006, SIGMOD Conference.

[7]  Shie Mannor,et al.  Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems , 2006, J. Mach. Learn. Res..

[8]  Benjamin Recht,et al.  Random Features for Large-Scale Kernel Machines , 2007, NIPS.

[9]  Daisy Zhe Wang,et al.  BayesStore: managing large, uncertain data repositories with probabilistic graphical models , 2008, Proc. VLDB Endow..

[10]  Chih-Jen Lin,et al.  A Practical Guide to Support Vector Classication , 2008 .

[11]  Samuel Williams,et al.  Roofline: an insightful visual performance model for multicore architectures , 2009, CACM.

[12]  Roberto J. Bayardo,et al.  PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce , 2009, Proc. VLDB Endow..

[13]  Yoram Singer,et al.  Adaptive Subgradient Methods for Online Learning and Stochastic Optimization , 2011, J. Mach. Learn. Res..

[14]  John L. Nazareth,et al.  Introduction to derivative-free optimization , 2010, Math. Comput..

[15]  Yoshua Bengio,et al.  Algorithms for Hyper-Parameter Optimization , 2011, NIPS.

[16]  Gaël Varoquaux,et al.  Scikit-learn: Machine Learning in Python , 2011, J. Mach. Learn. Res..

[17]  Ameet Talwalkar,et al.  Divide-and-Conquer Matrix Factorization , 2011, NIPS.

[18]  Kevin Leyton-Brown,et al.  Sequential Model-Based Optimization for General Algorithm Configuration , 2011, LION.

[19]  Tara N. Sainath,et al.  Exemplar-Based Sparse Representation Features: From TIMIT to LVCSR , 2011, IEEE Transactions on Audio, Speech, and Language Processing.

[20]  Rares Vernica,et al.  Hyracks: A flexible and extensible foundation for data-intensive computing , 2011, 2011 IEEE 27th International Conference on Data Engineering.

[21]  Shirish Tatikonda,et al.  SystemML: Declarative machine learning on MapReduce , 2011, 2011 IEEE 27th International Conference on Data Engineering.

[22]  Yoshua Bengio,et al.  Random Search for Hyper-Parameter Optimization , 2012, J. Mach. Learn. Res..

[23]  Jasper Snoek,et al.  Practical Bayesian Optimization of Machine Learning Algorithms , 2012, NIPS.

[24]  Kun Li,et al.  The MADlib Analytics Library or MAD Skills, the SQL , 2012, Proc. VLDB Endow..

[25]  Michael J. Franklin,et al.  Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing , 2012, NSDI.

[26]  Sébastien Bubeck,et al.  Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems , 2012, Found. Trends Mach. Learn..

[27]  Joseph Gonzalez,et al.  PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs , 2012, OSDI.

[28]  Jonathan Krause,et al.  Hedging your bets: Optimizing accuracy-specificity trade-offs in large scale visual recognition , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[29]  Tim Kraska,et al.  MLbase: A Distributed Machine-learning System , 2013, CIDR.

[30]  Christopher Ré,et al.  Feature Selection in Enterprise Analytics: A Demonstration using an R-based Data Analytics System , 2013, Proc. VLDB Endow..

[31]  John F. Canny,et al.  Big data analytics with small footprint: squaring the cloud , 2013, KDD.

[32]  Kevin Leyton-Brown,et al.  Auto-WEKA: combined selection and hyperparameter optimization of classification algorithms , 2012, KDD.

[33]  Tim Kraska,et al.  MLI: An API for Distributed Machine Learning , 2013, 2013 IEEE 13th International Conference on Data Mining.

[34]  Andreas Krause,et al.  Near-optimal Batch Mode Active Learning and Adaptive Submodular Optimization , 2013, ICML.

[35]  Felix Naumann,et al.  The Stratosphere platform for big data analytics , 2014, The VLDB Journal.

[36]  Christopher Ré,et al.  DimmWitted: A Study of Main-Memory Statistical Analytics , 2014, Proc. VLDB Endow..

[37]  John Langford,et al.  A reliable effective terascale linear learning system , 2011, J. Mach. Learn. Res..

[38]  Ameet Talwalkar,et al.  Knowing when you're wrong: building fast and reliable approximate query processing systems , 2014, SIGMOD Conference.

[39]  Tara N. Sainath,et al.  Kernel methods match Deep Neural Networks on TIMIT , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[40]  P. Haas MCDB and SimSQL : Scalable Stochastic Analytics within the Database , 2014 .

[41]  Max Kuhn,et al.  caret: Classification and Regression Training , 2015 .

[42]  Michael S. Bernstein,et al.  ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.