KeystoneML: Optimizing Pipelines for Large-Scale Advanced Analytics

Modern advanced analytics applications make use of machine learning techniques and contain multiple steps of domain-specific and general-purpose processing with high resource requirements. We present KeystoneML, a system that captures and optimizes end-to-end large-scale machine learning applications for high-throughput training in a distributed environment, and exposes them through a high-level API. This approach offers greater ease of use and higher performance than existing systems for large-scale learning. We demonstrate the effectiveness of KeystoneML in achieving high statistical accuracy and scalable training using real-world datasets in several domains.
