Efficient Construction of Approximate Ad-Hoc ML models Through Materialization and Reuse

Machine learning has become an essential toolkit for complex analytic processing. Data is typically stored in large data warehouses with multiple dimension hierarchies, and the data used for building an ML model is often aligned with OLAP hierarchies such as location or time. In this paper, we investigate the feasibility of efficiently constructing approximate ML models for new queries from previously constructed ML models by leveraging the concepts of model materialization and reuse. For example, is it possible to construct an approximate ML model for data from the year 2017 if one already has ML models for each of its quarters? We propose algorithms that support a wide variety of ML models, such as generalized linear models for classification and K-Means and Gaussian Mixture models for clustering. We also propose a cost-based optimization framework that identifies appropriate ML models to combine at query time, and we conduct extensive experiments on real-world and synthetic datasets. Our results indicate that our framework can support analytic queries on ML models with superior performance, achieving speedups of several orders of magnitude on very large datasets.

PVLDB Reference Format: Sona Hasani, Saravanan Thirumuruganathan, Abolfazl Asudeh, Nick Koudas and Gautam Das. Efficient Construction of Approximate Ad-Hoc ML models Through Materialization and Reuse. PVLDB, 11(11): 1468-1481, 2018. DOI: https://doi.org/10.14778/3236187.3236199
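To make the quarters-to-year example concrete, the sketch below illustrates two of the simplest merge strategies one could use to approximate a "year" model from pre-materialized "quarter" models: size-weighted averaging of generalized linear model coefficients, and re-clustering the weighted union of per-quarter K-Means centroids. The data, weights, and model settings here are illustrative assumptions only; the paper's actual merge algorithms and cost-based optimizer choose among candidate models at query time and are considerably more involved.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic stand-ins for pre-materialized per-quarter training data.
quarters = [(rng.normal(size=(500, 4)), rng.integers(0, 2, size=500))
            for _ in range(4)]

# (1) Approximate yearly GLM: size-weighted average of quarterly coefficients.
quarter_glms = [LogisticRegression(max_iter=1000).fit(X, y) for X, y in quarters]
sizes = np.array([len(y) for _, y in quarters], dtype=float)
weights = sizes / sizes.sum()
year_glm = LogisticRegression(max_iter=1000).fit(*quarters[0])  # template object
year_glm.coef_ = sum(w * m.coef_ for w, m in zip(weights, quarter_glms))
year_glm.intercept_ = sum(w * m.intercept_ for w, m in zip(weights, quarter_glms))

# (2) Approximate yearly K-Means: cluster the union of quarterly centroids,
#     weighting each centroid by the number of points it represents.
k = 5
quarter_kms = [KMeans(n_clusters=k, n_init=10).fit(X) for X, _ in quarters]
all_centroids = np.vstack([m.cluster_centers_ for m in quarter_kms])
centroid_weights = np.concatenate(
    [np.bincount(m.labels_, minlength=k) for m in quarter_kms])
year_km = KMeans(n_clusters=k, n_init=10).fit(all_centroids,
                                              sample_weight=centroid_weights)

print(year_glm.coef_)
print(year_km.cluster_centers_)

Weighted parameter averaging and centroid merging mirror standard ideas from distributed training and coreset-style clustering; a quick sanity check is to compare the merged models against ones trained directly on the concatenated yearly data.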
