Efficient Scalable Accurate Regression Queries in In-DBMS Analytics

Recent trends aim to incorporate advanced data analytics capabilities within DBMSs. Linear regression queries are fundamental to exploratory analytics and predictive modeling. However, computing their exact answers leaves a lot to be desired in terms of efficiency and scalability. We contribute a novel predictive analytics model and associated regression query processing algorithms, which are efficient, scalable and accurate. We focus on predicting the answers to two key query types that reveal dependencies between the values of different attributes: (i) mean-value queries and (ii) multivariate linear regression queries, both within specific data subspaces defined based on the values of other attributes. Our algorithms achieve many orders of magnitude improvement in query processing efficiency and nearperfect approximations of the underlying relationships among data attributes.

[1]  Samuel Madden,et al.  MauveDB: supporting model-based user views in database systems , 2006, SIGMOD Conference.

[2]  Trevor Hastie,et al.  The Elements of Statistical Learning , 2001 .

[3]  Karen A. F. Copeland Local Polynomial Modelling and its Applications , 1997 .

[4]  Marc'Aurelio Ranzato,et al.  Large Scale Distributed Deep Networks , 2012, NIPS.

[5]  S. Grossberg Adaptive Resonance Theory , 2006 .

[6]  Stephen Grossberg,et al.  Adaptive Resonance Theory , 2010, Encyclopedia of Machine Learning.

[7]  H. Altay Güvenir,et al.  Clustered linear regression , 2002, Knowl. Based Syst..

[8]  Ion Stoica,et al.  Ernest: Efficient Performance Prediction for Large-Scale Advanced Analytics , 2016, NSDI.

[9]  Christopher Ré,et al.  Towards a unified architecture for in-RDBMS analytics , 2012, SIGMOD Conference.

[10]  Samuel Madden,et al.  Querying continuous functions in a database system , 2008, SIGMOD Conference.

[11]  Léon Bottou,et al.  Stochastic Gradient Descent Tricks , 2012, Neural Networks: Tricks of the Trade.

[12]  H. H. Rosenbrock,et al.  An Automatic Method for Finding the Greatest or Least Value of a Function , 1960, Comput. J..

[13]  Michael Biehl,et al.  Adaptive Relevance Matrices in Learning Vector Quantization , 2009, Neural Computation.

[14]  Joseph M. Hellerstein,et al.  MAD Skills: New Analysis Practices for Big Data , 2009, Proc. VLDB Endow..

[15]  Dan Olteanu,et al.  Learning Linear Regression Models over Factorized Joins , 2016, SIGMOD Conference.

[16]  Alexander Vergara,et al.  On the calibration of sensor arrays for pattern recognition using the minimal number of experiments , 2014 .

[17]  J. Freidman,et al.  Multivariate adaptive regression splines , 1991 .

[18]  Zhao Zhang,et al.  Rethinking Data-Intensive Science Using Scalable Analytics Systems , 2015, SIGMOD Conference.