HADAD: A Lightweight Approach for Optimizing Hybrid Complex Analytics Queries

Hybrid complex analytics workloads typically include (i) data management tasks (joins, selections, etc.), easily expressed in relational algebra (RA)-based languages, and (ii) complex analytics tasks (regressions, matrix decompositions, etc.), mostly expressed as linear algebra (LA) expressions. Such workloads are common in many application areas, including scientific computing, web analytics, and business recommendation. Existing solutions for evaluating hybrid analytical tasks range from LA-oriented systems, to relational systems extended to handle LA operations, to hybrid systems. They either optimize data management and complex analytics tasks separately, exploit only RA properties while leaving LA-specific optimization opportunities unexploited, or focus heavily on physical optimization while leaving semantic query optimization opportunities unexplored. Additionally, they cannot exploit precomputed (materialized) results to avoid recomputing all or part of a given mixed (RA and/or LA) computation. In this paper, we take a major step towards filling this gap by proposing HADAD, an extensible lightweight approach for optimizing hybrid complex analytics queries, based on a common abstraction that facilitates unified reasoning: a relational model endowed with integrity constraints. Our solution can be applied naturally and portably on top of pure LA and hybrid RA-LA platforms without modifying their internals. An extensive empirical evaluation shows that HADAD yields significant performance gains on diverse workloads, ranging from LA-centered to hybrid.
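To make the notion of an LA-specific optimization opportunity concrete, the following minimal sketch illustrates one well-known algebraic identity: the sum of all cells of a product M N equals the dot product of the column sums of M with the row sums of N. This is an illustration only and is not taken from the paper; the function names, the list-of-rows matrix representation, and the sample matrices are all assumptions chosen for the example, and the code does not model HADAD's constraint-based reasoning.

```python
# Illustrative only: a standard algebraic identity, not HADAD's rewriting engine.
# Matrices are plain lists of rows; all names here are hypothetical.

def matmul(m, n):
    """Naive matrix product of two lists-of-rows."""
    return [[sum(m[i][k] * n[k][j] for k in range(len(n)))
             for j in range(len(n[0]))]
            for i in range(len(m))]

def sum_of_product_naive(m, n):
    """sum(M N): materializes the full product, then adds up its cells."""
    return sum(sum(row) for row in matmul(m, n))

def sum_of_product_rewritten(m, n):
    """Equivalent form colSums(M) . rowSums(N): never builds M N."""
    col_sums = [sum(row[k] for row in m) for k in range(len(n))]
    row_sums = [sum(row) for row in n]
    return sum(c * r for c, r in zip(col_sums, row_sums))

M = [[1, 2], [3, 4], [5, 6]]        # 3 x 2
N = [[7, 8, 9], [10, 11, 12]]       # 2 x 3

# Both forms agree, but the rewritten one skips the 3 x 3 intermediate.
print(sum_of_product_naive(M, N), sum_of_product_rewritten(M, N))
```

For an m x k matrix M and a k x n matrix N, the naive form performs on the order of m·k·n multiplications, whereas the rewritten form needs only k multiplications after computing the sums. An optimizer that reasons only about RA properties would not discover this kind of equivalence.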
