Keep Your Host Language Object and Also Query it: A Case for SQL Query Support in RDBMS for Host Language Objects

Driven by the rapid growth of data science and machine learning applications, modern relational database management systems (RDBMS) are experimenting with various approaches to support advanced analytical computations in addition to the relational operations they traditionally provide. The most common approach has been to embed a high-level language (HLL) interpreter in the RDBMS, together with libraries that specialize in numerical computation. Such implementations, e.g., user-defined functions (UDFs), generally follow a black-box setup, and for many complex workflows that pass datasets back and forth between the query execution engine and the embedded HLL interpreter, optimization opportunities have not yet been fully explored. In this paper, we propose and implement the concept of virtual tables, which expose dataset objects maintained by the embedded HLL interpreter to the query engine so that it can execute relational operations on them. Unlike prevalent solutions, our approach minimizes data copies and conversions, performing them lazily and only when required. It also creates better optimization opportunities for SQL queries, because the RDBMS can analyze the data characteristics of the HLL objects before producing an execution plan. The approach is also programmer-friendly, allowing for a more intuitive implementation of computational workflows. Our evaluations over a variety of workloads demonstrate the performance and programming benefits of virtual tables.
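The two ideas in the abstract, lazy conversion and exposing data characteristics before planning, can be illustrated with a minimal sketch. This is not the paper's implementation: the class `VirtualTable`, its methods `stats` and `scan`, and the `conversions` counter are all hypothetical names chosen for illustration, and a plain Python dictionary of column lists stands in for an HLL dataset object.

```python
from typing import Any, Dict, List, Optional, Tuple


class VirtualTable:
    """Hypothetical wrapper that exposes a host-language object as a table.

    Assumption: the host object is a dict mapping column names to
    equal-length lists. A real system would wrap interpreter-owned arrays.
    """

    def __init__(self, name: str, columns: Dict[str, List[Any]]) -> None:
        self.name = name
        self._columns = columns  # keep a reference only; nothing is copied here
        self._rows: Optional[List[Tuple[Any, ...]]] = None
        self.conversions = 0  # counts how often a conversion actually ran

    def stats(self) -> Dict[str, Any]:
        """Report data characteristics without converting any data.

        A planner could use figures like these before choosing a plan.
        """
        first = next(iter(self._columns.values()), [])
        return {"rows": len(first), "columns": list(self._columns)}

    def scan(self) -> List[Tuple[Any, ...]]:
        """Return row tuples, converting lazily on the first call only."""
        if self._rows is None:
            self._rows = list(zip(*self._columns.values()))
            self.conversions += 1
        return self._rows


# Usage: the object stays in the host language until a scan is needed.
data = {"id": [1, 2, 3], "score": [0.5, 0.9, 0.1]}
vt = VirtualTable("results", data)
table_stats = vt.stats()   # no conversion has happened yet
rows = vt.scan()           # first scan triggers the one conversion
rows_again = vt.scan()     # second scan reuses the converted rows
```

In this sketch the metadata call and the data scan are separate, so statistics are available even if the query never touches the rows, and repeated scans pay the conversion cost once.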
