AIDA - Abstraction for Advanced In-Database Analytics

With the tremendous growth in data science and machine learning, it has become increasingly clear that traditional relational database management systems (RDBMS) lack appropriate support for the programming paradigms these applications require, so their developers prefer tools that perform the computation outside the database system. While the database community has attempted to integrate some of these tools into the RDBMS, this has not reversed the trend, as existing solutions are often inconvenient for the incremental, iterative development approach used in these fields. In this paper, we propose AIDA, an abstraction for advanced in-database analytics. AIDA emulates the syntax and semantics of popular data science packages but transparently executes the required transformations and computations inside the RDBMS. In particular, AIDA works with a regular Python interpreter as a client to connect to the database. Furthermore, it supports the seamless use of both relational and linear algebra operations through a unified abstraction. AIDA relies on the RDBMS engine to efficiently execute relational operations and on an embedded Python interpreter and NumPy to perform linear algebra operations. Data reformatting is done transparently and avoids data copies whenever possible. AIDA does not require changes to statistical packages or the RDBMS, which facilitates portability.

PVLDB Reference Format: Joseph Vinish D’silva, Florestan De Moor, Bettina Kemme. AIDA - Abstraction for Advanced In-Database Analytics. PVLDB, 11(11): 1400-1413, 2018. DOI: https://doi.org/10.14778/3236187.3236194
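
The division of labour described in the abstract, relational operations handled by the database engine and linear algebra handled by NumPy, can be illustrated with a minimal sketch. This is not AIDA's API: `sqlite3` merely stands in for the RDBMS, and the `trips` table, its columns, and the helper names `relational_step` and `linear_algebra_step` are all hypothetical. Unlike AIDA, the sketch also copies the query result into the client process instead of computing inside the database.

```python
import sqlite3

import numpy as np

# Hypothetical toy data; sqlite3 is only a stand-in for the RDBMS engine.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE trips (
        id INTEGER PRIMARY KEY,
        distance REAL,
        duration REAL,
        valid INTEGER
    );
    INSERT INTO trips VALUES
        (1, 1.0, 5.0, 1),
        (2, 2.0, 9.0, 1),
        (3, 3.0, 13.0, 1),
        (4, 9.0, 1.0, 0);
""")


def relational_step(connection):
    """Selection and projection are evaluated by the database engine."""
    rows = connection.execute(
        "SELECT distance, duration FROM trips WHERE valid = 1 ORDER BY id"
    ).fetchall()
    # Here the result is copied into a NumPy array on the client side.
    return np.asarray(rows, dtype=float)


def linear_algebra_step(data):
    """Least-squares fit of duration = slope * distance + intercept."""
    design = np.column_stack([data[:, 0], np.ones(len(data))])
    target = data[:, 1]
    coeffs, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coeffs


data = relational_step(conn)
slope, intercept = linear_algebra_step(data)
print(f"slope={slope:.3f}, intercept={intercept:.3f}")
```

In this client-side version the data has to leave the database before NumPy can see it, which is exactly the transfer step the abstract says AIDA avoids by running both kinds of operations inside the RDBMS.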
