Scalable Linear Algebra on a Relational Database System

As data analytics has become an important application for modern data management systems, a new category of data management system has appeared recently: the scalable linear algebra system. In this paper, we argue that a parallel or distributed database system is actually an excellent platform upon which to build such functionality. Most relational systems already support cost-based optimization, which is vital to scaling linear algebra computations, and it is well known how to make relational systems scale. We show that by making just a few changes to a parallel/distributed relational database system, such a system can be a competitive platform for scalable linear algebra. Taken together, our results should at least raise the possibility that brand-new systems designed from the ground up to support scalable linear algebra are not absolutely necessary, and that such systems could instead be built on top of existing relational technology. Our results also suggest that if scalable linear algebra is to be added to a modern dataflow platform such as Spark, it should be added on top of the system's more structured (relational) data abstractions, rather than being constructed directly on top of the system's raw dataflow operators.
