Scalable Linear Algebra Programming for Big Data Analysis

Arrays are fundamental data structures in many data-centric and scientific applications. One of the most effective representations of large dense arrays in a distributed setting is a block array, such as a tiled matrix, which is a distributed collection of non-overlapping dense array blocks. Although many linear algebra libraries for machine learning support distributed block arrays and provide optimal implementations of many array operations, these libraries do not support ad-hoc array programming or customized storage structures. Imperative programs with loops and array indexing, on the other hand, are more expressive because they allow arbitrary array computations, but they are hard to parallelize and convert to distributed programs. Our goal is to provide an SQL-like abstraction for data-parallel distributed array computations that is expressive enough to capture a large class of array computations and can be compiled to efficient data-parallel distributed code. Our abstraction is a monolithic array construction in the form of an array comprehension, which is as expressive as SQL because it supports a group-by syntax that lets us capture many array computations in declarative form. We present rules for translating array comprehensions on block arrays to data-parallel distributed code that runs on Apache Spark. We describe a comprehensive set of optimizations that produce very efficient translations, such as the optimal block matrix multiplication algorithm, even though the optimizations themselves are oblivious to linear algebra operations. Finally, we support our claims by evaluating the performance of our generated code on Apache Spark relative to Spark MLlib.
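To make the idea of a block array and a group-by formulation of matrix multiplication concrete, the following is a minimal single-machine sketch in plain Python. It is not the comprehension syntax or the generated Spark code described above; the function names (`to_tiles`, `tiled_matmul`, `from_tiles`) and the dictionary-of-tiles layout are assumptions made only for illustration. A tiled matrix is modeled as a mapping from block coordinates to dense blocks, and the product is computed by joining tiles of A and B on the shared block index k, multiplying each matching pair, and then grouping the partial products by the output coordinate (i, j) and summing them.

```python
from collections import defaultdict


def to_tiles(matrix, tile):
    """Split a dense matrix (list of rows) into non-overlapping tiles.

    Returns a dict {(block_row, block_col): block}; this layout is an
    illustrative assumption, not the storage format of any real system.
    """
    rows, cols = len(matrix), len(matrix[0])
    return {
        (i // tile, j // tile): [row[j:j + tile] for row in matrix[i:i + tile]]
        for i in range(0, rows, tile)
        for j in range(0, cols, tile)
    }


def block_mul(a, b):
    """Dense product of two small blocks."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def block_add(a, b):
    """Element-wise sum of two blocks of equal shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def tiled_matmul(A, B):
    """Join tiles on the shared index k, then group by (i, j) and sum."""
    partial = defaultdict(list)
    for (i, k), a in A.items():          # join phase (nested-loop join here)
        for (kk, j), b in B.items():
            if k == kk:
                partial[(i, j)].append(block_mul(a, b))
    result = {}
    for key, blocks in partial.items():  # group-by phase: reduce each group with +
        acc = blocks[0]
        for blk in blocks[1:]:
            acc = block_add(acc, blk)
        result[key] = acc
    return result


def from_tiles(tiles):
    """Reassemble a dense matrix from its tiles."""
    n_block_rows = 1 + max(i for i, _ in tiles)
    n_block_cols = 1 + max(j for _, j in tiles)
    out = []
    for bi in range(n_block_rows):
        for r in range(len(tiles[(bi, 0)])):
            row = []
            for bj in range(n_block_cols):
                row.extend(tiles[(bi, bj)][r])
            out.append(row)
    return out


if __name__ == "__main__":
    M = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    N = [[1, 0, 2], [0, 1, 0], [3, 0, 1]]
    product = from_tiles(tiled_matmul(to_tiles(M, 2), to_tiles(N, 2)))
    print(product == block_mul(M, N))
```

In a distributed setting the join and the group-by would each involve shuffling tiles between workers, which is why the choice of translation for this pattern matters for performance; the sketch only shows the logical structure.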
