Translation of array-based loops to distributed data-parallel programs

Large volumes of data generated by scientific experiments and simulations come in the form of arrays, while programs that analyze these data are frequently expressed in terms of array operations in an imperative, loop-based language. As datasets grow larger, however, new frameworks for distributed Big Data analytics have become essential tools for large-scale scientific computing. Scientists, who are typically comfortable with numerical analysis tools but are not familiar with the intricacies of Big Data analytics, must now learn to convert their loop-based programs to distributed data-parallel programs. We present a novel framework for translating programs expressed as array-based loops to distributed data-parallel programs that is more general and more efficient than related work. We report on a prototype implementation on top of Spark and evaluate the performance of our system relative to hand-written programs.
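To make the kind of conversion described above concrete, the following minimal sketch contrasts a loop-based matrix multiplication with an equivalent formulation built from a join followed by a reduce-by-key over collections of `((row, col), value)` entries. It is an illustration only, not the translation scheme of the framework itself: the function names (`matmul_loop`, `to_entries`, `matmul_data_parallel`) are hypothetical, and plain Python lists and dictionaries stand in for distributed collections such as Spark RDDs.

```python
from collections import defaultdict


def matmul_loop(A, B):
    """Imperative, loop-based version: C[i][j] += A[i][k] * B[k][j]."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                C[i][j] += A[i][k] * B[k][j]
    return C


def to_entries(M):
    """Represent a dense matrix as a collection of ((row, col), value) pairs."""
    return [((i, j), v) for i, row in enumerate(M) for j, v in enumerate(row)]


def matmul_data_parallel(a, b):
    """Same computation as a join on the shared index k, then a sum per (i, j).

    In a distributed setting the join and the per-key sum would be bulk
    operations over partitioned collections; here they are simulated locally.
    """
    # Index the entries of b by row k so they can be joined with the columns of a.
    b_by_row = defaultdict(list)
    for (k, j), y in b:
        b_by_row[k].append((j, y))

    # Join: pair every a[i][k] with every b[k][j] and emit a partial product.
    products = [((i, j), x * y) for (i, k), x in a for (j, y) in b_by_row[k]]

    # Reduce by key: add up the partial products that share the same (i, j).
    result = defaultdict(float)
    for key, value in products:
        result[key] += value
    return dict(result)


if __name__ == "__main__":
    A = [[1, 2], [3, 4]]
    B = [[5, 6], [7, 8]]
    print(matmul_loop(A, B))
    print(matmul_data_parallel(to_entries(A), to_entries(B)))
```

The loop version updates `C[i][j]` one element at a time, whereas the second version has no loop-carried state: each partial product is independent, and the only coordination is the per-key sum, which is what makes this style amenable to distributed execution.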
