Efficient iterative processing in the SciDB parallel array engine

Many data-intensive scientific applications perform iterative computations on array data. Several engines specialize in array processing and efficiently support various types of operations, but none includes native support for iterative processing. In this paper, we develop a model for iterative array computations and a series of optimizations. We evaluate the benefits of optimized, native support for iterative array processing using the SciDB engine and real workloads from the astronomy domain.
