Spartan: A Distributed Array Framework with Smart Tiling

Application programmers in domains like machine learning, scientific computing, and computational biology are accustomed to using powerful, high productivity array languages such as MatLab, R and NumPy. Distributed array frameworks aim to scale array programs across machines. However, maximizing the locality of access to distributed arrays is an unsolved problem; such locality is critical for high performance. This paper presents Spartan, a distributed array framework that automatically determines how to best partition (aka "tile") n-dimensional arrays and to co-locate data with computation to maximize locality. Spartan combines a lazy-evaluation based, optimizing frontend with a distributed tiled array backend. Central to Spartan's design is a small number of carefully chosen parallel high-level operators, which form the expression graph captured by Spartan's frontend during runtime. These operators simplify the programming of distributed applications. More importantly, their well-defined semantics allow Spartan's runtime to calculate the costs of different tiling strategies and pick the best one for evaluating the entire expression graph. Using Spartan, we have implemented 12 applications from a variety of domains including machine learning and scientific computing. Our evaluations show that Spartan's automatic tiling mechanism leads to good and scalable performance while eliminating the need for manual tiling.

[1]  Yuan Yu,et al.  Dryad: distributed data-parallel programs from sequential building blocks , 2007, EuroSys '07.

[2]  Franz Franchetti,et al.  Data Layout Transformation for Stencil Computations on Short-Vector SIMD Architectures , 2011, CC.

[3]  Joseph E. Gonzalez,et al.  GraphLab: A New Parallel Framework for Machine Learning , 2010 .

[4]  Scott Shenker,et al.  Shark: SQL and rich analytics at scale , 2012, SIGMOD '13.

[5]  Robert J. Harrison,et al.  Global arrays: A nonuniform memory access programming model for high-performance computers , 1996, The Journal of Supercomputing.

[6]  R Core Team,et al.  R: A language and environment for statistical computing. , 2014 .

[7]  Craig Chambers,et al.  FlumeJava: easy, efficient data-parallel pipelines , 2010, PLDI '10.

[8]  Ulrich Kremer,et al.  NP-completeness of Dynamic Remapping , 1993 .

[9]  Andrey Gubarev,et al.  Dremel : Interactive Analysis of Web-Scale Datasets , 2011 .

[10]  Jinyang Li,et al.  Building fast, distributed programs with partitioned tables , 2010 .

[11]  Aart J. C. Bik,et al.  Pregel: a system for large-scale graph processing , 2010, SIGMOD Conference.

[12]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[13]  Sanjay Ghemawat,et al.  MapReduce: simplified data processing on large clusters , 2008, CACM.

[14]  Scott Shenker,et al.  Spark: Cluster Computing with Working Sets , 2010, HotCloud.

[15]  Robert A. van de Geijn,et al.  Elemental: A New Framework for Distributed Memory Dense Matrix Computations , 2013, TOMS.

[16]  P. Sadayappan,et al.  Communication-Free Hyperplane Partitioning of Nested Loops , 1993, J. Parallel Distributed Comput..

[17]  Scott A. Mahlke,et al.  Data Access Partitioning for Fine-grain Parallelism on Multicore Architectures , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[18]  Guy L. Steele,et al.  Data Optimization: Allocation of Arrays to Reduce Communication on SIMD Machines , 1990, J. Parallel Distributed Comput..

[19]  Peter Baumann,et al.  The multidimensional database system RasDaMan , 1998, SIGMOD '98.

[20]  Carlos Maltzahn,et al.  SciHadoop: Array-based query processing in Hadoop , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[21]  Uday Bondhugula,et al.  Data Layout Transformation for Enhancing Data Locality on NUCA Chip Multiprocessors , 2009, 2009 18th International Conference on Parallel Architectures and Compilation Techniques.

[22]  Lawrence Snyder,et al.  ZPL: An Array Sublanguage , 1993, LCPC.

[23]  Marina C. Chen,et al.  The Data Alignment Phase in Compiling Programs for Distrubuted-Memory Machines , 1991, J. Parallel Distributed Comput..

[24]  Jean-Philippe Martin,et al.  Dandelion: a compiler and runtime for heterogeneous systems , 2013, SOSP.

[25]  John Cavazos,et al.  Trace-Driven Memory Access Pattern Recognition in Computational Kernels , 2014 .

[26]  John Glauert,et al.  SISAL: streams and iteration in a single assignment language. Language reference manual, Version 1. 2. Revision 1 , 1985 .

[27]  Steven Hand,et al.  CIEL: A Universal Execution Engine for Distributed Data-Flow Computing , 2011, NSDI.

[28]  Jingke Li,et al.  Index domain alignment: minimizing cost of cross-referencing between distributed arrays , 1990, [1990 Proceedings] The Third Symposium on the Frontiers of Massively Parallel Computation.

[29]  Oscar R. Hernandez,et al.  Open64-based Regular Stencil Shape Recognition in HERCULES , 2013 .

[30]  Michael Philippsen,et al.  Automatic alignment of array data and processes to reduce communication time on DMPPs , 1995, PPOPP '95.

[31]  Yingwei Luo,et al.  Optimal Cache Partition-Sharing , 2015, 2015 44th International Conference on Parallel Processing.

[32]  Jinyang Li,et al.  Piccolo: Building Fast, Distributed Programs with Partitioned Tables , 2010, OSDI.

[33]  Michael Isard,et al.  DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language , 2008, OSDI.

[34]  Jacobi Pattern Driven Automatic Parallelization , 2004 .

[35]  Michael Stonebraker,et al.  SciDB: A Database Management System for Applications with Complex Analytics , 2013, Computing in Science & Engineering.

[36]  Marwan A. Jabri,et al.  Automatic array alignment in parallel Matlab scripts , 1999, Proceedings 13th International Parallel Processing Symposium and 10th Symposium on Parallel and Distributed Processing. IPPS/SPDP 1999.

[37]  Keshav Pingali,et al.  Solving Alignment Using Elementary Linear Algebra , 1994, LCPC.

[38]  Santosh G. Abraham,et al.  Compiler techniques for data partitioning of sequentially iterated parallel loops , 1990, ICS '90.

[39]  M. Abadi,et al.  Naiad: a timely dataflow system , 2013, SOSP.

[40]  Gustavo Alonso,et al.  Pydron: Semi-Automatic Parallelization for Multi-Core and the Cloud , 2014, OSDI.

[41]  Katherine Yelick,et al.  UPC Language Specifications V1.1.1 , 2003 .

[42]  J. Ramanujam,et al.  A methodology for parallelizing programs for multicomputers and complex memory multiprocessors , 1989, Proceedings of the 1989 ACM/IEEE Conference on Supercomputing (Supercomputing '89).

[43]  Guy E. Blelloch,et al.  NESL: A Nested Data-Parallel Language (Version 2.6) , 1993 .

[44]  Tim Kraska,et al.  MLI: An API for Distributed Machine Learning , 2013, 2013 IEEE 13th International Conference on Data Mining.

[45]  Ken Kennedy,et al.  Automatic data layout for distributed-memory machines , 1998, TOPL.

[46]  Jack Dongarra,et al.  LAPACK: a portable linear algebra library for high-performance computers , 1990, SC.

[47]  J. Ramanujam,et al.  Compile-Time Techniques for Data Distribution in Distributed Memory Machines , 1991, IEEE Trans. Parallel Distributed Syst..

[48]  Zhengping Qian,et al.  MadLINQ: large-scale distributed matrix computation for the cloud , 2012, EuroSys '12.

[49]  Robert R. Lewis,et al.  Using the Global Arrays Toolkit to Reimplement NumPy for Distributed Computation , 2011 .

[50]  John Glauert,et al.  SISAL: streams and iteration in a single-assignment language. Language reference manual, Version 1. 1 , 1983 .

[51]  Henry G. Dietz,et al.  Reduction of Cache Coherence Overhead by Compiler Data Layout and Loop Transformation , 1991, LCPC.

[52]  Paul M. Anderson The Use and Limitations of Static-Analysis Tools to Improve Software Quality , 2008 .

[53]  William Gropp,et al.  Efficient Management of Parallelism in Object-Oriented Numerical Software Libraries , 1997, SciTools.

[54]  Christopher Olston,et al.  Interactive Analysis of Web-Scale Data , 2009, CIDR.

[55]  Alan Edelman,et al.  Parallel MATLAB: Doing it Right , 2005, Proceedings of the IEEE.

[56]  Jean-Luc Gaudiot,et al.  The Sisal model of functional programming and its implementation , 1997, Proceedings of IEEE International Symposium on Parallel Algorithms Architecture Synthesis.

[57]  Robert W. Numrich,et al.  Co-array Fortran for parallel programming , 1998, FORF.

[58]  Guy E. Blelloch,et al.  NESL: A Nested Data-Parallel Language , 1992 .

[59]  Michael A. Frumkin,et al.  Automatic Recognition of Performance Idioms in Scientific Applications , 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.

[60]  Murray Stokely,et al.  Large-Scale Parallel Statistical Forecasting Computations in R , 2011 .

[61]  Charles L. Lawson,et al.  Basic Linear Algebra Subprograms for Fortran Usage , 1979, TOMS.

[62]  Erik H. D'Hollander,et al.  Partitioning and Labeling of Index Sets in DO Loops with Constant Dependence Vectors , 1989, ICPP.

[63]  Jack Dongarra,et al.  ScaLAPACK: a scalable linear algebra library for distributed memory concurrent computers , 1992, [Proceedings 1992] The Fourth Symposium on the Frontiers of Massively Parallel Computation.

[64]  Alvin AuYoung,et al.  Presto: distributed machine learning and graph processing with sparse matrices , 2013, EuroSys '13.