Block-parallel data analysis with DIY2

DIY2 is a programming model and runtime for block-parallel analytics on distributed-memory machines. Its main abstraction is block-structured data parallelism: data are decomposed into blocks; blocks are assigned to processing elements (processes or threads); computation is described as iterations over these blocks; and communication between blocks is defined by reusable patterns. Because computation is expressed in this general form, the DIY2 runtime is free to optimize the movement of blocks between slow and fast memories (disk and flash vs. DRAM) and to execute blocks residing in memory concurrently with multiple threads. The same program can therefore execute in-core or out-of-core, serially or in parallel, single-threaded or multithreaded, or in any combination of these modes. This paper describes the implementation of the main features of the DIY2 programming model and the optimizations that improve its performance. DIY2 is evaluated on complete analysis codes.
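
The four steps named in the abstract (decompose, assign, iterate over blocks, communicate through a reusable pattern) can be illustrated with a small serial sketch. This is not the DIY2 API: DIY2 is a C++ library, and every name below (decompose, assign, foreach, merge_reduce) is a hypothetical stand-in chosen for illustration. The sketch assumes a one-dimensional data set, round-robin assignment, and a k-ary merge reduction as the communication pattern.

```python
from functools import reduce
from typing import Callable, Dict, List, Sequence


def decompose(data: Sequence[int], nblocks: int) -> List[List[int]]:
    """Split data into nblocks contiguous blocks whose sizes differ by at most one."""
    base, extra = divmod(len(data), nblocks)
    blocks, start = [], 0
    for gid in range(nblocks):
        size = base + (1 if gid < extra else 0)
        blocks.append(list(data[start:start + size]))
        start += size
    return blocks


def assign(nblocks: int, nprocs: int) -> Dict[int, int]:
    """Map each block id to a processing element, round-robin (an assumed policy)."""
    return {gid: gid % nprocs for gid in range(nblocks)}


def foreach(blocks: List[List[int]], fn: Callable[[int, List[int]], int]) -> List[int]:
    """Apply fn to every block. The caller does not say where or when each block
    runs, which is what leaves a runtime free to schedule or swap blocks."""
    return [fn(gid, block) for gid, block in enumerate(blocks)]


def merge_reduce(values: List[int], op: Callable[[int, int], int], k: int = 2) -> int:
    """Combine per-block results in rounds, merging groups of up to k values per round."""
    if not values:
        raise ValueError("merge_reduce needs at least one value")
    while len(values) > 1:
        values = [reduce(op, values[i:i + k]) for i in range(0, len(values), k)]
    return values[0]


if __name__ == "__main__":
    data = list(range(10))
    blocks = decompose(data, nblocks=4)
    owners = assign(nblocks=4, nprocs=2)
    partial = foreach(blocks, lambda gid, block: sum(block))
    total = merge_reduce(partial, lambda a, b: a + b)
    print(blocks, owners, partial, total)
```

Because the per-block function never refers to a rank, a thread, or the storage location of a block, the same call sequence would remain valid whether the blocks are processed one at a time, by several threads, or loaded from disk on demand. That independence is the property the abstract attributes to the block-parallel form.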