A Scalable Multi-Granular Data Model for Data Parallel Workflows

Scientific applications consist of many tasks, each with its own requirements for degree of parallelism and data access pattern. To satisfy these requirements, a task scheduler has to assign the required number of processes to each task, and each task's input has to be decomposed and distributed across those processes according to its data access pattern so that data locality can be exploited. However, writing this code by hand is tedious and error-prone. We propose a multi-view data model in which users specify, through simple directives, rules for decomposing multi-dimensional data to change the data layout across processes and define the unit of parallel processing. Our framework performs data arrangement and affinity-aware task scheduling transparently to users by following the specified rules. Through a case study of a lattice QCD simulation program, we confirmed that our proposal reduces programming effort compared with hand-written MPI code, with performance penalties of up to 17%.
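To make the idea of decomposing the same multi-dimensional data under different rules more concrete, the following is a minimal sketch in Python. It is not the framework's directive syntax or API, which the abstract does not show; the names `block_bounds` and `decompose`, and the use of a plain block decomposition over a Cartesian process grid, are assumptions made purely for illustration.

```python
from itertools import product


def block_bounds(extent, parts, index):
    """Return the half-open range [start, stop) of block `index` when
    `extent` elements are split into `parts` nearly equal blocks."""
    base, extra = divmod(extent, parts)
    # The first `extra` blocks each receive one additional element.
    start = index * base + min(index, extra)
    stop = start + base + (1 if index < extra else 0)
    return start, stop


def decompose(shape, proc_grid):
    """Map each rank to the tuple of slices it owns under a block
    decomposition of an array of `shape` over `proc_grid` processes.

    Hypothetical helper: ranks are numbered in row-major order over
    the process grid."""
    if len(shape) != len(proc_grid):
        raise ValueError("shape and proc_grid must have the same rank")
    layout = {}
    grid_coords = product(*(range(p) for p in proc_grid))
    for rank, coords in enumerate(grid_coords):
        layout[rank] = tuple(
            slice(*block_bounds(n, p, c))
            for n, p, c in zip(shape, proc_grid, coords)
        )
    return layout


# Two "views" of the same 8x8 array on four processes (assumed example):
row_view = decompose((8, 8), (4, 1))   # split along the first axis only
tile_view = decompose((8, 8), (2, 2))  # split into 2x2 tiles
```

Under these assumptions, rank 1 owns rows 2 to 3 of every column in `row_view`, whereas in `tile_view` it owns the upper-right 4x4 tile. Switching between such layouts is the kind of data rearrangement that, according to the abstract, the framework carries out on the user's behalf.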
