DAGuE: A Generic Distributed DAG Engine for High Performance Computing

The frenetic pace of architectural development places a strain on current state-of-the-art programming environments. Harnessing the full potential of such architectures has become a tremendous challenge for the entire scientific computing community. We present DAGuE, a generic framework for architecture-aware scheduling and management of micro-tasks on distributed many-core heterogeneous architectures. The applications we consider can be represented as a Directed Acyclic Graph (DAG) of tasks with labeled edges designating data dependencies. DAGs are represented in a compact, problem-size-independent format that can be queried on demand, in a fully distributed fashion, to discover data dependencies. DAGuE assigns computation threads to cores, overlaps communication with computation, and uses a dynamic, fully distributed scheduler based on cache awareness, data locality, and task priority. We demonstrate the efficiency of our approach using several micro-benchmarks that analyze the performance of different components of the framework, and a linear algebra factorization as a use case.
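To illustrate what a compact, problem-size-independent DAG representation can look like, the C sketch below encodes the successor relation of one task class of a tiled Cholesky factorization as a function of task indices, so dependencies are discovered by evaluating a query rather than by walking a materialized graph. This is a hypothetical illustration of the idea only, not DAGuE's actual task-description format; all identifiers (task_t, successors_of_potrf, etc.) are assumptions introduced for this example.

/*
 * Illustrative sketch (not DAGuE's task-description format): a
 * parameterized representation of part of a tiled Cholesky task graph.
 * Successors are computed from task indices on demand, so the
 * representation does not grow with the number of tiles.
 */
#include <stdio.h>

typedef enum { POTRF, TRSM, SYRK, GEMM } task_kind_t;

typedef struct { task_kind_t kind; int k, m, n; } task_t;

/* Enumerate the immediate successors of POTRF(k) for an NT x NT tile
 * matrix: POTRF(k) -> TRSM(k, m) for m = k+1 .. NT-1.
 * Returns the number of successors written into out[]. */
static int successors_of_potrf(int k, int NT, task_t *out)
{
    int count = 0;
    for (int m = k + 1; m < NT; ++m)
        out[count++] = (task_t){ .kind = TRSM, .k = k, .m = m, .n = -1 };
    return count;
}

int main(void)
{
    const int NT = 4;            /* 4 x 4 tiles, chosen arbitrarily */
    task_t succ[16];
    int n = successors_of_potrf(1, NT, succ);

    /* The query is resolved locally, without storing the full DAG. */
    for (int i = 0; i < n; ++i)
        printf("POTRF(1) -> TRSM(k=%d, m=%d)\n", succ[i].k, succ[i].m);
    return 0;
}

Because each node can evaluate such queries independently, a distributed runtime of this style only needs the symbolic task description and the task indices it owns, which is consistent with the problem-size-independent, on-demand dependency discovery described above.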
