Dense Linear Algebra on Distributed Heterogeneous Hardware with a Symbolic DAG Approach

George Bosilca, Thomas Herault, Aurelien Bouteiller, Piotr Luszczek, Anthony Danalis, Jack J. Dongarra

January 24, 2012

Introduction and Motivation

Among the various factors that drive the momentous changes occurring in the design of microprocessors and high-end systems [1], three stand out as especially notable:

1. the number of transistors per chip will continue the current trend, i.e., double roughly every 18 months, while the speed of processor clocks will cease to increase;

2. the physical limit on the number and bandwidth of CPU pins is becoming a near-term reality;

3. a strong drift toward hybrid/heterogeneous systems for petascale (and larger) systems is taking place.

While the first two involve fundamental physical limitations that current technology trends are unlikely to overcome in the near term, the third is an obvious consequence of the first two, combined with the economic necessity of using many thousands of computational units to scale up to petascale and larger systems.

More transistors and slower clocks require multicore designs and increased parallelism. The fundamental levers of traditional processor design – increasing transistor density, speeding up the clock rate, lowering the voltage – have now been blocked by a set of physical barriers: excess heat produced, too much power consumed, too much energy leaked, and useful signal overcome by noise. Multicore designs are a natural evolutionary response to this situation. By putting multiple processor cores on a single die, architects can overcome the previous limitations and continue to increase the number of gates per chip without increasing power densities. However, since excess heat production means that frequencies cannot be increased any further, deep-and-narrow pipeline models will tend to recede as shallow-and-wide pipeline designs become the norm. Moreover, despite obvious similarities, multicore processors are not equivalent to multiple CPUs or to SMPs: multiple cores on the same chip can share various resources, such as on-chip caches and the memory bus, that separate processors do not.
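To make the first trend concrete, the following sketch (a hypothetical illustration, not taken from the paper) projects transistor counts under the stated doubling period of 18 months, i.e. count(t) = count(0) * 2^(t/1.5) with t in years; the one-billion-transistor baseline is an assumption chosen only for the example.

```c
#include <stdio.h>
#include <math.h>

/* Hypothetical illustration of the 18-month doubling trend cited above:
 * count(t) = count(0) * 2^(t / 1.5), with t measured in years.
 * The one-billion-transistor baseline is an assumed starting point,
 * not a figure from the paper. */
int main(void)
{
    const double base_count      = 1.0e9; /* assumed baseline: 10^9 transistors */
    const double doubling_period = 1.5;   /* years per doubling (18 months) */

    for (int years = 0; years <= 9; years += 3) {
        /* compound growth: two doublings every three years */
        double projected = base_count * pow(2.0, years / doubling_period);
        printf("after %2d years: %.2e transistors\n", years, projected);
    }
    return 0;
}
```

Under these assumptions the count quadruples every three years, which is the arithmetic behind the claim that added parallelism, rather than higher clock rates, must absorb the extra transistors.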

References

[1] William Pugh, et al. The Omega test: A fast and practical integer programming algorithm for dependence analysis, 1991, Proceedings of the 1991 ACM/IEEE Conference on Supercomputing (Supercomputing '91).

[2] Katherine A. Yelick, et al. Multi-threading and one-sided communication in parallel LU factorization, 2007, Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC '07).

[3] J. Hess. Panel Methods in Computational Fluid Dynamics, 1990.

[4] Cédric Augonnet, et al. StarPU: a unified platform for task scheduling on heterogeneous multicore architectures, 2011, Concurr. Comput. Pract. Exp.

[5] Jack J. Dongarra, et al. The LINPACK Benchmark: past, present and future, 2003, Concurr. Comput. Pract. Exp.

[6] Julien Langou, et al. Parallel tiled QR factorization for multicore architectures, 2007, Concurr. Comput. Pract. Exp.

[7] Jesús Labarta, et al. A dependency-aware task-based programming environment for multi-core architectures, 2008, IEEE International Conference on Cluster Computing.

[8] Thomas Hérault, et al. Flexible Development of Dense Linear Algebra Algorithms on Massively Parallel Architectures with DPLASMA, 2011, IEEE International Symposium on Parallel and Distributed Processing Workshops and PhD Forum.

[9] Serge G. Petiton, et al. Workflow Global Computing with YML, 2006, 7th IEEE/ACM International Conference on Grid Computing.

[10] Rajkumar Buyya, et al. A Taxonomy of Workflow Management Systems for Grid Computing, 2005, Proceedings of the 38th Annual Hawaii International Conference on System Sciences.

[11] J. Hess, et al. Calculation of potential flow about arbitrary bodies, 1967.

[12] G.E. Moore. Cramming More Components Onto Integrated Circuits, 1998, Proceedings of the IEEE.

[13] Robert A. van de Geijn, et al. SuperMatrix: a multithreaded runtime scheduling system for algorithms-by-blocks, 2008, PPoPP.

[14] Julien Langou, et al. The Impact of Multicore on Math Software, 2006, PARA.

[15] C. Van Loan, et al. A Storage-Efficient WY Representation for Products of Householder Transformations, 1989.

[16] John A. Sharp, et al. Data flow computing: theory and practice, 1992.

[17] J.J.H. Wang. Generalised moment methods in electromagnetics, 1990.

[18] Thomas Hérault, et al. DAGuE: A Generic Distributed DAG Engine for High Performance Computing, 2011, IEEE International Symposium on Parallel and Distributed Processing Workshops and PhD Forum.

[19] Emmanuel Jeannot, et al. Compact DAG representation and its symbolic scheduling, 1999, J. Parallel Distributed Comput.

[20] Eduardo F. D'Azevedo, et al. Complex version of high performance computing LINPACK benchmark (HPL), 2010, Concurr. Comput. Pract. Exp.

[21] Jack Dongarra, et al. Numerical linear algebra on emerging architectures: The PLASMA and MAGMA projects, 2009.

[22] Jack Dongarra, et al. ScaLAPACK Users' Guide, 1997.

[23] R.H. Dennard, et al. Design of Ion-Implanted MOSFET's with Very Small Physical Dimensions, 1974, Proceedings of the IEEE.

[24] Lars Karlsson, et al. Distributed SBP Cholesky factorization algorithms with near-optimal scheduling, 2009, TOMS.

[25] Allen D. Malony, et al. The open trace format (OTF) and open tracing for HPC, 2006, SC.

[26] Herb Sutter. The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software, 2005, Dr. Dobb's Journal.

[27] Emmanuel Jeannot, et al. Automatic Multithreaded Parallel Program Generation for Message Passing Multiprocessors Using Parameterized Task Graphs, 2002.

[28] R. Dolbeau, et al. HMPP™: A Hybrid Multi-core Parallel Programming Environment, 2007.

[29] Thomas Hérault, et al. Performance Portability of a GPU Enabled Factorization with the DAGuE Framework, 2011, IEEE International Conference on Cluster Computing.

[30] Arthur J. Bernstein. Analysis of Programs for Parallel Processing, 1966, IEEE Trans. Electron. Comput.

[31] Jack J. Dongarra, et al. Dynamic task scheduling for linear algebra algorithms on distributed-memory multicore systems, 2009, Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis.

[32] Franck Cappello, et al. Grid'5000: A Large Scale and Highly Reconfigurable Experimental Grid Testbed, 2006, Int. J. High Perform. Comput. Appl.

[33] G. W. Stewart. The decompositional approach to matrix computation, 2000, Comput. Sci. Eng.

[34] Alan Edelman. Large Dense Numerical Linear Algebra in 1993: the Parallel Computing Influence, 1993, Int. J. High Perform. Comput. Appl.

[35] R. Harrington. Origin and development of the method of moments for field computation, 1990, IEEE Antennas and Propagation Magazine.

[36] Peter J. Denning, et al. Operating Systems Theory, 1973.

[37] Jack J. Dongarra, et al. Analysis of dynamically scheduled tile algorithms for dense linear algebra on multicore architectures, 2011, Concurr. Comput. Pract. Exp.

[38] Emmanuel Jeannot, et al. Automatic Parallelization Techniques Based on Compact DAG Extraction and Symbolic Scheduling, 2001, Parallel Process. Lett.