Scalable Dense Linear Algebra on Heterogeneous Hardware

The design of systems exceeding 1 Pflop/s, and the push toward 1 Eflop/s, has forced a dramatic shift in hardware design. Various physical and engineering constraints have led to massive parallelism and to functional hybridization through the use of accelerator units. This paradigm change poses a serious challenge for application developers, as the management of multicore proliferation and heterogeneity now rests on software, and it is reasonable to expect that this situation will not change in the foreseeable future. This chapter presents a methodology for dealing with this issue in three common scenarios. In the context of shared-memory multicore installations, we show how high performance and scalability go hand in hand when well-known linear algebra algorithms are recast in terms of Directed Acyclic Graphs (DAGs), which are then transparently scheduled at runtime inside the Parallel Linear Algebra Software for Multicore Architectures (PLASMA) project. Similarly, Matrix Algebra on GPU and Multicore Architectures (MAGMA) schedules DAG-driven computations on multicore processors and accelerators. Finally, Distributed PLASMA (DPLASMA) extends the approach to distributed-memory machines, using automatic dependence analysis and the Directed Acyclic Graph Engine (DAGuE) to deliver high performance at the scale of many thousands of cores.

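To make the DAG-based execution model concrete, the sketch below unfolds a tile Cholesky factorization, the canonical PLASMA-style example, into four kernels (POTRF, TRSM, SYRK, GEMM) whose per-tile read/write sets define the task graph that a runtime such as QUARK or DAGuE would schedule. This is a minimal, sequential illustration with naive hand-written kernels; the function names are hypothetical stand-ins for illustration only, not the PLASMA or DAGuE API, and a real runtime would execute independent tasks concurrently instead of in loop order.

```c
/* Minimal sketch: tile Cholesky (lower, right-looking) on an NT x NT grid
 * of B x B tiles. Kernel names are illustrative stand-ins, not PLASMA API.
 * Each kernel call below corresponds to one task in the DAG; the tiles it
 * reads and writes define the dependence edges a runtime would track. */
#include <math.h>
#include <stdio.h>

#define NT 4               /* tiles per dimension */
#define B  4               /* tile size           */
typedef double Tile[B][B];

/* Cholesky of one diagonal tile: A = L * L^T, L stored in place. */
static void tile_potrf(Tile A) {
    for (int j = 0; j < B; j++) {
        for (int k = 0; k < j; k++) A[j][j] -= A[j][k] * A[j][k];
        A[j][j] = sqrt(A[j][j]);
        for (int i = j + 1; i < B; i++) {
            for (int k = 0; k < j; k++) A[i][j] -= A[i][k] * A[j][k];
            A[i][j] /= A[j][j];
        }
    }
}
/* Triangular solve: A = A * L^{-T}, with L from tile_potrf. */
static void tile_trsm(Tile L, Tile A) {
    for (int j = 0; j < B; j++)
        for (int i = 0; i < B; i++) {
            for (int k = 0; k < j; k++) A[i][j] -= A[i][k] * L[j][k];
            A[i][j] /= L[j][j];
        }
}
/* Symmetric rank-B update of a diagonal tile: C -= A * A^T. */
static void tile_syrk(Tile A, Tile C) {
    for (int i = 0; i < B; i++)
        for (int j = 0; j < B; j++)
            for (int k = 0; k < B; k++) C[i][j] -= A[i][k] * A[j][k];
}
/* General update of an off-diagonal tile: C -= A * Bt^T. */
static void tile_gemm(Tile A, Tile Bt, Tile C) {
    for (int i = 0; i < B; i++)
        for (int j = 0; j < B; j++)
            for (int k = 0; k < B; k++) C[i][j] -= A[i][k] * Bt[j][k];
}

int main(void) {
    static Tile A[NT][NT];
    /* Build a symmetric, diagonally dominant (hence SPD) test matrix. */
    for (int i = 0; i < NT * B; i++)
        for (int j = 0; j < NT * B; j++)
            A[i / B][j / B][i % B][j % B] =
                (i == j) ? NT * B : 1.0 / (1 + i + j);

    /* The loop nest a DAG runtime unfolds: each call is one task. */
    for (int k = 0; k < NT; k++) {
        tile_potrf(A[k][k]);                 /* POTRF(k)                    */
        for (int i = k + 1; i < NT; i++)
            tile_trsm(A[k][k], A[i][k]);     /* TRSM(i,k) after POTRF(k)    */
        for (int i = k + 1; i < NT; i++) {
            tile_syrk(A[i][k], A[i][i]);     /* SYRK(i,k) after TRSM(i,k)   */
            for (int j = k + 1; j < i; j++)  /* GEMM(i,j,k) after TRSM(i,k) */
                tile_gemm(A[i][k], A[j][k], A[i][j]);  /* and TRSM(j,k)     */
        }
    }
    printf("L[0][0](0,0) = %f\n", A[0][0][0][0]);  /* expect sqrt(16) = 4 */
    return 0;
}
```

Note the key property the chapter exploits: once a TRSM on panel tile (i,k) completes, every update task that reads only that tile and already-updated trailing tiles is free to run, so the runtime overlaps panel and update work across loop iterations instead of synchronizing at each step k.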