Auto-tuned nested parallelism: A way to reduce the execution time of scientific software in NUMA systems

The most computationally demanding scientific problems are solved on large parallel systems. In some cases these systems are Non-Uniform Memory Access (NUMA) multiprocessors made up of a large number of cores that share a hierarchically organized memory. A basic building block of these scientific codes is often matrix multiplication, and other linear algebra packages build directly on the matrix multiplication routine of the BLAS library, which is available both in vendor-supplied packages and in free implementations. The latest versions of this library are multithreaded and can be used efficiently on multicore systems, but when they are called from inside parallel codes, the two levels of parallelism can interfere and degrade performance. In this work, an auto-tuning method is proposed to automatically select the optimal number of threads to use at each parallel level when multithreaded linear algebra routines are called from OpenMP parallel codes. The method is based on a simple but effective theoretical model of the execution time of the two-level routines. The methodology is applied to a two-level matrix-matrix multiplication and to different block matrix factorizations (LU, QR and Cholesky). Traditional schemes that directly use the multithreaded BLAS routine dgemm are compared with schemes that combine the multithreaded dgemm with OpenMP.
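The two-level scheme described above can be sketched in Python with NumPy, as a stand-in for the paper's OpenMP/C setting: an outer pool of threads partitions the result matrix by block rows, while each `np.dot`/`@` call delegates to whatever inner thread count the underlying multithreaded BLAS (e.g. MKL or OpenBLAS) is configured with. The function name `block_gemm` and the fixed `outer_threads=2` split are illustrative assumptions, not the paper's auto-tuned values; NumPy releases the GIL inside the BLAS call, which is what makes the outer level effective.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def block_gemm(A, B, outer_threads=2):
    """Two-level parallel block matrix product C = A @ B (illustrative sketch).

    Outer level: `outer_threads` Python threads, each computing one block
    of rows of C (NumPy releases the GIL inside the BLAS call, so these
    run concurrently).
    Inner level: the thread count used internally by the multithreaded
    BLAS (e.g. MKL/OpenBLAS) behind each matrix-product call; the paper's
    auto-tuning method would choose both counts from a model of the
    execution time rather than fixing them as done here.
    """
    n = A.shape[0]
    # Partition the row indices of C into one chunk per outer thread.
    row_chunks = np.array_split(np.arange(n), outer_threads)
    C = np.empty((n, B.shape[1]), dtype=np.result_type(A, B))

    def work(rows):
        # Inner-level parallelism happens inside this BLAS-backed product.
        C[rows, :] = A[rows, :] @ B

    with ThreadPoolExecutor(max_workers=outer_threads) as pool:
        list(pool.map(work, row_chunks))
    return C
```

In the actual NUMA setting studied in the paper, the outer level would be OpenMP threads and the inner count would be set through the BLAS library (e.g. its thread-control interface), with both values chosen by the auto-tuning model rather than hard-coded.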
