Performance Modeling and Optimal Block Size Selection for the Small-Bulge Multishift QR Algorithm

The small-bulge multishift QR algorithm proposed by Braman, Byers and Mathias is one of the most efficient algorithms for computing the eigenvalues of nonsymmetric matrices on processors with hierarchical memory. To realize its full performance, however, the block size m must be chosen to suit both the target architecture and the matrix size n. In this paper, we construct a performance model for this algorithm. The model is hierarchical, mirroring the structure of the algorithm itself: given n, m, and performance data for the algorithm's basic components, such as the level-3 BLAS routines and the double implicit shift QR routine, it predicts the total execution time. Experiments on SMP machines with PowerPC G5 and Opteron processors show that the predicted variation of execution time as a function of m agrees well with the measurements. Our model can therefore be used to automatically select the optimal value of m for a given matrix size on a given architecture.
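The selection step described above can be sketched as a search over candidate block sizes for the one with the smallest predicted time. This is a minimal illustration only: `select_block_size`, `toy_predict_time`, and the cost terms inside the toy function are hypothetical placeholders and are not the performance model constructed in the paper, which is built from measured data for the component routines.

```python
from typing import Callable, Iterable


def select_block_size(
    n: int,
    candidates: Iterable[int],
    predict_time: Callable[[int, int], float],
) -> int:
    """Return the candidate m with the smallest predicted execution time."""
    return min(candidates, key=lambda m: predict_time(n, m))


def toy_predict_time(n: int, m: int) -> float:
    """Hypothetical stand-in for a performance model (assumption, not the paper's)."""
    # Assumed: the bulk of the work gets cheaper as m grows (better BLAS-3 use).
    blas3_term = n**3 / m
    # Assumed: per-block overhead grows with m.
    overhead_term = n * m**2
    return blas3_term + overhead_term


best_m = select_block_size(1000, [20, 40, 60, 80, 100, 120], toy_predict_time)
print(best_m)
```

In practice `predict_time` would be replaced by a model evaluated from measured timings on the target machine, so the same search yields a different m for each architecture and matrix size.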
