On the Tunability of a New Hessenberg Reduction Algorithm Using Parallel Cache Assignment

The reduction of a general dense square matrix to Hessenberg form is a well-known first step in many standard eigenvalue solvers. Although parallel algorithms exist, Hessenberg reduction remains one of the bottlenecks in aggressive early deflation (AED), a key component of state-of-the-art software for the distributed multishift QR algorithm. We propose a new NUMA-aware algorithm that fits the context of the QR algorithm and evaluate the sensitivity of its algorithmic parameters. The proposed algorithm is faster than LAPACK for all problem sizes and faster than ScaLAPACK for the relatively small problem sizes typical of AED.
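
For readers unfamiliar with the operation, the following is a minimal sketch of the textbook unblocked Householder reduction to upper Hessenberg form, written in plain Python. It is not the NUMA-aware algorithm proposed here, and it makes no attempt at blocking, parallelism, or cache assignment; the function name `hessenberg_reduce` is a hypothetical one chosen for illustration.

```python
import math


def hessenberg_reduce(a):
    """Return an upper Hessenberg matrix orthogonally similar to `a`.

    Illustrative textbook sketch only (unblocked, sequential); `a` is a
    square matrix given as a list of rows.
    """
    n = len(a)
    h = [[float(t) for t in row] for row in a]
    for k in range(n - 2):
        # Build the Householder vector v that zeroes h[k+2:, k].
        x = [h[i][k] for i in range(k + 1, n)]
        norm_x = math.sqrt(sum(t * t for t in x))
        if norm_x == 0.0:
            continue  # column is already in the desired form
        alpha = -math.copysign(norm_x, x[0])  # sign chosen to avoid cancellation
        v = x[:]
        v[0] -= alpha
        norm_v = math.sqrt(sum(t * t for t in v))
        v = [t / norm_v for t in v]
        m = len(v)
        # Left update: H <- (I - 2 v v^T) H, acting on rows k+1..n-1.
        for j in range(n):
            dot = sum(v[i] * h[k + 1 + i][j] for i in range(m))
            for i in range(m):
                h[k + 1 + i][j] -= 2.0 * v[i] * dot
        # Right update: H <- H (I - 2 v v^T), acting on columns k+1..n-1.
        for i in range(n):
            dot = sum(h[i][k + 1 + j] * v[j] for j in range(m))
            for j in range(m):
                h[i][k + 1 + j] -= 2.0 * dot * v[j]
    return h
```

Because each step applies the same orthogonal reflector from the left and from the right, the result has the same eigenvalues as the input, which is why the reduction can serve as a preprocessing step for the QR algorithm.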
