Blocked Algorithms for Reduction of a Regular Matrix Pair to Generalized Schur Form

This contribution considers the problem of transforming a regular matrix pair (A, B) to generalized Schur form. The focus is on blocked algorithms for the reduction process, which typically includes two major steps. The first is a two-stage reduction of a regular matrix pair (A, B) to condensed form (H, T) using orthogonal transformations Q and Z such that H = Q^T A Z is upper Hessenberg and T = Q^T B Z is upper triangular. The second step is the reduction of the (H, T) pair to generalized Schur (GS) form (S, T), where S is upper quasi-triangular and T is upper triangular. A quasi-triangular matrix is block upper triangular with 1 × 1 and 2 × 2 blocks on the diagonal. The 2 × 2 blocks on the diagonal of S − λT correspond to pairs of complex conjugate eigenvalues. These two steps are the major operations when solving for all eigenvalues and eigenvectors of the generalized eigenvalue problem Ax = λBx, where A and B are dense square real n × n matrices.

An elementwise (unblocked) algorithm that reduces a matrix pair (A, B) to upper Hessenberg-triangular form (H, T) is described in [7]. This algorithm is implemented in the LAPACK routine DGGHRD [1]. Since the algorithm is unblocked, the reuse of cached data is low. However, if we allow H to be block upper Hessenberg, we can design more efficient algorithms. In [4], several different blocked algorithms that reduce a regular matrix pair to block upper Hessenberg-triangular form are described, implemented and evaluated. These blocked algorithms permit reuse of data in the higher levels of a memory hierarchy, which makes them run closer to peak machine speed.

We present a parallel version of one of the blocked variants, denoted p_blockHT, based on Householder reflections and the compact WY representation [9] of the Householder matrices in [6]. The target architecture is a scalable MIMD computer system that can embed any rectangular (P_r × P_c) 2D processor grid and that can emulate the corresponding communication topology efficiently. The matrices A and B are partitioned and distributed over the P_r × P_c processor grid using a square block scattered decomposition [3]. A parallel algorithm that reduces just one matrix to block upper Hessenberg form is presented in [2].

In addition to the parallel implementation, we provide a hierarchical performance model [5] that predicts the execution time of the algorithm for different parameters. Examples of parameters in the model are the matrix block size and topology information of the logical architecture. The model is used as an algorithm analysis tool and for the selection of optimal block sizes and grid sizes.

To further reduce the block upper Hessenberg-triangular matrix pair to upper Hessenberg-triangular form we use Givens rotations. The elements below the subdiagonal of H are annihilated column by column until H obtains the desired form. All rotations
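To illustrate the kind of Givens-based reduction used in this final stage, and the elementwise algorithm of [7] that DGGHRD implements, the following is a minimal dense NumPy sketch of the classical unblocked Hessenberg-triangular reduction. It is only a small-matrix reference version with illustrative function names, not the blocked or parallel implementation discussed above: each left rotation annihilates an element of H below the subdiagonal and the fill-in it creates in T is removed by a right rotation.

```python
import numpy as np

def givens(a, b):
    # c, s such that [[c, s], [-s, c]] @ [a, b]^T = [r, 0]^T
    r = np.hypot(a, b)
    return (1.0, 0.0) if r == 0.0 else (a / r, b / r)

def hessenberg_triangular(A, B):
    """Unblocked reduction of (A, B) to (H, T) with H = Q^T A Z upper
    Hessenberg and T = Q^T B Z upper triangular (Q, Z orthogonal)."""
    n = A.shape[0]
    H = np.array(A, dtype=float)
    T = np.array(B, dtype=float)
    Q, T = np.linalg.qr(T)                  # stage 0: triangularize B
    H = Q.T @ H
    Z = np.eye(n)
    for j in range(n - 2):                  # columns of H
        for i in range(n - 1, j + 1, -1):   # zero H[i, j] bottom-up
            # Left rotation on rows (i-1, i) annihilates H[i, j].
            c, s = givens(H[i - 1, j], H[i, j])
            G = np.array([[c, s], [-s, c]])
            H[i - 1:i + 1, j:] = G @ H[i - 1:i + 1, j:]
            T[i - 1:i + 1, i - 1:] = G @ T[i - 1:i + 1, i - 1:]
            Q[:, i - 1:i + 1] = Q[:, i - 1:i + 1] @ G.T
            # That rotation fills in T[i, i-1]; a right rotation on
            # columns (i-1, i) restores the triangular form of T.
            c, s = givens(T[i, i], T[i, i - 1])
            R = np.array([[c, s], [-s, c]])
            H[:, i - 1:i + 1] = H[:, i - 1:i + 1] @ R
            T[:i + 1, i - 1:i + 1] = T[:i + 1, i - 1:i + 1] @ R
            Z[:, i - 1:i + 1] = Z[:, i - 1:i + 1] @ R
    return H, T, Q, Z
```

For a random regular pair one can check that Q.T @ A @ Z reproduces H, that Q.T @ B @ Z reproduces T, and that H and T have the expected Hessenberg and triangular structure.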
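The compact WY representation [9] is what allows a group of Householder reflections to be applied with matrix-matrix (level-3) operations, which is the key to the cache reuse of the blocked variants. The sketch below assumes the standard form Q = I − V T V^T with V unit lower trapezoidal and T upper triangular, and uses the usual forward accumulation of T (as in LAPACK's DLARFT); the helper names and panel width are illustrative and not taken from the paper.

```python
import numpy as np

def householder(x):
    """Return v (v[0] == 1) and tau with (I - tau*v*v^T) @ x = alpha*e1."""
    v = np.array(x, dtype=float)
    normx = np.linalg.norm(v)
    if normx == 0.0:
        v[0] = 1.0
        return v, 0.0
    v[0] += np.copysign(normx, v[0])
    v /= v[0]
    return v, 2.0 / (v @ v)

def panel_wy(P):
    """Householder QR of an m x nb panel P. Returns V (unit lower trapezoidal)
    and T (nb x nb upper triangular) with Q = H_1 ... H_nb = I - V @ T @ V.T,
    plus R = Q.T @ P."""
    m, nb = P.shape
    R = np.array(P, dtype=float)
    V = np.zeros((m, nb))
    T = np.zeros((nb, nb))
    for j in range(nb):
        v, tau = householder(R[j:, j])
        V[j:, j] = v
        # Apply H_j to the remaining panel columns.
        R[j:, j:] -= tau * np.outer(v, v @ R[j:, j:])
        # Forward accumulation of the triangular factor T.
        T[:j, j] = -tau * (T[:j, :j] @ (V[:, :j].T @ V[:, j]))
        T[j, j] = tau
    return V, T, np.triu(R)

def apply_qt(V, T, C):
    """Blocked update Q^T @ C = C - V @ (T^T @ (V^T @ C)): two GEMM-like products."""
    return C - V @ (T.T @ (V.T @ C))
```

For a small random panel, np.eye(m) - V @ T @ V.T agrees (to rounding) with the explicit product of the individual reflectors, and apply_qt(V, T, P) matches R; the point is that the trailing-matrix update is cast as a few large matrix multiplies instead of many rank-1 updates.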
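The square block scattered decomposition [3] is the two-dimensional block-cyclic distribution: the matrices are split into nb × nb blocks and global block (i, j) is owned by processor (i mod P_r, j mod P_c). A small sketch of the owner mapping, with illustrative parameter names, is:

```python
def block_owner(i, j, Pr, Pc):
    """Grid coordinates of the processor owning global block (i, j) in a
    2D block-cyclic (square block scattered) distribution."""
    return (i % Pr, j % Pc)

def local_blocks(n, nb, Pr, Pc, p, q):
    """Global block indices mapped to processor (p, q) for an n x n matrix
    partitioned into nb x nb blocks."""
    nblk = (n + nb - 1) // nb          # number of block rows/columns
    return [(i, j) for i in range(p, nblk, Pr)
                   for j in range(q, nblk, Pc)]
```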
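Finally, for reference, the end result of the two reduction steps is the real generalized Schur form, which SciPy exposes through its QZ routine. The sketch below (using scipy.linalg.qz, not the software described here) shows how the 1 × 1 and 2 × 2 diagonal blocks of (S, T) yield the real eigenvalues and the complex conjugate pairs of Ax = λBx.

```python
import numpy as np
from scipy.linalg import qz, eigvals

rng = np.random.default_rng(0)
n = 6
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))

# Real generalized Schur form: A = Q @ S @ Z.T, B = Q @ T @ Z.T,
# with S upper quasi-triangular and T upper triangular.
S, T, Q, Z = qz(A, B, output='real')

eigs, i = [], 0
while i < n:
    if i == n - 1 or abs(S[i + 1, i]) < 1e-12 * (abs(S[i, i]) + abs(S[i + 1, i + 1])):
        eigs.append(S[i, i] / T[i, i])      # 1 x 1 block: real eigenvalue
        i += 1
    else:
        # 2 x 2 block: a complex conjugate pair of eigenvalues of S - lambda*T
        eigs.extend(eigvals(S[i:i + 2, i:i + 2], T[i:i + 2, i:i + 2]))
        i += 2

# Cross-check against the generalized eigenvalues of (A, B); expect True.
print(np.allclose(np.sort_complex(np.asarray(eigs)),
                  np.sort_complex(eigvals(A, B))))
```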