Prototyping Parallel LAPACK using Block-Cyclic Distributed BLAS

Given an implementation of the Distributed Level 3 BLAS kernels, the parallelization of dense linear algebra libraries such as LAPACK can be achieved with relatively little effort. In this paper, we briefly describe the implementation and performance on the AP1000 of the Distributed Level 3 BLAS for the rectangular r × s block-cyclic matrix distribution. We then describe the parallelization of the central matrix factorization and tridiagonal reduction routines from LAPACK, where the algorithmic `blocking factor' w can be chosen independently of the matrix distribution block size r. For scalar-based MIMD parallel processors with relatively low communication startup costs, such as the AP1000, it is found that the optimal r and w generally satisfy w >> r with r = 1, differing from results published for vector-based parallel processors.
