Impact of Hierarchical Memory Systems On Linear Algebra Algorithm Design

Linear algebra algorithms based on the BLAS or extended BLAS do not achieve high performance on multivector processors with a hierarchical memory system because of a lack of data locality. For such machines, block linear algebra algorithms must be implemented in terms of matrix-matrix primitives (BLAS3). Designing efficient linear algebra algorithms for these architectures requires analysis of the behavior of the matrix-matrix primitives and the resulting block algorithms as a function of certain system parameters. The analysis must identify the limits of performance improvement possible via blocking and any contradictory trends that require trade-off consideration. We propose a methodology that facilitates such an analysis and use it to analyze the performance of the BLAS3 primitives used in block methods. A similar analysis of the block size-performance relationship is also performed at the algorithm level for block versions of the LU decomposition and the Gram-Schmidt orthogonalization procedures.
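As a rough illustration of the blocking idea (not the implementation analyzed in the paper), the sketch below computes a matrix-matrix product tile by tile, so that each small tile of the operands is reused many times while it would be resident in fast memory. The function name `blocked_matmul` and the `block` parameter are hypothetical names chosen for this example. Pure Python is used only to show the loop structure and does not demonstrate any performance gain.

```python
def blocked_matmul(A, B, block=2):
    """Compute C = A * B by iterating over block-by-block tiles.

    A is n x k and B is k x m, both given as lists of rows.
    `block` is the tile edge length (an illustrative parameter).
    """
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    # Outer three loops pick a tile of C and the matching tiles of A and B.
    for i0 in range(0, n, block):
        for j0 in range(0, m, block):
            for p0 in range(0, k, block):
                # Inner three loops do a small matrix-matrix product on the
                # tiles; min() handles edge tiles when sizes are not
                # multiples of `block`.
                for i in range(i0, min(i0 + block, n)):
                    for p in range(p0, min(p0 + block, k)):
                        a = A[i][p]
                        for j in range(j0, min(j0 + block, m)):
                            C[i][j] += a * B[p][j]
    return C


def naive_matmul(A, B):
    """Reference product with no tiling, used to check the blocked version."""
    return [[sum(A[i][p] * B[p][j] for p in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]
```

The tile size plays the role of the block size discussed above: larger tiles increase reuse per element loaded, but only while the tiles still fit in the fast level of the memory hierarchy.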
