Optimization of recurrence computations on vector and parallel computers

The aim of this monograph is to present the author's contribution to the design of vector and parallel algorithms for solving problems based on linear recurrence computations, and to the optimization of such software for modern vector and parallel computer architectures. In the first chapter, we give a concise overview of the fundamentals of modern computer architectures, performance analysis, and methods of vector and parallel programming. We also define the problem of solving linear recurrence systems and present some basic results which can be found in the literature.

Chapter 2 describes the use of the Level 1 BLAS operation AXPY as the key to efficient vectorization of m-th order linear recurrence systems with constant coefficients. Applying the Hockney-Jesshope model of vector computations, we present a performance analysis of the algorithm and also consider the influence of memory bank conflicts. The theoretical analysis is supported by experimental results collected on two vector supercomputers manufactured by Cray.
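To make the problem concrete: an m-th order linear recurrence system with constant coefficients computes x_k = f_k + a_1 x_{k-1} + ... + a_m x_{k-m} for k = 1, ..., n, with x_k = 0 for k <= 0. The C sketch below is a minimal illustration of how AXPY enters the picture, assuming this standard formulation; the function names and the simple column-sweep variant are ours and should not be read as the blocked algorithm analyzed in Chapter 2.

```c
#include <stddef.h>
#include <cblas.h>

/* Scalar baseline: x[k] = f[k] + sum_{j=1..m} a[j-1] * x[k-j],
   with x[k] treated as zero for k < 0. Inherently sequential. */
void recurrence_scalar(size_t n, size_t m, const double *a,
                       const double *f, double *x)
{
    for (size_t k = 0; k < n; ++k) {
        double s = f[k];
        for (size_t j = 1; j <= m && j <= k; ++j)
            s += a[j - 1] * x[k - j];
        x[k] = s;
    }
}

/* Column-sweep variant: once x[k] is final, its contribution
   a[j-1] * x[k] to the next min(m, n-1-k) equations is scattered
   with a single Level 1 BLAS AXPY call. */
void recurrence_axpy(size_t n, size_t m, const double *a,
                     const double *f, double *x)
{
    for (size_t k = 0; k < n; ++k)
        x[k] = f[k];
    for (size_t k = 0; k < n; ++k) {
        size_t len = (k + m < n) ? m : (n - 1 - k);
        if (len > 0)
            cblas_daxpy((int)len, x[k], a, 1, &x[k + 1], 1);
    }
}
```

Equivalently, the system can be viewed as a banded unit lower triangular Toeplitz system Lx = f, which connects it to the matrix formulations that appear in the later chapters.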
The aim of Chapter 3 is to present a new efficient BLAS-based algorithm for solving linear recurrence systems with constant coefficients which can be implemented easily and efficiently on shared memory machines. The algorithm is based on the Level 3 and Level 2 BLAS routines GEMM, GEMV and TRMV, which are crucial for its efficiency. The results of experiments performed on various shared memory computers are also presented and discussed. The algorithm can also be implemented efficiently on high performance message passing parallel and vector computers and on clusters of workstations. In Chapter 4 we analyze its performance using two well known models: BSP and Hockney-Jesshope. Finally, we present the results of experiments performed on a Cray X1 and two different clusters running under Linux.

The aim of Chapters 5, 6 and 7 is to show that the introduced algorithms for solving linear recurrence systems can be applied to a number of problems which arise in scientific computing. In Chapter 5 we introduce a new BLAS-based algorithm for narrow-banded triangular Toeplitz matrix-vector multiplication and show how to evaluate linear recursive filters efficiently on distributed memory parallel computers. We apply the BSP model of parallel computing to predict the behavior of the algorithm and to find the optimal values of the method's parameters. The results of experiments performed on a cluster of twelve dual-processor Itanium 2 computers and on a Cray X1 are also presented and discussed. The algorithm utilizes up to 30% of the peak performance of 24 Itanium processors, while a simple scalar algorithm utilizes only about 4% of the peak performance of a single processor. Next, we show that the performance of the algorithm for evaluating linear recursive filters can be further increased by using the new generalized data structures for dense matrices introduced by F. G. Gustavson. The results of experiments performed on an Intel Itanium 2, a Cray X1 and a dual-processor quad-core Xeon are also presented and discussed.

In Chapter 6 we introduce a new high performance divide and conquer algorithm for computing trigonometric sums, which can be applied to improve the performance of Talbot's method for the numerical inversion of the Laplace transform on modern computer architectures, including shared memory parallel computers. We also show how to vectorize the first stage of Talbot's method, namely the computation of all coefficients of the trigonometric sums used by the method. Numerical tests show that the improved method gives the same accuracy as the standard algorithm while allowing the use of parallel processors. In Chapter 7 we show how to apply our algorithms to polynomial evaluation.

Finally, Chapter 8 presents a new data distribution of triangular matrices that provides an even distribution of blocks among processes and reduces wasted memory in comparison with the standard block-cyclic data layout used for dense matrix computations in the ScaLAPACK library. A new algorithm for solving triangular systems of linear equations is also introduced. The results of experiments performed on a cluster of Itanium 2 processors and on a Cray X1 show that in some cases the new method is faster than the corresponding PBLAS routines PSTRSV and PSTRSM.
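For context on the comparison made in Chapter 8: under the standard block-cyclic layout, a global index is mapped to its owning process and to a local index as in the following minimal 0-based C sketch. The helper name map_block_cyclic is ours; the convention mirrors the usual ScaLAPACK mapping.

```c
#include <stdio.h>

/* Standard 1-D block-cyclic mapping over one process dimension:
   global index i, block size nb, P processes, src = coordinate of
   the process that owns the first block. */
static void map_block_cyclic(int i, int nb, int P, int src,
                             int *proc, int *local)
{
    int gb = i / nb;                  /* global block index        */
    *proc  = (src + gb) % P;          /* owning process coordinate */
    *local = (gb / P) * nb + i % nb;  /* local index on that proc  */
}

int main(void)
{
    int p, l;
    /* Where does global row 10 live with nb = 4 over P = 3? */
    map_block_cyclic(10, 4, 3, 0, &p, &l);
    printf("process %d, local index %d\n", p, l); /* process 2, local 2 */
    return 0;
}
```

Applied independently to rows and columns, this gives the 2-D block-cyclic layout of ScaLAPACK; for a triangular matrix it assigns different amounts of nonzero data to different processes, which is the imbalance and memory waste that the distribution proposed in Chapter 8 is designed to reduce.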