Solving planar systems of equations on distributed-memory multiprocessors

The advent of VLSI has made extremely powerful, cost-effective parallel computing systems practical. This has been accompanied by a tremendous increase in the demand for computing power, as ever more complicated phenomena are studied by numerical techniques. Unfortunately, the software developed over the last thirty years for solving these problems is geared toward sequential or vector machines and is not directly applicable to the emerging highly concurrent computers. To address this problem, both direct and iterative sparse solvers have been developed for large-scale, message-passing multiprocessors.

A new distributed multifrontal (DMF) algorithm for sparse Gaussian elimination on parallel computers is presented. This method uses the nested dissection reordering heuristic to extract separators from the graph of the matrix, thereby partitioning the matrix into disjoint blocks that can be allocated to the processors. Symbolic decomposition of the resulting matrix is shown to proceed completely independently on each processor. The number of messages exchanged during the sparse matrix factorization is limited to a function of the separator lengths, and the DMF sparse solver achieves parallel efficiencies of over 70%. To address the computational bottleneck in an application program, the DMF sparse solver is coupled with a perfectly parallel sparse matrix assembly operation and embedded in PISCES, a 2-D device simulator developed at Stanford. An overall speedup of 8.1 is demonstrated on a 16-node hypercube multiprocessor.

Symmetric positive definite systems can often be solved faster by iterative techniques. A parallel implementation of one of the most successful such methods, the Incomplete Cholesky preconditioned Conjugate Gradient (ICCG) algorithm, is therefore also presented. It likewise uses a nested dissection reordering and is designed to limit the volume of interprocessor communication.
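The nested dissection idea shared by both solvers can be illustrated with a minimal Python sketch (an illustration only, not the dissertation's implementation): on a one-dimensional chain graph, the middle vertex of each segment is a separator whose removal disconnects the two halves, so the variables on either side can be eliminated independently on different processors, with the separator variables ordered last.

```python
def nested_dissection_order(lo, hi):
    """Elimination order for vertices lo..hi of a chain graph.

    The middle vertex is a separator: removing it disconnects the
    two halves, so their variables can be eliminated independently
    (e.g. on different processors); separator variables come last.
    """
    if lo > hi:
        return []
    if lo == hi:
        return [lo]
    mid = (lo + hi) // 2
    left = nested_dissection_order(lo, mid - 1)    # independent subproblem
    right = nested_dissection_order(mid + 1, hi)   # independent subproblem
    return left + right + [mid]                    # separator eliminated last

# Seven-vertex chain: halves {0,1,2} and {4,5,6} around separator vertex 3
print(nested_dissection_order(0, 6))  # [0, 2, 1, 4, 6, 5, 3]
```

Because the two halves share no edges, factoring them generates no messages; only the separator variables, eliminated last, require interprocessor communication — which is why the message count scales with separator length.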
The ICCG sparse solver, which incorporates efficient triangular solution, matrix-vector multiplication, and inner product routines, has demonstrated parallel efficiencies of over 80%. This algorithm is employed by a 3-D Poisson solver that achieves speedups of 9.8 on a 16-processor hypercube.
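As a hedged sequential sketch of the ICCG iteration itself (not the parallel implementation described above), the method combines an incomplete Cholesky factorization with zero fill, IC(0), with a preconditioned conjugate gradient loop; the function names and the 1-D Poisson test matrix below are illustrative assumptions.

```python
import numpy as np

def ic0(A):
    """Incomplete Cholesky with zero fill: L with L @ L.T ~ A,
    keeping nonzeros only where the lower triangle of A has them."""
    n = A.shape[0]
    L = np.tril(A).astype(float)
    for k in range(n):
        L[k, k] = np.sqrt(L[k, k])
        for i in range(k + 1, n):
            if L[i, k] != 0.0:
                L[i, k] /= L[k, k]
        for j in range(k + 1, n):
            for i in range(j, n):
                if L[i, j] != 0.0:          # drop fill outside A's pattern
                    L[i, j] -= L[i, k] * L[j, k]
    return L

def pcg(A, b, L, tol=1e-10, max_iter=200):
    """Conjugate gradient preconditioned by M = L @ L.T."""
    x = np.zeros_like(b, dtype=float)
    r = b - A @ x
    z = np.linalg.solve(L.T, np.linalg.solve(L, r))  # two triangular solves
    p = z.copy()
    rz = r @ z
    for _ in range(max_iter):
        Ap = A @ p                                   # matrix-vector product
        alpha = rz / (p @ Ap)                        # inner products
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            break
        z = np.linalg.solve(L.T, np.linalg.solve(L, r))
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x

# Illustrative SPD test system: tridiagonal 1-D Poisson matrix
n = 8
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)
x = pcg(A, b, ic0(A))
print(np.allclose(A @ x, b))  # True
```

The three kernels visible here — the triangular solves of the preconditioner application, the matrix-vector product, and the inner products — are exactly the routines the parallel solver must distribute, which is why limiting communication in each of them governs the overall efficiency.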