Cholesky factorization of large sparse matrices is an extremely important computation, arising in a wide range of domains including linear programming, finite element analysis, and circuit simulation. This thesis investigates the issues crucial to obtaining high performance for this computation on sequential and parallel machines with hierarchical memory systems. The thesis begins by providing the first thorough analysis of the interaction between sequential sparse Cholesky factorization methods and memory hierarchies. We examine popular existing methods and find that they make relatively poor use of the memory hierarchy. The methods are then extended, using blocking techniques, to reuse data in the fast levels of the memory hierarchy. This increased reuse is shown to provide a three-fold speedup over popular existing approaches (e.g., SPARSPAK) on modern workstations.
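To make the blocking idea concrete, here is a minimal sketch (not the thesis code; all names, the column-major layout, and the fixed block width are illustrative assumptions) of the blocked dense Cholesky kernel that underlies such supernodal/blocked sparse methods. The matrix is processed in B-column panels so that each panel, once factored, is reused across the entire trailing update while it is still resident in the fast levels of the memory hierarchy.

```c
/* Blocked right-looking Cholesky sketch: A = L * L^T, in place.
 * Hypothetical demonstration code; compile with -lm. */
#include <math.h>
#include <stdio.h>

#define N 8                      /* matrix order (small, for demonstration) */
#define B 4                      /* panel width; in practice tuned to cache */
#define A(i,j) a[(j)*N + (i)]    /* column-major indexing                   */

static void blocked_cholesky(double *a)
{
    for (int k = 0; k < N; k += B) {
        int kb = (k + B < N) ? B : N - k;      /* width of this panel */

        /* 1. Factor the panel (columns k .. k+kb-1) down its full height. */
        for (int j = k; j < k + kb; j++) {
            for (int p = k; p < j; p++)        /* updates from panel columns */
                for (int i = j; i < N; i++)
                    A(i,j) -= A(i,p) * A(j,p);
            double d = sqrt(A(j,j));
            for (int i = j; i < N; i++)        /* scale; A(j,j) becomes d    */
                A(i,j) /= d;
        }

        /* 2. Update the trailing submatrix with the factored panel.
         *    Each panel column is reused against every remaining column,
         *    which is the source of the cache reuse. */
        for (int j = k + kb; j < N; j++)
            for (int p = k; p < k + kb; p++)
                for (int i = j; i < N; i++)
                    A(i,j) -= A(i,p) * A(j,p);
    }
}

int main(void)
{
    double a[N * N];
    /* Small symmetric positive definite test matrix: A = N*I + ones. */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            A(i,j) = (i == j) ? N + 1.0 : 1.0;

    blocked_cholesky(a);

    for (int i = 0; i < N; i++) {              /* print the factor L */
        for (int j = 0; j <= i; j++)
            printf("%8.4f ", A(i,j));
        printf("\n");
    }
    return 0;
}
```

In a sparse supernodal factorization, the same dense kernel is applied to supernodes rather than to the full matrix, but the reuse argument is identical: arithmetic per panel grows with the panel width while the working set stays cache-sized.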
The thesis then considers the use of blocking techniques in parallel sparse factorization. We first describe parallel methods we have developed that are natural extensions of the sequential approach described above. These methods distribute panels (sets of contiguous columns with nearly identical non-zero structures) among the processors. The thesis shows that on small parallel machines the resulting methods again yield substantial performance improvements over existing approaches. A framework is provided for understanding both the performance of these methods and their inherent limitations. Using this framework, the thesis shows that panel methods are inappropriate for large-scale parallel machines because they do not expose enough concurrency. The thesis then considers rectangular block methods, in which the sparse matrix is split both vertically and horizontally. These methods address the concurrency problems of panel methods, but they introduce complications of their own, chief among them choosing blocks that can be manipulated efficiently and structuring the parallel computation in terms of those blocks. The thesis describes solutions to these problems and presents performance results from an efficient block-method implementation. The contributions of this work come both from its theoretical foundation for understanding the factors that limit the scalability of panel- and block-oriented methods on hierarchical memory multiprocessors, and from its investigation of practical issues in implementing efficient parallel factorization methods.
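The concurrency difference between panel and rectangular block distributions can be illustrated with a small sketch of the ownership computations involved (hypothetical names and mappings, not the thesis implementation). Under a 1D panel distribution the owner of a block depends only on its column, so at most one processor can hold any given block column; under a 2D cyclic map over a Pr-by-Pc processor grid, ownership depends on both indices and an entire row of processors can cooperate on each block column.

```c
/* Sketch of 1D panel vs. 2D block-cyclic ownership (illustrative). */
#include <stdio.h>

#define PR 2    /* processor grid rows    */
#define PC 2    /* processor grid columns */

/* 1D panel distribution: owner depends only on the block column. */
static int panel_owner(int J)        { return J % (PR * PC); }

/* 2D block-cyclic distribution: owner depends on both indices. */
static int block_owner(int I, int J) { return (I % PR) * PC + (J % PC); }

int main(void)
{
    int nblocks = 4;   /* a 4x4 grid of blocks; lower triangle only */
    printf("block (I,J): panel-owner  2D-owner\n");
    for (int I = 0; I < nblocks; I++)
        for (int J = 0; J <= I; J++)
            printf("  (%d,%d):        %d           %d\n",
                   I, J, panel_owner(J), block_owner(I, J));
    return 0;
}
```

Running the sketch shows every block in a column mapped to a single processor under the panel scheme, while the 2D map spreads each block column over PR processors, which is the extra concurrency that block methods buy at the cost of more complicated block selection and scheduling.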