General-Purpose Sparse Matrix Building Blocks using the NVIDIA CUDA Technology Platform

We report on our experience with integrating graphics processing units (GPUs) as fast parallel floating-point co-processors to accelerate two fundamental computational science kernels: sparse direct factorization and nonlinear interior-point optimization. Since a full re-implementation of these complex kernels is typically not feasible, we identify operations such as matrix-matrix multiplication as a natural first entry point for a minimally invasive integration of GPUs. We investigate the performance on the NVIDIA GeForce 8800 multicore chip and exploit its architectural features to design an efficient GPU-parallel sparse matrix solver. A prototype approach that leverages the bandwidth and computing power of GPUs for these matrix kernel operations is demonstrated, resulting in an overall performance of over 110 GFlop/s on the desktop for large matrices. We use our GPU algorithm for PDE-constrained optimization problems and demonstrate that the commodity GPU is a useful co-processor for scientific applications.
