Orthogonalization on a General Purpose Graphics Processing Unit with Double Double and Quad Double Arithmetic

Our problem is to accurately solve linear system son a general purpose graphics processing unit with double double and quad double arithmetic. The linear systems originate from the application of Newton's method on polynomial systems. Newton's method is applied as a corrector in a path tracking method, so the linear systems are solved in sequence and not simultaneously. One solution path may require the solution of thousands of linear systems. In previous work we reported good speedups with our implementation to evaluate and differentiate polynomial systems on the NVIDIA Tesla C2050. Although the cost of evaluation and differentiation often dominates the cost of linear system solving in Newton's method, because of the limited bandwidth of the communication between CPU and GPU, we cannot afford to send the linear system to the CPU for solving during path tracking. Because of large degrees, the Jacobian matrix may contain extreme values, requiring extended precision, leading to a significant overhead. This overhead of multiprecision arithmetic is our main motivation to develop a massively parallel algorithm. To allow over determined linear systems we solve linear systems in the least squares sense, computing the QR decomposition of the matrix by the modified Gram-Schmidt algorithm. We describe our implementation of the modified Gram-Schmidt orthogonalization method using double double and quad double arithmetic for GPUs. Our experimental results on the NVIDIA Tesla C2050 and K20Cshow that the achieved speedups are sufficiently high to compensate for the overhead of one extra level of precision.

[1]  Douglas M. Priest On properties of floating point arithmetics: numerical stability and the cost of accurate computations , 1992 .

[2]  Jie Cheng,et al.  Programming Massively Parallel Processors. A Hands-on Approach , 2010, Scalable Comput. Pract. Exp..

[3]  Jan Verschelde,et al.  Polynomial homotopies on multicore workstations , 2010, PASCO.

[4]  Dianne P. O'Leary,et al.  Parallel QR factorization by Householder and modified Gram-Schmidt algorithms , 1990, Parallel Comput..

[5]  Daisuke Takahashi,et al.  Implementation and Evaluation of Triple Precision BLAS Subroutines on GPUs , 2012, 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum.

[6]  Marc Moreno Maza,et al.  Solving Bivariate Polynomial Systems on a GPU , 2011, ACCA.

[7]  Tien-Yien Li Numerical Solution of Polynomial Systems by Homotopy Continuation Methods , 2003 .

[8]  Mitsuhisa Sato,et al.  Efficient Parallel Implementation of Classical Gram-Schmidt Orthogonalization Using Matrix Multiplication , 2006 .

[9]  Mark A. Richards,et al.  QR decomposition on GPUs , 2009, GPGPU-2.

[10]  Josef Stoer,et al.  Numerische Mathematik 1 , 1989 .

[11]  Jan Verschelde,et al.  Quality Up in Polynomial Homotopy Continuation by Multithreaded Path Tracking , 2011, ArXiv.

[12]  Masha Sosonkina,et al.  Algorithm 857: POLSYS_GLP—a parallel general linear product homotopy code for solving polynomial systems of equations , 2006, TOMS.

[13]  T. J. Dekker,et al.  A floating-point technique for extending the available precision , 1971 .

[14]  E. Allgower,et al.  Introduction to Numerical Continuation Methods , 1987 .

[15]  Xiaoye S. Li,et al.  Algorithms for quad-double precision floating point arithmetic , 2000, Proceedings 15th IEEE Symposium on Computer Arithmetic. ARITH-15 2001.

[16]  Michael Schwind,et al.  Comparison of Different Parallel Modified Gram-Schmidt Algorithms , 2005, Euro-Par.

[17]  Jack J. Dongarra,et al.  Dense linear algebra solvers for multicore with GPU accelerators , 2010, 2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW).

[18]  Bingsheng He,et al.  Supporting extended precision on graphics processors , 2010, DaMoN '10.

[19]  Selim G. Akl Superlinear Performance in Real-Time Parallel Computation , 2004, The Journal of Supercomputing.

[20]  Jan Verschelde,et al.  Algorithm 795: PHCpack: a general-purpose solver for polynomial systems by homotopy continuation , 1999, TOMS.

[21]  Jan Verschelde,et al.  Evaluating Polynomials in Several Variables and their Derivatives on a GPU Computing Processor , 2012, 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum.

[22]  L. Trefethen,et al.  Average-case stability of Gaussian elimination , 1990 .

[23]  James Demmel,et al.  Communication-Avoiding QR Decomposition for GPUs , 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.

[24]  Nicholas J. Higham,et al.  INVERSE PROBLEMS NEWSLETTER , 1991 .

[25]  F. J. Lingen Efficient Gram–Schmidt orthonormalisation on parallel computers , 2000 .

[26]  Joanna Porter-Sobieraj,et al.  Solving systems of polynomial equations on a GPU , 2012, 2012 Federated Conference on Computer Science and Information Systems (FedCSIS).

[27]  Kurt Keutzer,et al.  A Predictive Model for Solving Small Linear Algebra Problems in GPU Registers , 2012, 2012 IEEE 26th International Parallel and Distributed Processing Symposium.

[28]  Robert H. Halstead,et al.  Matrix Computations , 2011, Encyclopedia of Parallel Computing.