Paper information - Solvers on advanced parallel architectures with application to partial differential equations and discrete optimisation

This thesis investigates techniques for solving partial differential equations (PDEs) on advanced parallel architectures comprising central processing units (CPUs) and graphics processing units (GPUs). Many physical phenomena studied by scientists and engineers are modelled with PDEs, which are often computationally expensive to solve; this is one of the main drivers of large-scale computing development. Many well-established PDE solvers exist, but they are often inherently sequential. Consequently, existing algorithms need to be redesigned, and new methods optimised for advanced parallel architectures need to be developed. This task is challenging because opportunities for parallelism must be identified and exploited, and communication overheads must be dealt with. Moreover, a wide range of parallel platforms is available, and interoperability issues arise when they are employed together.

This thesis offers several contributions. First, the performance characteristics of hybrid CPU-GPU platforms are analysed in detail in three case studies. Second, an optimised GPU implementation of the Preconditioned Conjugate Gradients (PCG) solver is presented. Third, a multi-GPU iterative solver, the Distributed Block Direct Solver (DBDS), is developed. Finally, and perhaps most significantly, an innovative streaming processing technique for FFT-based Poisson solvers is introduced. Each of these contributions offers significant insight into the application of advanced parallel systems in scientific computing.

The techniques introduced in the case studies hide most of the communication overhead on hybrid CPU-GPU platforms. The proposed PCG implementation achieves 50–68% of the theoretical GPU peak performance and is more than 50% faster than the state-of-the-art solution (the CUSP library). DBDS follows the Block Relaxation scheme to solve linear systems on hybrid CPU-GPU platforms; its convergence is analysed, and a procedure for computing a high-quality upper bound is derived. Thanks to the novel streaming processing technique, the FFT-based Poisson solvers are the first to handle problems larger than the GPU memory and to enable multi-GPU processing with linear speed-up. This is a significant improvement over existing methods, which are designed to run on a single GPU and are limited by the device memory size. The algorithm needs only 6.9 seconds to solve a 2D Poisson problem with 2.4 billion variables (9 GB) on two Tesla C2050 GPUs (3 GB of memory each).
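For readers unfamiliar with the PCG method named in the abstract, the following is a minimal CPU sketch of the textbook algorithm in pure Python. It is not the thesis's GPU implementation. The function names `pcg` and `poisson_1d_matvec`, the Jacobi (diagonal) preconditioner, and the 1D Poisson test problem are all assumptions chosen for illustration.

```python
def pcg(matvec, b, diag, tol=1e-10, max_iter=1000):
    """Jacobi-preconditioned conjugate gradients for an SPD system A x = b.

    matvec: function returning A @ v for a list v (illustrative interface).
    diag:   diagonal of A, used as the preconditioner M = diag(A).
    Returns the approximate solution and the number of iterations used.
    """
    x = [0.0] * len(b)
    r = list(b)                                   # r = b - A x with x = 0
    b_norm = sum(bi * bi for bi in b) ** 0.5
    if b_norm == 0.0:
        return x, 0                               # zero right-hand side: x = 0
    z = [ri / di for ri, di in zip(r, diag)]      # z = M^{-1} r
    p = list(z)                                   # first search direction
    rz = sum(ri * zi for ri, zi in zip(r, z))
    for iteration in range(1, max_iter + 1):
        Ap = matvec(p)
        alpha = rz / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        # Stop once the relative residual norm drops below the tolerance.
        if sum(ri * ri for ri in r) ** 0.5 <= tol * b_norm:
            return x, iteration
        z = [ri / di for ri, di in zip(r, diag)]
        rz_new = sum(ri * zi for ri, zi in zip(r, z))
        beta = rz_new / rz
        p = [zi + beta * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter


def poisson_1d_matvec(v):
    """Apply the 1D Poisson stencil [-1, 2, -1] with zero Dirichlet boundaries."""
    n = len(v)
    return [
        2.0 * v[i]
        - (v[i - 1] if i > 0 else 0.0)
        - (v[i + 1] if i < n - 1 else 0.0)
        for i in range(n)
    ]


# Illustrative test problem: 50 unknowns, constant right-hand side.
n = 50
b = [1.0] * n
x, iterations = pcg(poisson_1d_matvec, b, [2.0] * n)
residual = max(abs(bi - ai) for bi, ai in zip(b, poisson_1d_matvec(x)))
```

Because the diagonal of this test matrix is constant, the Jacobi preconditioner does not change the convergence rate here; it only marks where a stronger preconditioner would be applied. For this unscaled stencil the exact solution is x_i = i(n + 1 - i)/2 for i = 1, ..., n, which gives a simple correctness check.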

Michal Czapinski
