论文信息 - Efficient Shared-Memory Implementation of High-Performance Conjugate Gradient Benchmark and its Application to Unstructured Matrices

Efficient Shared-Memory Implementation of High-Performance Conjugate Gradient Benchmark and its Application to Unstructured Matrices

A new sparse high performance conjugate gradient benchmark (HPCG) has been recently released to address challenges in the design of sparse linear solvers for the next generation extreme-scale computing systems. Key computation, data access, and communication pattern in HPCG represent building blocks commonly found in today's HPC applications. While it is a well known challenge to efficiently parallelize Gauss-Seidel smoother, the most time-consuming kernel in HPCG, our algorithmic and architecture-aware optimizations deliver 95% and 68% of the achievable bandwidth on Xeon and Xeon Phi, respectively. Based on available parallelism, our Xeon Phi shared-memory implementation of Gauss-Seidel smoother selectively applies block multi-color reordering. Combined with MPI parallelization, our implementation balances parallelism, data access locality, CG convergence rate, and communication overhead. Our implementation achieved 580 TFLOPS (82% parallelization efficiency) on Tianhe-2 system, ranking first on the most recent HPCG list in July 2014. In addition, we demonstrate that our optimizations not only benefit HPCG original dataset, which is based on structured 3D grid, but also a wide range of unstructured matrices.

[1] Alex Pothen,et al. New Multithreaded Ordering and Coloring Algorithms for Multicore Architectures , 2011, Euro-Par.

[2] Luke N. Olson,et al. Exposing Fine-Grained Parallelism in Algebraic Multigrid Methods , 2012, SIAM J. Sci. Comput..

[3] Arutyun Avetisyan,et al. Automatically Tuning Sparse Matrix-Vector Multiplication for GPU Architectures , 2010, HiPEAC.

[4] Rudolf Eigenmann,et al. Adaptive runtime tuning of parallel sparse matrix-vector multiplication on distributed memory systems , 2008, ICS '08.

[5] George Karypis,et al. Multi-threaded Graph Partitioning , 2013, 2013 IEEE 27th International Symposium on Parallel and Distributed Processing.

[6] Erik Hagersten,et al. Multigrid and Gauss-Seidel smoothers revisited: parallelization on chip multiprocessors , 2006, ICS '06.

[7] GrinspunEitan,et al. Sparse matrix solvers on the GPU , 2003 .

[8] Joel H. Saltz,et al. Aggregation Methods for Solving Sparse Triangular Systems on Multiprocessors , 1990, SIAM J. Sci. Comput..

[9] Sandia Report,et al. Toward a New Metric for Ranking High Performance Computing Systems , 2013 .

[10] Victor Eijkhout,et al. An iterative solver benchmark , 2001, Sci. Program..

[11] Yousef Saad,et al. Iterative methods for sparse linear systems , 2003 .

[12] E. L. Poole,et al. Multicolor ICCG methods for vector computers , 1987 .

[13] Anoop Gupta,et al. Parallel ICCG on a hierarchical memory multiprocessor - Addressing the triangular solve bottleneck , 1990, Parallel Comput..

[14] Pradeep Dubey,et al. Sparsifying Synchronization for High-Performance Shared-Memory Sparse Triangular Solver , 2014, ISC.

[15] Alex Pothen,et al. ColPack: Software for graph coloring and related problems in scientific computing , 2013, TOMS.

[16] Yousef Saad,et al. GPU-accelerated preconditioned iterative linear solvers , 2013, The Journal of Supercomputing.

[17] Eitan Grinspun,et al. Sparse matrix solvers on the GPU: conjugate gradients and multigrid , 2003, SIGGRAPH Courses.

[18] LiRuipeng,et al. GPU-accelerated preconditioned iterative linear solvers , 2013 .

[19] Santa Clara,et al. Parallel Solution of Sparse Triangular Linear Systems in the Preconditioned Iterative Methods on the GPU , 2011 .

[20] Matthias S. Müller,et al. Memory Performance and Cache Coherency Effects on an Intel Nehalem Multiprocessor System , 2009, 2009 18th International Conference on Parallel Architectures and Compilation Techniques.

[21] Todd Gamblin,et al. Scaling Algebraic Multigrid Solvers: On the Road to Exascale , 2010, CHPC.

[22] Takeshi Iwashita,et al. Block Red-Black Ordering: A New Ordering Strategy for Parallelization of ICCG Method , 2004, International Journal of Parallel Programming.

[23] Alex Pothen,et al. What Color Is Your Jacobian? Graph Coloring for Computing Derivatives , 2005, SIAM Rev..

[24] Timothy A. Davis,et al. The university of Florida sparse matrix collection , 2011, TOMS.

[25] David H. Bailey,et al. The Nas Parallel Benchmarks , 1991, Int. J. High Perform. Comput. Appl..

[26] H. Elman,et al. Ordering techniques for the preconditioned conjugate gradient method on parallel computers , 1989 .

[27] Samuel Williams,et al. Optimization of geometric multigrid for emerging multi- and manycore processors , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.

[28] V. E. Henson,et al. BoomerAMG: a parallel algebraic multigrid solver and preconditioner , 2002 .

[29] Yousef Saad,et al. Solving Sparse Triangular Linear Systems on Parallel Computers , 1989, Int. J. High Speed Comput..

[30] Cevdet Aykanat,et al. Fast optimal load balancing algorithms for 1D partitioning , 2004, J. Parallel Distributed Comput..

[31] Hiroshi Nakashima,et al. Algebraic Block Multi-Color Ordering Method for Parallel Multi-Threaded Sparse Triangular Solver in ICCG Method , 2012, 2012 IEEE 26th International Parallel and Distributed Processing Symposium.

[32] Xing Liu,et al. Efficient sparse matrix-vector multiplication on x86-based many-core processors , 2013, ICS '13.

[33] Jan-Philipp Weiss,et al. Enhanced Parallel ILU(p)-based Preconditioners for Multi-core CPUs and GPUs -- The Power(q)-pattern Method , 2011 .

[34] Erik G. Boman,et al. Factors Impacting Performance of Multithreaded Sparse Triangular Solve , 2010, VECPAR.