Reducing latency cost in 2D sparse matrix partitioning models

Abstract: Sparse matrix partitioning is a common technique for improving the performance of parallel iterative linear solvers. Compared to solvers for symmetric linear systems, solvers for nonsymmetric systems offer more potential for addressing multiple communication metrics, because they allow different partitions to be adopted on the input and output vectors of sparse matrix-vector multiplication operations. In this regard, there exist works based on one-dimensional (1D) and two-dimensional (2D) fine-grain partitioning models that effectively address both bandwidth and latency costs in nonsymmetric solvers. In this work, we propose two new models based on 2D checkerboard and jagged partitioning. These models aim to minimize the total message count while maintaining balance in the communication volume loads of processors; hence, they address both bandwidth and latency costs. We evaluate all partitioning models on two nonsymmetric system solvers implemented using the widely adopted PETSc toolkit, and we conduct extensive experiments with these solvers on a modern system (a BlueGene/Q machine), successfully scaling them up to 8K processors. Along with the proposed models, we thoroughly analyze the practical aspects of the eight evaluated models (two 1D-based and six 2D-based). To the best of our knowledge, this is the first work that analyzes the practical performance of 2D models at this scale. Among the evaluated models, those that rely on 2D jagged partitioning obtain the most promising results by striking a balance between minimizing bandwidth and latency costs.
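The two communication metrics named in the abstract can be made concrete with a small sketch. The Python code below is not the paper's partitioning model; it only counts, for a given partition of y = Ax, the total communication volume in words (a proxy for bandwidth cost) and the total number of messages (a proxy for latency cost). The function name `spmv_comm_stats`, the 2x2 checkerboard layout, the vector ownership rule, and the sample matrix are all hypothetical choices made for illustration.

```python
from collections import defaultdict

def spmv_comm_stats(nonzeros, nz_owner, x_owner, y_owner):
    """Return (total_volume, total_messages, send_volume_per_proc) for y = A x.

    nonzeros: iterable of (i, j) positions of nonzero entries
    nz_owner: dict mapping (i, j) to the processor that stores that nonzero
    x_owner:  dict mapping column j to the processor that owns x_j
    y_owner:  dict mapping row i to the processor that owns y_i
    """
    expand = set()  # (sender, receiver, j): x_j sent before the local multiply
    fold = set()    # (sender, receiver, i): partial y_i sent after the multiply
    for (i, j) in nonzeros:
        p = nz_owner[(i, j)]
        if x_owner[j] != p:
            expand.add((x_owner[j], p, j))
        if y_owner[i] != p:
            fold.add((p, y_owner[i], i))

    send_volume = defaultdict(int)
    messages = 0
    for phase in (expand, fold):
        pairs = set()  # distinct (sender, receiver) pairs form one message each
        for (src, dst, _) in phase:
            send_volume[src] += 1
            pairs.add((src, dst))
        messages += len(pairs)
    return len(expand) + len(fold), messages, dict(send_volume)

# Hypothetical 4x4 example on a 2x2 processor mesh (checkerboard layout).
# Processor id = 2 * row_block + col_block, with blocks of size 2.
nonzeros = [(0, 0), (0, 2), (1, 1), (1, 3), (2, 0), (2, 3), (3, 3)]
nz_owner = {(i, j): 2 * (i // 2) + (j // 2) for (i, j) in nonzeros}
# Assumption: x_k and y_k are both owned by the processor of diagonal block k // 2.
x_owner = {k: 3 * (k // 2) for k in range(4)}
y_owner = {k: 3 * (k // 2) for k in range(4)}

volume, messages, send_volume = spmv_comm_stats(nonzeros, nz_owner, x_owner, y_owner)
print(volume, messages, send_volume)
```

In this toy case, six words are moved in four messages. A partitioner that targets only volume could leave the message count high, and one that targets only message count could concentrate volume on a few processors, which is why the abstract treats them as separate objectives.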
