Communication-Computation Overlapping for Preconditioned Parallel Iterative Solvers with Dynamic Loop Scheduling

Preconditioned parallel solvers based on Krylov iterative methods are widely used in scientific and engineering applications. Communication overhead is a critical issue when these solvers run on large-scale massively parallel supercomputers. In our previous work, we introduced communication-computation overlapping with OpenMP dynamic loop scheduling into the sparse matrix-vector multiplication (SpMV) of a parallel Conjugate Gradient (CG) solver in a parallel finite element application (GeoFEM/Cube) on multicore and manycore clusters. In the present work, we first re-evaluated the method on our new system, Wisteria/BDEC-01 (Odyssey), a Fujitsu PRIMEHPC FX1000 with A64FX processors, and obtained a performance improvement of 25-30% for the parallel iterative solver at 2,048 nodes (98,304 cores). We then proposed a new reordering method for communication-computation overlapping in ICCG solvers (CG preconditioned by incomplete Cholesky factorization) for a parallel finite volume application (Poisson3D/Dist), and attained a 5-12% improvement at 1,024 nodes of Odyssey.
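The overlapping idea for SpMV can be illustrated with a minimal sketch. It is not the GeoFEM/Cube or Poisson3D/Dist implementation; it assumes a 1D stencil matrix (-1, 2, -1), replaces the MPI halo exchange with a plain assignment, and emulates OpenMP's `schedule(dynamic)` with a shared queue of row chunks. The function name `spmv_overlap` and all parameters are hypothetical. Interior rows do not read halo values, so worker threads can start on them immediately, while the "master" thread first performs the exchange and then picks up whatever chunks are left. Boundary rows, which need the received halo values, are computed last.

```python
import queue
import threading


def spmv_overlap(x_local, left_halo, right_halo, n_threads=4, chunk=64):
    """Compute y = A x for the stencil (-1, 2, -1) on one subdomain (n >= 2).

    left_halo / right_halo stand in for values a real code would receive
    from neighbouring MPI processes.
    """
    n = len(x_local)
    x_ext = [0.0] + list(x_local) + [0.0]  # one halo cell at each end
    y = [0.0] * n

    # Interior rows 1..n-2 never touch the halo cells; split them into chunks.
    chunks = queue.Queue()
    for start in range(1, n - 1, chunk):
        chunks.put((start, min(start + chunk, n - 1)))

    def compute_chunks():
        # Dynamic scheduling: each thread grabs the next free chunk.
        while True:
            try:
                lo, hi = chunks.get_nowait()
            except queue.Empty:
                return
            for i in range(lo, hi):
                y[i] = -x_ext[i] + 2.0 * x_ext[i + 1] - x_ext[i + 2]

    def master():
        # Stand-in for the halo exchange (assumption: no real MPI here).
        x_ext[0] = left_halo
        x_ext[n + 1] = right_halo
        # After "communication", the master joins the interior computation.
        compute_chunks()

    threads = [threading.Thread(target=master)]
    threads += [threading.Thread(target=compute_chunks)
                for _ in range(n_threads - 1)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # plays the role of the barrier after the parallel loop

    # Boundary rows need the halo values, so they run after the exchange.
    y[0] = -x_ext[0] + 2.0 * x_ext[1] - x_ext[2]
    y[n - 1] = -x_ext[n - 1] + 2.0 * x_ext[n] - x_ext[n + 1]
    return y
```

With static scheduling, the thread that handles communication would still own a fixed share of the rows and could delay the others; with dynamic chunks, the remaining threads absorb that share while the exchange is in flight. Python threads do not run this arithmetic in parallel, so the sketch shows the scheduling logic only, not the speedup.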
