Asynchronous and multithreaded communications on irregular applications using vectorized divide and conquer approach

Abstract: The evolution of hardware architectures, driven by the increasing demand for performance and energy efficiency, has led to complex HPC systems. In the context of Finite Element Methods, exposing massive parallelism on unstructured mesh computations with efficient load balancing and minimal synchronization is challenging. Several parallelization strategies have to be combined to exploit the multiple levels of parallelism. We propose several contributions aimed at handling irregular codes and data structures efficiently. We have developed a hybrid parallelization approach based on the Divide & Conquer (DC) [...] it achieves an excellent parallel efficiency of 96% and up to 6.56× speedup compared to pure MPI.
