Scalable and efficient implementation of 3D unstructured meshes computation: a case study on matrix assembly

Exposing massive parallelism in 3D unstructured mesh computations with efficient load balancing and minimal synchronization is challenging. Current approaches relying on domain decomposition and mesh coloring struggle to scale with the increasing number of cores per node, especially on new many-core processors. In this paper, we propose a hybrid approach combining domain decomposition to exploit distributed memory parallelism, divide-and-conquer (D&C) to exploit shared memory parallelism and improve locality, and mesh coloring at the core level to exploit vector units. It illustrates a new trade-off for many-core processors between structuredness, memory locality, and vectorization. We evaluate our approach on the finite element matrix assembly of an industrial fluid dynamics code developed by Dassault Aviation, and compare D&C to domain decomposition and to mesh coloring. D&C achieves high parallel efficiency, good data locality, and improved bandwidth usage. On current nodes, it competes with the optimized pure MPI version, with at least a 10% speed-up. D&C also shows a 319x strong-scaling speed-up on 512 cores (32 nodes) with only 2,000 vertices per core. Finally, the Intel Xeon Phi version performs on par with 10 Intel Xeon E5-2665 Sandy Bridge cores and reaches 95% parallel efficiency on the 60 physical cores. Running on 4 Xeon Phi (240 cores), D&C reaches 92% efficiency on the physical cores and performs on par with 33 Intel Xeon E5-2665 Sandy Bridge cores.
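
The D&C shared-memory layer can be pictured as a recursive bisection of the element set: the two halves write to disjoint vertices and can be assembled concurrently, while the elements straddling the cut are deferred to a separator that is assembled once both halves complete. The C++/OpenMP sketch below illustrates this idea only; the types Element, GlobalSystem, DCNode and the kernel assembleElement are hypothetical placeholders, and the paper's actual implementation (task runtime, data layout, intra-leaf coloring) may differ.

```cpp
#include <vector>
// Build with -fopenmp to enable the task pragmas; without it the code runs serially.

// Placeholder types: a real code would hold element connectivity and a sparse
// global matrix. They only stand in so the sketch compiles and runs.
struct Element      { int id = 0; };
struct GlobalSystem { std::vector<double> values; };

// Hypothetical per-element kernel: accumulates one element's contribution into
// the global system (this is NOT the paper's actual FEM kernel).
void assembleElement(const Element& e, GlobalSystem& sys) {
    sys.values[e.id] += 1.0;   // stands in for the element integration and scatter
}

// One node of the D&C tree: two halves writing to disjoint vertices, plus a
// separator holding the elements that touch vertices shared by both halves.
struct DCNode {
    DCNode* left  = nullptr;                 // nullptr at a leaf
    DCNode* right = nullptr;
    std::vector<const Element*> leafElems;   // elements owned by a leaf
    std::vector<const Element*> separator;   // assembled after both children
};

// Recursive task-parallel assembly: the two halves run concurrently because
// they touch disjoint vertices; the separator is assembled only once both
// halves have finished, so no locks or per-thread buffers are needed.
void dcAssemble(const DCNode* node, GlobalSystem& sys) {
    if (!node->left && !node->right) {            // leaf: small, cache-resident
        for (const Element* e : node->leafElems)  // coloring inside the leaf can
            assembleElement(*e, sys);             // expose vectorization
        return;
    }
    #pragma omp task shared(sys)
    dcAssemble(node->left, sys);
    #pragma omp task shared(sys)
    dcAssemble(node->right, sys);
    #pragma omp taskwait                          // wait for both halves
    for (const Element* e : node->separator)      // shared boundary is now safe
        assembleElement(*e, sys);
}

int main() {
    // Tiny two-leaf tree just to exercise the recursion.
    Element e0{0}, e1{1}, e2{2};
    DCNode leftLeaf, rightLeaf, root;
    leftLeaf.leafElems  = { &e0 };
    rightLeaf.leafElems = { &e1 };
    root.left = &leftLeaf;  root.right = &rightLeaf;
    root.separator = { &e2 };

    GlobalSystem sys;
    sys.values.assign(3, 0.0);
    #pragma omp parallel
    #pragma omp single
    dcAssemble(&root, sys);
    return 0;
}
```

In the hybrid scheme described by the abstract, each MPI domain would hold one such tree; leaf sizes are chosen so a leaf stays cache-resident, and coloring the elements inside a leaf removes the remaining intra-leaf conflicts so the per-element kernel can use vector units.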
