Paper information - Fork-Join and Data-Driven Execution Models on Multi-core Architectures: Case Study of the FMM


Extracting maximum performance from multi-core architectures is difficult, primarily because of the bandwidth limitations of the memory subsystem and its complex hierarchy. In this work, we study the implications of the fork-join and data-driven execution models for this type of architecture at the level of task parallelism. For this purpose, we take a highly optimized fork-join implementation of the fast multipole method (FMM) and extend it to a data-driven implementation using a distributed task scheduling approach. The study exposes limitations of the conventional fork-join implementation in terms of synchronization overheads. We find that these overheads are not negligible and that eliminating them with the data-driven method, combined with a careful data locality strategy, is beneficial. Experimental evaluation of both methods on state-of-the-art multi-socket multi-core architectures shows speed-ups of up to 22% for the data-driven approach over the original method. We demonstrate that data-driven execution of the FMM not only improves performance by avoiding global synchronization overheads but also reduces the memory-bandwidth pressure caused by memory-intensive computations.
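The difference between the two execution models can be illustrated with a minimal sketch. It is not the implementation studied in the paper: the task graph `DEPS`, the level list `LEVELS`, the `kernel` function, and both runner functions are hypothetical names chosen for illustration, and a plain thread pool stands in for the distributed task scheduler. The fork-join variant waits at a global barrier after every tree level, while the data-driven variant starts each task as soon as its own inputs are ready.

```python
from concurrent.futures import ThreadPoolExecutor
import threading

# Hypothetical toy task graph: task name -> tasks whose results it needs.
DEPS = {
    "leaf0": [], "leaf1": [], "leaf2": [], "leaf3": [],
    "node0": ["leaf0", "leaf1"],
    "node1": ["leaf2", "leaf3"],
    "root": ["node0", "node1"],
}
# The same tasks grouped by tree level, as a fork-join code would visit them.
LEVELS = [["leaf0", "leaf1", "leaf2", "leaf3"], ["node0", "node1"], ["root"]]


def kernel(inputs):
    """Toy kernel: a leaf contributes 1, an inner node sums its inputs."""
    return sum(inputs) if inputs else 1


def run_fork_join(workers=4):
    """Run level by level; every level ends with a global barrier."""
    results, barriers = {}, 0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for level in LEVELS:
            inputs = [[results[d] for d in DEPS[name]] for name in level]
            # Fork: all tasks of the level run in parallel.
            values = list(pool.map(kernel, inputs))
            # Join: collecting every value is the global synchronization.
            results.update(zip(level, values))
            barriers += 1
    return results, barriers


def run_data_driven(workers=4):
    """Start each task as soon as its own dependencies have finished."""
    results = {}
    lock = threading.Lock()
    done = threading.Event()
    remaining = {name: len(deps) for name, deps in DEPS.items()}
    dependents = {name: [] for name in DEPS}
    for name, deps in DEPS.items():
        for dep in deps:
            dependents[dep].append(name)
    initial = [name for name, deps in DEPS.items() if not deps]
    pool = ThreadPoolExecutor(max_workers=workers)

    def run(name):
        with lock:
            inputs = [results[d] for d in DEPS[name]]
        value = kernel(inputs)
        ready = []
        with lock:
            results[name] = value
            # Release only the tasks that were waiting on this result.
            for child in dependents[name]:
                remaining[child] -= 1
                if remaining[child] == 0:
                    ready.append(child)
            finished = len(results) == len(DEPS)
        for child in ready:
            pool.submit(run, child)
        if finished:
            done.set()

    for name in initial:
        pool.submit(run, name)
    done.wait()  # single wait at the very end, no per-level barrier
    pool.shutdown()
    return results
```

Both runners compute the same values; in this sketch, `node0` may start in the data-driven run while `leaf2` and `leaf3` are still executing, whereas the fork-join run cannot start any inner node until all four leaves have finished. The sketch does not model the data locality strategy or the memory-bandwidth effects discussed in the abstract.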
