Fork-Join and Data-Driven Execution Models on Multi-core Architectures: Case Study of the FMM

Extracting maximum performance from multi-core architectures is difficult, primarily because of the bandwidth limitations of the memory subsystem and its complex hierarchy. In this work, we study the implications of fork-join and data-driven execution models on this type of architecture at the level of task parallelism. For this purpose, we take a highly optimized fork-join implementation of the fast multipole method (FMM) and extend it to a data-driven implementation using a distributed task scheduling approach. The study exposes limitations of the conventional fork-join implementation in terms of synchronization overheads: these are not negligible, and eliminating them with the data-driven method, combined with a careful data locality strategy, is beneficial. Experimental evaluation of both methods on state-of-the-art multi-socket multi-core architectures shows speed-ups of up to 22% for the data-driven approach over the original method. We demonstrate that data-driven execution of the FMM not only improves performance by avoiding global synchronization overheads but also reduces the memory-bandwidth pressure caused by memory-intensive computations.
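To make the contrast between the two execution models concrete, the following is a minimal sketch of an idealized cost model, not the implementation studied in this work. The task graph, the task names (loosely borrowed from FMM kernel names), the durations, and both helper functions are hypothetical, and the model assumes enough cores to run every ready task at once with zero scheduling overhead. Fork-join places a global barrier after each phase, so each phase costs as much as its slowest task; data-driven execution starts a task as soon as its own dependencies have finished.

```python
# Hypothetical task graph: name -> (duration, dependencies).
# Names echo FMM kernels, but the graph and the costs are invented
# purely for illustration.
TASKS = {
    "P2M_a": (1.0, []),
    "P2M_b": (4.0, []),
    "M2L_a": (2.0, ["P2M_a"]),
    "M2L_b": (1.0, ["P2M_b"]),
    "L2P_a": (3.0, ["M2L_a"]),
    "L2P_b": (1.0, ["M2L_b"]),
}

# Fork-join view of the same work: one parallel region per kernel type,
# with a global barrier between consecutive regions.
PHASES = [
    ["P2M_a", "P2M_b"],
    ["M2L_a", "M2L_b"],
    ["L2P_a", "L2P_b"],
]


def fork_join_makespan(tasks, phases):
    """Total time when every phase waits for its slowest task (barrier)."""
    return sum(max(tasks[name][0] for name in phase) for phase in phases)


def data_driven_makespan(tasks):
    """Total time when each task starts once its own dependencies finish."""
    finish = {}

    def finish_time(name):
        # Memoized longest-path computation over the dependency graph.
        if name not in finish:
            duration, deps = tasks[name]
            start = max((finish_time(dep) for dep in deps), default=0.0)
            finish[name] = start + duration
        return finish[name]

    return max(finish_time(name) for name in tasks)


if __name__ == "__main__":
    print("fork-join  :", fork_join_makespan(TASKS, PHASES))
    print("data-driven:", data_driven_makespan(TASKS))
```

In this toy graph the two chains are imbalanced in different phases, so the barriers force each phase to wait on a different straggler, while the data-driven schedule is bounded only by the longest dependency chain. The numbers say nothing about the speed-ups measured in the study; they only illustrate where global synchronization can cost time.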
