Parallelization of Hierarchical Matrix Algorithms for Electromagnetic Scattering Problems

Numerical solution methods for electromagnetic scattering problems lead to large systems of equations with millions or even billions of unknowns. The coefficient matrices are dense, so direct methods incur large computational costs and storage requirements. A commonly used technique is to instead form a hierarchical representation of the parts of the matrix that correspond to far-field interactions. The overall computational cost and storage requirements can then be reduced to \(\mathcal{O}(N\log N)\). Even so, the simulation remains large-scale and requires a parallel implementation. The hierarchical algorithms are rather complex, both in their data dependencies and in their communication patterns, which makes parallelization non-trivial. In this chapter, we describe two classes of algorithms in some detail, survey existing solutions, show results for a proof-of-concept implementation, and offer perspectives on different aspects of the problem.
