Optimization of an Electromagnetics Code with Multicore Wavefront Diamond Blocking and Multi-dimensional Intra-Tile Parallelization

Understanding and optimizing the properties of solar cells is becoming a key issue in the search for alternatives to nuclear and fossil energy sources. A theoretical analysis via numerical simulations involves solving Maxwell's equations in discretized form and typically requires substantial computing effort. We start from a hybrid-parallel (MPI+OpenMP) production code that implements the Time Harmonic Inverse Iteration Method (THIIM) with Finite-Difference Frequency Domain (FDFD) discretization. Although this algorithm has the characteristics of a strongly bandwidth-bound stencil update scheme, it differs significantly from the popular stencil types that have been studied exhaustively in the high-performance computing literature to date. We apply a recently developed stencil optimization technique, multicore wavefront diamond tiling with multi-dimensional cache block sharing, and describe in detail the peculiarities that must be considered because of the special stencil structure. Concurrency in updating the components of the electric and magnetic fields provides an additional level of parallelism. The dependence of the cache size requirement of the optimized code on the blocking parameters is modeled accurately, and an auto-tuner searches for optimal configurations in the remaining parameter space. We were able to completely decouple the execution from the memory bandwidth bottleneck, accelerating the implementation by a factor of three to four compared to an optimal implementation with pure spatial blocking on an 18-core Intel Haswell CPU.
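To illustrate why blocking in time reduces memory traffic for a bandwidth-bound stencil, the following is a minimal sketch and not the THIIM kernel or the wavefront diamond scheme used in this work. It applies a simplified relative of diamond tiling, temporal blocking with upright and inverted triangular tiles, to a 1D three-point averaging stencil with fixed boundary values. The stencil, the function names (`naive_sweeps`, `blocked_sweeps`), and the parameters `width` and `height` are assumptions made for illustration only. Each tile advances a small part of the grid by several time steps before moving on, so its data could stay in cache across those steps.

```python
def naive_sweeps(u0, steps):
    """Reference: one full sweep over the grid per time step."""
    src, dst = list(u0), list(u0)
    for _ in range(steps):
        for i in range(1, len(src) - 1):
            dst[i] = (src[i - 1] + src[i] + src[i + 1]) / 3.0
        src, dst = dst, src
    return src


def blocked_sweeps(u0, steps, width, height):
    """Same result, traversed in triangular space-time tiles.

    width  : spatial block size (hypothetical tuning parameter)
    height : time steps advanced per tile (hypothetical tuning parameter)
    """
    n = len(u0)
    # Simplifying assumptions of this sketch, not of the technique in general.
    assert (n - 2) % width == 0, "interior must split into equal blocks"
    assert width >= 2 * height, "blocks must be wide enough for the tile height"

    buf = [list(u0), list(u0)]        # buf[t % 2] holds time level t
    edges = list(range(1, n, width))  # block boundaries: 1, 1 + width, ..., n - 1

    def update(t, lo, hi):
        """Advance points lo..hi-1 from time level t to t + 1."""
        src, dst = buf[t % 2], buf[(t + 1) % 2]
        for i in range(max(lo, 1), min(hi, n - 1)):
            dst[i] = (src[i - 1] + src[i] + src[i + 1]) / 3.0

    t0 = 0
    while t0 < steps:
        h = min(height, steps - t0)
        # Phase 1: upright triangles. Each block shrinks by one point per side
        # per time step, so it only reads values it has already produced.
        for lo, hi in zip(edges[:-1], edges[1:]):
            for s in range(h):
                update(t0 + s, lo + s, hi - s)
        # Phase 2: inverted triangles centered on the block boundaries fill
        # the gaps left by phase 1, growing by one point per side per step.
        for b in edges:
            for s in range(1, h):
                update(t0 + s, b - s, b + s)
        t0 += h
    return buf[steps % 2]
```

For example, `blocked_sweeps(u0, 10, 8, 3)` on a 34-point grid visits the same updates as `naive_sweeps(u0, 10)` in a different order. Because all tiles within one phase are independent of each other, they could in principle be assigned to different threads; the multicore wavefront and intra-tile parallelization described in the abstract go well beyond this sketch.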
