Performance Engineering of Numerical Software on Multi- and Manycore Processors

The goal of performance engineering is to make resource usage a controllable property of software. This thesis contributes to the field from the perspective of a computer scientist involved in scientific computing. It first lays the necessary foundation: an understanding of how numerical software is developed and of how performance emerges at the hardware/software interface. Also required are methods to model, predict, analyze, measure, and assess software performance, together with a tool set of optimization techniques to improve it. The thesis then describes how established and novel approaches are adapted and applied to numerical algorithms and applications.

The first study ports a fluid simulation in complex geometries, based on a lattice Boltzmann method, to the Cell Broadband Engine Architecture in order to profit from the high arithmetic and memory throughput of its unique heterogeneous design. Although the previous data structures provided a good compromise between flexibility and regularity for cache-based architectures, a thorough adaptation to the peculiarities of the architecture's accelerator cores and subsequent low-level optimization are required. The Cell variant of the simulation nearly saturates the memory interface when only about half of the accelerator cores are employed. Using single precision on an IBM QS 20 Cell blade, the program updates 200 million fluid lattice cells per second for a simple channel flow and 90 million when simulating the flow through a real intracranial vessel geometry with an aneurysm.

The batch variant of the orthogonal matching pursuit algorithm makes it possible to efficiently derive sparse representations for a large number of signals. It is based on common linear algebra operations on dense and sparse vectors and matrices.
Eventually, comparable performance is achieved on a dual-socket system with a total of twelve cores (Intel Westmere microarchitecture), on an IBM QS 20 Cell blade, and on an NVIDIA GeForce GTX 480 graphics card. Each architecture, however, needs a different approach to get there. On the general-purpose processor, a C99 implementation that has been optimized primarily at the algorithmic level is accelerated by adding compiler hints and tuning the compilation process. Preliminary tests show that this approach yields disappointing performance on the Cell Broadband Engine Architecture. Therefore, a complete re-implementation is developed that employs a sophisticated memory layout to enable platform-specific low-level optimizations. The graphics card demands a reorganization of the algorithm to enable massive parallelism and a modified data layout to enable data access patterns that fit the architecture.

Stencil computations on regular grids are often memory bound. They can then be accelerated considerably only by temporal blocking techniques, which fuse several such operations in order to reduce the associated memory transfer. As temporal blocking is generally tedious to implement, a novel generic approach is proposed. It enables temporal blocking on cache-based architectures as well as on the Cell Broadband Engine Architecture, whose accelerator cores operate only on scratchpad memory; it can be strongly supported by a cross-platform framework; and it allows for good prediction of the resulting main memory transfer. Implementations of a correction scheme multigrid method to solve Poisson’s equation and of a Full Approximation Scheme multigrid method to solve a diffusive partial differential equation with complex numbers are described, analyzed, and measured on a Playstation 3 and on a common dual-socket workstation (Intel Harpertown microarchitecture).
For the Poisson multigrid method, the predicted and measured amounts of memory transfer agree well on both test platforms, and its components are at least 1.5 times and up to three times as fast as optimistic estimates for alternative implementations that do not block temporally. For the complex diffusion multigrid method, the prediction of data transfer agrees well on the Playstation 3 and is rather close on the Xeon system. The kernels, however, could not be accelerated sufficiently: on the latter platform, at least highly optimized alternatives could be competitive, and on the Playstation 3 there is no benefit in terms of run time.

The design and implementation of a program that simulates heat conduction within rolls in hot rolling mills in real time with the aid of graphics cards conclude the list of examples. As this application requires outstanding run-time performance, a holistic co-design approach is taken that matches model, numerical method, implementation, and target platform. This study therefore not only derives and verifies a fast approach to numerically solve the heat equation in cylindrical geometries with rapidly changing boundary conditions, but also demonstrates the application and potential of performance engineering principles at their best.
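
The temporal blocking idea summarized above for stencil computations can be illustrated with a minimal sketch. It is not the generic approach proposed in the thesis; it assumes a one-dimensional three-point averaging stencil with fixed boundary values and uses simple overlapped tiling, in which each block redundantly recomputes a halo. All function names (`jacobi_sweep`, `sweeps_naive`, `sweeps_blocked`) are hypothetical and chosen only for this illustration.

```python
def jacobi_sweep(u):
    """Apply one 3-point averaging sweep; the first and last entries stay fixed."""
    v = u[:]
    for i in range(1, len(u) - 1):
        v[i] = (u[i - 1] + u[i] + u[i + 1]) / 3.0
    return v


def sweeps_naive(u, steps):
    """Reference: every sweep traverses the whole grid once."""
    for _ in range(steps):
        u = jacobi_sweep(u)
    return u


def sweeps_blocked(u, steps, block):
    """Overlapped temporal blocking (illustrative assumption, not the thesis method).

    Each block is loaded once together with a halo of `steps` cells per side,
    all `steps` sweeps are fused on this small tile, and only the block
    interior is written back. Halo cells become invalid one cell per sweep,
    so after `steps` sweeps exactly the halo is invalid and is discarded.
    """
    n = len(u)
    out = u[:]
    for start in range(0, n, block):
        stop = min(start + block, n)
        lo = max(start - steps, 0)   # clipped at the true grid boundary
        hi = min(stop + steps, n)
        tile = u[lo:hi]              # single load of block plus halo
        for _ in range(steps):
            tile = jacobi_sweep(tile)
        out[start:stop] = tile[start - lo:stop - lo]
    return out


if __name__ == "__main__":
    grid = [float((7 * i) % 11) for i in range(50)]
    reference = sweeps_naive(grid, 4)
    blocked = sweeps_blocked(grid, 4, 8)
    print(max(abs(a - b) for a, b in zip(reference, blocked)))
```

In this sketch, the naive variant reads the full grid once per sweep, whereas the blocked variant reads each cell roughly once plus `2 * steps` halo cells per block, at the price of redundant halo computation. On real hardware the benefit depends on the tile fitting into cache or scratchpad memory.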
