Generating and auto-tuning parallel stencil codes

In this thesis, we present a software framework, Patus, which generates high performance stencil codes for different types of hardware platforms, including current multicore CPU and graphics processing unit architectures. The ultimate goals of the framework are productivity, portability (of both the code and its performance), and high performance on the target platform. A stencil computation updates every grid point in a structured grid based on the values of its neighboring points. This class of computations occurs frequently in scientific and general purpose computing (e.g., in partial differential equation solvers or in image processing), which justifies the focus on this kind of computation. The proposed key ingredients for achieving productivity, portability, and performance are domain specific languages (DSLs) and the auto-tuning methodology. The Patus stencil specification DSL allows the programmer to express a stencil computation concisely and independently of hardware architecture-specific details. It thus increases programmer productivity by relieving the programmer of low-level programming model issues and of manually applying hardware platform-specific code optimization techniques. The use of domain specific languages also implies code reusability: once implemented, the same stencil specification can be reused on different hardware platforms, i.e., the specification code is portable across hardware architectures. Gearing the language towards a special purpose makes it amenable to more aggressive optimizations and therefore to potentially higher performance. Auto-tuning provides performance and performance portability by automatically adapting implementation-specific parameters to the characteristics of the hardware on which the code will run.
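To make the definition of a stencil computation concrete, the following minimal Python sketch applies one sweep of a 5-point averaging stencil to a small 2D grid. It is an illustration only: the function name `jacobi_step`, the averaging weights, and the fixed boundary handling are assumptions chosen for the example, and the code is not written in the Patus DSL nor generated by Patus.

```python
def jacobi_step(grid):
    """Apply one sweep of a 5-point averaging stencil to the interior of a 2D grid.

    Hypothetical example: each interior point becomes the mean of its four
    direct neighbors; boundary points are left unchanged.
    """
    rows, cols = len(grid), len(grid[0])
    new = [row[:] for row in grid]  # copy, so boundary values carry over
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            # Read only from the old grid, write only to the new grid.
            new[i][j] = 0.25 * (
                grid[i - 1][j] + grid[i + 1][j] + grid[i][j - 1] + grid[i][j + 1]
            )
    return new


# A 4x4 grid that is zero everywhere except for a "hot" top boundary row.
grid = [[0.0] * 4 for _ in range(4)]
grid[0] = [4.0] * 4
result = jacobi_step(grid)
```

Every interior point is updated by the same rule applied to a fixed neighborhood, which is the regularity that a stencil-specific language and code generator can exploit.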
Parameter tuning essentially amounts to solving an integer programming problem in which the objective function is the code's performance as a function of the parameter configuration. By automating this process, the system can also be used more productively than if the programmer had to fine-tune the code manually. We show performance results for a variety of stencils for which Patus was used to generate the corresponding implementations. The selection includes stencils taken from two real-world applications: a simulation of the temperature within the human body during hyperthermia cancer treatment and a seismic application. These examples demonstrate the framework's flexibility and its ability to produce high performance code.
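The view of parameter tuning as an optimization over integer configurations can be sketched in a few lines of Python. This is a hedged illustration, not the search method used by Patus: the exhaustive search, the parameter names (`block_x`, `block_y`, `unroll`), and the synthetic objective function are all assumptions made for the example. In practice the objective would be a measured benchmark result for the generated code.

```python
import itertools


def autotune(param_ranges, measure):
    """Return (best_config, best_score) maximizing `measure` over all
    combinations of the integer parameter values in `param_ranges`.

    Exhaustive search is used only because it is the simplest strategy.
    """
    names = list(param_ranges)
    best_config, best_score = None, float("-inf")
    for values in itertools.product(*(param_ranges[n] for n in names)):
        config = dict(zip(names, values))
        score = measure(config)  # higher means better performance
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score


def synthetic_performance(config):
    # Hypothetical stand-in for a timed benchmark run. It peaks at
    # block sizes (32, 8) with an unrolling factor of 2.
    return (
        -abs(config["block_x"] - 32)
        - abs(config["block_y"] - 8)
        - abs(config["unroll"] - 2)
    )


search_space = {
    "block_x": [8, 16, 32, 64],
    "block_y": [4, 8, 16],
    "unroll": [1, 2, 4],
}
best, score = autotune(search_space, synthetic_performance)
```

Because the number of configurations grows multiplicatively with each parameter, exhaustive search quickly becomes impractical for larger search spaces, which is why heuristic or direct search methods are commonly used instead.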
