Autotuning divide‐and‐conquer stencil computations

This paper explores autotuning strategies for serial divide‐and‐conquer stencil computations, comparing the efficacy of traditional “heuristic” autotuning with that of “pruned‐exhaustive” autotuning. We present a pruned‐exhaustive autotuner called Ztune that searches for optimal divide‐and‐conquer trees for stencil computations. Ztune uses three pruning properties (space‐time equivalence, divide subsumption, and favored dimension) that greatly reduce the size of the search domain without significantly sacrificing the quality of the autotuned code. We compared the performance of Ztune with that of a state‐of‐the‐art heuristic autotuner called OpenTuner in tuning the divide‐and‐conquer algorithm used in the Pochoir stencil compiler. Over a nightly run on ten application benchmarks across two machines with different hardware configurations, the Ztuned code ran 5% to 12% faster on average than Pochoir's default code, whereas the OpenTuner‐tuned code ranged from 9% slower to 2% faster on average. In the best case, the Ztuned code ran 40% faster than Pochoir's code, and the OpenTuner‐tuned code ran 33% faster. Ztune's autotuning time for each benchmark could be measured in minutes, whereas OpenTuner typically needed hours or days to achieve comparable results. Surprisingly, for some benchmarks, Ztune's autotuning took less time than performing the stencil computation once.
