Portable and productive high-performance computing

Performance portability of computer programs and programmer productivity in writing them are key expectations in software engineering. These expectations lead to the following questions: Can programmers write code once and execute it at optimal speed on any machine configuration? Can programmers write parallel code against simple models that hide the complex details of parallel programming? This thesis addresses these questions for certain "classes" of computer programs. It describes "autotuning" techniques that achieve performance portability for serial divide-and-conquer programs, and an abstraction that improves programmer productivity in writing parallel code for a class of programs called "Star".
We present a "pruned-exhaustive" autotuner called Ztune that optimizes the performance of serial divide-and-conquer programs for a given machine configuration. Whereas the traditional way of autotuning divide-and-conquer programs simply coarsens the base case of the recursion optimally, Ztune searches for optimal divide-and-conquer trees. Although Ztune in principle enumerates the search domain exhaustively, it uses pruning properties that greatly reduce the size of the search domain without significantly sacrificing the quality of the autotuned code. We illustrate how to autotune divide-and-conquer stencil computations using Ztune, and present performance comparisons with state-of-the-art "heuristic" autotuning. Not only does Ztune autotune significantly faster than a heuristic autotuner, but the Ztuned programs also run faster on average than their heuristically tuned counterparts. Surprisingly, for some stencil benchmarks, Ztune autotuned in less time than it takes to execute the stencil computation once.
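As a point of reference, the traditional approach mentioned above, coarsening the base case of a recursion, can be sketched in a few lines. This is not Ztune: it tunes a single global threshold rather than searching over divide-and-conquer trees. The functions `dc_sum` and `tune_base_case`, the candidate sizes, and the choice of a summation kernel are all illustrative assumptions, not taken from the thesis.

```python
import time


def dc_sum(xs, lo, hi, base):
    """Sum xs[lo:hi] by recursive halving.

    Ranges of at most `base` elements (base >= 1) are handled by a plain loop,
    which is the "coarsened" base case of the recursion.
    """
    if hi - lo <= base:
        total = 0
        for i in range(lo, hi):
            total += xs[i]
        return total
    mid = (lo + hi) // 2
    return dc_sum(xs, lo, mid, base) + dc_sum(xs, mid, hi, base)


def tune_base_case(xs, candidates, repeats=3):
    """Time every candidate base-case size and return (best_size, timings).

    Hypothetical helper: each candidate is run `repeats` times and the
    minimum wall-clock time is kept, to damp measurement noise.
    """
    timings = {}
    for base in candidates:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            dc_sum(xs, 0, len(xs), base)
            best = min(best, time.perf_counter() - start)
        timings[base] = best
    best_base = min(timings, key=timings.get)
    return best_base, timings


data = list(range(1 << 14))
candidates = [1, 4, 16, 64, 256, 1024]
best_base, timings = tune_base_case(data, candidates)
print("fastest base-case size on this machine:", best_base)
```

The winning size depends on the machine and on the Python runtime, which is the point of autotuning: the same source is retuned per configuration instead of hard-coding a threshold.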
We introduce the Star class, which includes many seemingly different programs, such as solving symmetric, diagonally dominant tridiagonal systems, executing "watershed" cuts on graphs, sample sort, fast multipole computations, and all-prefix-sums and its various applications. We present a programming model, also called Star, to generate and execute parallel code for the Star class of programs. The Star model abstracts the pattern of computation and interprocessor communication in the Star class of programs, hides low-level parallel programming details, and offers ease of expression, thereby improving programmer productivity in writing parallel code. We also present parallel algorithms that offer asymptotic improvements over prior art for two programs in the Star class: a Trip algorithm for solving symmetric, diagonally dominant tridiagonal systems, and a Wasp algorithm for executing watershed cuts on graphs. The Star model is implemented in the Julia programming language and leverages Julia's ability to express parallelism concisely and to support both shared-memory and distributed-memory parallel programming.
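To make one member of this class concrete, the sketch below computes all-prefix-sums block by block: independent local scans, a small scan over the per-block totals, and a final local offset. This is the standard blocked scan, written sequentially in Python for readability. It is not the Star model's interface, and whether Star structures the computation in exactly these phases is an assumption. The name `blocked_prefix_sums` and its parameters are hypothetical.

```python
from itertools import accumulate


def blocked_prefix_sums(xs, num_blocks):
    """Inclusive prefix sums of xs, computed in three phases over blocks.

    Each block stands in for the data owned by one processor; the loops over
    blocks are independent and could run in parallel. Requires num_blocks >= 1.
    """
    n = len(xs)
    if n == 0:
        return []
    size = -(-n // num_blocks)  # ceiling division: elements per block
    blocks = [xs[i:i + size] for i in range(0, n, size)]

    # Phase 1: every block scans its own data independently.
    local = [list(accumulate(block)) for block in blocks]

    # Phase 2: exclusive scan over the block totals (the only step that
    # combines information from different blocks).
    totals = [scan[-1] for scan in local]
    offsets = [0] + list(accumulate(totals))[:-1]

    # Phase 3: every block adds the sum of all preceding blocks to its scan.
    return [value + offset
            for scan, offset in zip(local, offsets)
            for value in scan]


print(blocked_prefix_sums([3, 1, 4, 1, 5, 9, 2, 6], 3))
# [3, 4, 8, 9, 14, 23, 25, 31]
```

Only phase 2 touches data from more than one block, and it operates on one value per block, which is why this pattern keeps communication small relative to local work.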
