Domain-specific translator and optimizer for massive on-chip parallelism

Future supercomputers will rely on massive on-chip parallelism that requires dramatic changes to node architecture. Node architectures will become more heterogeneous and hierarchical, with software-managed on-chip memory becoming more prevalent. To meet performance expectations, application software will undergo extensive redesign. In response, support from programming models is crucial to help scientists adopt new technologies without significant programming effort. In this dissertation, we address the programming issues of a massively parallel single-chip processor with software-managed memory. We propose the Mint programming model and domain-specific compiler as a means of simplifying application development. Mint abstracts the programmer's view of the hardware by providing a high-level interface to low-level, architecture-specific optimizations. The Mint model requires only modest recoding of the application and is based on a small number of compiler directives, which are sufficient to take advantage of massive parallelism. We have implemented the Mint model on a concrete instance of a massively parallel single-chip processor: the Nvidia GPU (Graphics Processing Unit). The Mint source-to-source translator accepts C source with Mint annotations and generates CUDA C. The translator includes a domain-specific optimizer targeting stencil methods, which arise in image processing applications and in a wide range of partial differential equation solvers. The Mint optimizer performs data-locality optimizations and uses on-chip memory to reduce off-chip memory accesses, which is particularly useful for stencil methods. We have demonstrated the effectiveness of Mint on a set of widely used stencil kernels and three real-world applications: an earthquake-induced seismic wave propagation code, an interest point detection algorithm for volume datasets, and a model for signal propagation in cardiac tissue. In cases where hand-coded implementations are available, we have verified that Mint delivers competitive performance, realizing around 80% of the performance of hand-optimized CUDA implementations of the kernels and applications on the Tesla C1060 and C2050 GPUs. By facilitating high-level management of on-chip parallelism and the memory hierarchy, Mint enables computational scientists to shorten their software development time. Furthermore, by performing domain-specific optimizations, Mint delivers high performance for stencil methods.
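To make the approach concrete, the sketch below pairs a plain C 7-point Laplacian stencil, the style of loop nest a Mint user annotates, with an illustrative CUDA kernel that stages each 2D slice plus its halo in on-chip shared memory. The directive comment, the tile size, the function names, and the kernel structure are illustrative assumptions chosen for this sketch; they are not a reproduction of Mint's directive syntax or of the code its translator actually emits.

/* Illustrative 7-point 3D stencil (Jacobi-style Laplacian sweep) written as a
   plain C loop nest -- the style of code a Mint user annotates.  The pragma in
   the comment is schematic, not Mint's exact directive syntax. */
void stencil_sweep(const float *in, float *out, int n)
{
    /* #pragma mint for nest(all) tile(16,16,1)   <-- schematic annotation */
    for (int k = 1; k < n - 1; k++)
        for (int j = 1; j < n - 1; j++)
            for (int i = 1; i < n - 1; i++) {
                int c = k * n * n + j * n + i;          /* linearized (k, j, i) */
                out[c] = in[c - 1] + in[c + 1]          /* x neighbors */
                       + in[c - n] + in[c + n]          /* y neighbors */
                       + in[c - n * n] + in[c + n * n]  /* z neighbors */
                       - 6.0f * in[c];
            }
}

/* Hand-written sketch of a CUDA kernel that applies the same stencil while
   staging a 2D tile plus halo in on-chip shared memory.  It illustrates the
   kind of data-locality optimization described above, not the code the Mint
   translator generates.  Assumes (n - 2) is a multiple of TILE and a launch of
   dim3 block(TILE, TILE), dim3 grid((n - 2) / TILE, (n - 2) / TILE). */
#define TILE 16

__global__ void stencil_kernel(const float *in, float *out, int n)
{
    __shared__ float tile[TILE + 2][TILE + 2];          /* one slice + halo */
    int i  = blockIdx.x * TILE + threadIdx.x + 1;       /* interior x index */
    int j  = blockIdx.y * TILE + threadIdx.y + 1;       /* interior y index */
    int tx = threadIdx.x + 1;
    int ty = threadIdx.y + 1;

    for (int k = 1; k < n - 1; k++) {                   /* march along z */
        int c = k * n * n + j * n + i;

        tile[ty][tx] = in[c];                           /* stage the slice */
        if (threadIdx.x == 0)        tile[ty][0]        = in[c - 1];
        if (threadIdx.x == TILE - 1) tile[ty][TILE + 1] = in[c + 1];
        if (threadIdx.y == 0)        tile[0][tx]        = in[c - n];
        if (threadIdx.y == TILE - 1) tile[TILE + 1][tx] = in[c + n];
        __syncthreads();

        /* In-plane neighbors come from shared memory; z neighbors still come
           from global memory (a further optimization keeps them in registers). */
        out[c] = tile[ty][tx - 1] + tile[ty][tx + 1]
               + tile[ty - 1][tx] + tile[ty + 1][tx]
               + in[c - n * n] + in[c + n * n]
               - 6.0f * tile[ty][tx];
        __syncthreads();               /* before the next slice overwrites the tile */
    }
}

Staging the in-plane neighbors in shared memory reduces the per-point global-memory loads from seven to roughly three, which is the kind of data-locality saving the stencil optimizer described above is designed to exploit.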
