Program Optimization Strategies for Data-Parallel Many-Core Processors

Program optimization for highly parallel systems has historically been considered an art, with experts doing much of the performance tuning by hand. With the introduction of inexpensive, single-chip, massively parallel platforms, more developers will be creating highly data-parallel applications for these platforms without the substantial experience and knowledge needed to maximize application performance. In addition, hand-optimization, even by motivated and informed developers, takes a significant amount of time and generally still leaves a double-digit percentage of the hardware's performance unused. This creates a need for structured, automatable optimization techniques capable of finding a near-optimal program configuration for this new class of architecture. My work discusses strategies for optimizing programs on a highly data-parallel architecture with fine-grained sharing of resources. I first investigate strategies that prove useful when optimizing a suite of applications. I then introduce program optimization carving, an approach that discovers high-performance application configurations for data-parallel, many-core architectures. Instead of applying a particular phase ordering of optimizations, it starts with an optimization space of major transformations and then reduces that space by examining the static code and pruning configurations that do not maximize desirable qualities, either in isolation or in combination. Careful selection of pruning criteria for applications running on the NVIDIA GeForce 8800 GTX reduces the optimization space by as much as 98% while finding configurations within 1% of the best performance. Random
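The pruning idea described above can be illustrated with a minimal sketch. It assumes one plausible reading of "do not maximize desirable qualities in isolation or combination": keep only configurations whose static scores are not dominated by another configuration on every quality. The optimization space (tile size, unroll factor), the function names `dominates`, `carve`, and `toy_score`, and the scoring formulas are all hypothetical and chosen only for illustration; they are not the metrics or the procedure used in the work itself.

```python
from itertools import product


def dominates(a, b):
    """True if score tuple a is at least as good as b on every quality
    and strictly better on at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def carve(space, score):
    """Keep the configurations whose static scores no other configuration dominates."""
    scored = [(cfg, score(cfg)) for cfg in space]
    return [cfg for cfg, s in scored
            if not any(dominates(t, s) for _, t in scored)]


# Hypothetical optimization space: every (tile size, unroll factor) pair.
space = list(product([4, 8, 16], [1, 2, 4]))


def toy_score(cfg):
    """Made-up static qualities, both to be maximized (assumption, not from the source)."""
    tile, unroll = cfg
    work_per_thread = unroll * tile / (tile + 1)   # grows with tiling and unrolling
    threads_resident = 64 // (tile * unroll)       # shrinks as each thread uses more resources
    return (work_per_thread, threads_resident)


kept = carve(space, toy_score)
print(f"kept {len(kept)} of {len(space)} configurations: {kept}")
```

Only the surviving configurations would then need to be compiled and timed. In this toy space the two qualities pull in opposite directions, so the pruning keeps the trade-off frontier rather than a single winner.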
