Kremlin: rethinking and rebooting gprof for the multicore age

Many recent parallelization tools lower the barrier for parallelizing a program, but overlook one of the first questions that a programmer needs to answer: which parts of the program should I spend time parallelizing? This paper examines Kremlin, an automatic tool that, given a serial version of a program, will make recommendations to the user as to what regions (e.g. loops or functions) of the program to attack first. Kremlin introduces a novel hierarchical critical path analysis and develops a new metric for estimating the potential of parallelizing a region: self-parallelism. We further introduce the concept of a parallelism planner, which provides a ranked order of specific regions to the programmer that are likely to have the largest performance impact when parallelized. Kremlin supports multiple planner personalities, which allow the planner to more effectively target a particular programming environment or class of machine. We demonstrate the effectiveness of one such personality, an OpenMP planner, by comparing versions of programs that are parallelized according to Kremlin's plan against third-party manually parallelized versions. The results show that Kremlin's OpenMP planner is highly effective, producing plans whose performance is typically comparable to, and sometimes much better than, manual parallelization. At the same time, these plans would require that the user parallelize significantly fewer regions of the program.

[1]  Chen Yang,et al.  A cost-driven compilation framework for speculative parallelization of sequential programs , 2004, PLDI '04.

[2]  Scott A. Mahlke,et al.  Uncovering hidden loop level parallelism in sequential applications , 2008, 2008 IEEE 14th International Symposium on High Performance Computer Architecture.

[3]  L. Rauchwerger,et al.  The LRPD Test: Speculative Run-Time Parallelization of Loops with Privatization and Reduction Parallelization , 1999, IEEE Trans. Parallel Distributed Syst..

[4]  Nathan R. Tallent,et al.  Effective performance measurement and analysis of multithreaded applications , 2009, PPoPP '09.

[5]  Peng Wu,et al.  Compiler-Driven Dependence Profiling to Guide Program Parallelization , 2008, LCPC.

[6]  Manoj Kumar,et al.  Measuring Parallelism in Computation-Intensive Scientific/Engineering Applications , 1988, IEEE Trans. Computers.

[7]  Xiangyu Zhang,et al.  Alchemist: A Transparent Dependence Distance Profiling Infrastructure , 2009, 2009 International Symposium on Code Generation and Optimization.

[8]  Rajesh Bordawekar,et al.  Modeling optimistic concurrency using quantitative dependence analysis , 2008, PPOPP.

[9]  Wei Liu,et al.  POSH: a TLS compiler that exploits program structure , 2006, PPoPP '06.

[10]  Todd M. Austin,et al.  Dynamic dependency analysis of ordinary programs , 1992, ISCA '92.

[11]  Wilson C. Hsieh,et al.  A framework for determining useful parallelism , 1988, ICS '88.

[12]  Yunheung Paek,et al.  Parallel Programming with Polaris , 1996, Computer.

[13]  David H. Bailey,et al.  The Nas Parallel Benchmarks , 1991, Int. J. High Perform. Comput. Appl..

[14]  Rajiv Gupta,et al.  Timestamped whole program path representation and its applications , 2001, PLDI '01.

[15]  James R. Larus,et al.  Loop-Level Parallelism in Numeric and Symbolic Programs , 1993, IEEE Trans. Parallel Distributed Syst..

[16]  Frank Tip,et al.  Refactoring for reentrancy , 2009, ESEC/FSE '09.

[17]  Qin Zhao,et al.  Umbra: efficient and scalable memory shadowing , 2010, CGO '10.

[18]  Charles E. Leiserson,et al.  The Cilk++ concurrency platform , 2009, 2009 46th ACM/IEEE Design Automation Conference.

[19]  Jesús Labarta,et al.  Interfacing Computer Aided Parallelization and Performance Analysis , 2003, International Conference on Computational Science.

[20]  Andreas Zeller,et al.  Profiling Java programs for parallelism , 2009, 2009 ICSE Workshop on Multicore Software Engineering.

[21]  Vivek Sarkar,et al.  Space-time scheduling of instruction-level parallelism on a raw machine , 1998, ASPLOS VIII.

[22]  Vikram S. Adve,et al.  LLVM: a compilation framework for lifelong program analysis & transformation , 2004, International Symposium on Code Generation and Optimization, 2004. CGO 2004..

[23]  Susan L. Graham,et al.  Gprof: A call graph execution profiler , 1982, SIGPLAN '82.

[24]  Steven Hall,et al.  Manipulating lossless video in the compressed domain , 2009, MM '09.

[25]  Yuxiong He,et al.  The Cilkview scalability analyzer , 2010, SPAA '10.

[26]  Kunle Olukotun,et al.  The Jrpm system for dynamically parallelizing Java programs , 2003, ISCA '03.

[27]  Amer Diwan,et al.  SUIF Explorer: an interactive and interprocedural parallelizer , 1999, PPoPP '99.

[28]  J. Mark Bull,et al.  A microbenchmark suite for OpenMP 2.0 , 2001, CARN.

[29]  Ken Kennedy,et al.  Interactive Parallel Programming using the ParaScope Editor , 1991, IEEE Trans. Parallel Distributed Syst..

[30]  Chen Ding,et al.  Fast Track: A Software System for Speculative Program Optimization , 2009, 2009 International Symposium on Code Generation and Optimization.

[31]  Rajiv Gupta,et al.  Copy or Discard execution model for speculative parallelization on multicores , 2008, 2008 41st IEEE/ACM International Symposium on Microarchitecture.

[32]  Xiangyu Zhang,et al.  Efficient online detection of dynamic control dependence , 2007, ISSTA '07.

[33]  Thomas E. Anderson,et al.  Quartz: a tool for tuning parallel program performance , 1990, SIGMETRICS '90.

[34]  Michael D. Ernst,et al.  Refactoring sequential Java code for concurrency via concurrent libraries , 2009, 2009 IEEE 31st International Conference on Software Engineering.

[35]  Monica S. Lam,et al.  Maximizing Multiprocessor Performance with the SUIF Compiler , 1996, Digit. Tech. J..

[36]  P. Sadayappan,et al.  Understanding parallelism-inhibiting dependences in sequential Java programs , 2010, 2010 IEEE International Conference on Software Maintenance.

[37]  Vivek Sarkar,et al.  X10: concurrent programming for modern architectures , 2007, PPOPP.

[38]  Yoichi Muraoka,et al.  On the Number of Operations Simultaneously Executable in Fortran-Like Programs and Their Resulting Speedup , 1972, IEEE Transactions on Computers.

[39]  Keshav Pingali,et al.  How much parallelism is there in irregular applications? , 2009, PPoPP '09.

[40]  Hyesoon Kim,et al.  SD3: A Scalable Approach to Dynamic Data-Dependence Profiling , 2010, 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture.

[41]  Nicholas Nethercote,et al.  Valgrind: a framework for heavyweight dynamic binary instrumentation , 2007, PLDI '07.

[42]  Michael F. P. O'Boyle,et al.  Towards a holistic approach to auto-parallelization: integrating profile-driven parallelism detection and machine-learning based mapping , 2009, PLDI '09.

[43]  Serge J. Belongie,et al.  SD-VBS: The San Diego Vision Benchmark Suite , 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).

[44]  Saturnino Garcia,et al.  Kremlin: like gprof, but for parallelization , 2011, PPoPP '11.