Estimating and Exploiting Potential Parallelism by Source-Level Dependence Profiling

Manual parallelization of programs is difficult and error-prone, and there are currently few ways to measure the potential parallelism in the original sequential code. We present an extension of Embla, a Valgrind-based dependence profiler that links dynamic dependences back to source code. The extended tool estimates the potential task-level parallelism in a sequential program and helps programmers exploit it at the source level. Using the popular fork-join model, it provides a realistic estimate of the potential speed-up from parallelization with frameworks such as Cilk, TBB or OpenMP 3.0. Estimates can be given for several parallelization models, which differ in the programmer effort they demand and in the capabilities they require of the underlying implementation. The tool also outputs source-level dependence information to aid the parallelization of programs with abundant inherent parallelism, and critical paths to suggest algorithmic rewrites of programs with little. We validate these claims by running the tool on serial elisions of sample Cilk programs, where it finds additional inherent parallelism not exploited by the Cilk code, and on serial C benchmarks, where the profiling results suggest parallelism-enhancing algorithmic rewrites.
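To make the fork-join speed-up estimate concrete, the following minimal sketch computes work, span, and their ratio for a single function body. It is an illustration under stated assumptions, not Embla's actual algorithm: the `Item` record, the `estimate` function, the cost figures, and the rule that a sync joins all outstanding spawned calls (as Cilk's `sync` does) are all hypothetical choices made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    """One statement or call in a function body (hypothetical model)."""
    name: str
    cost: int                 # profiled cost, e.g. instructions executed
    spawnable: bool = False   # True if this call may run as a spawned task
    deps: set = field(default_factory=set)  # names of earlier items it depends on


def estimate(items):
    """Return (work, span, speed-up bound) under a Cilk-like fork-join model.

    Assumptions: spawning is free, a spawned call runs in parallel with the
    continuation, and a sync waits for ALL outstanding spawned calls.
    """
    work = sum(it.cost for it in items)
    now = 0            # time on the main strand
    outstanding = {}   # spawned call name -> finish time
    for it in items:
        # A dependence on a still-running spawned call forces a sync first.
        if any(d in outstanding for d in it.deps):
            now = max(now, max(outstanding.values()))
            outstanding.clear()
        if it.spawnable:
            outstanding[it.name] = now + it.cost
        else:
            now += it.cost
    # Implicit sync at the end of the function body.
    span = max([now, *outstanding.values()])
    return work, span, work / span


# Hypothetical body: f and g are independent, add needs both, h needs add.
body = [
    Item("f", 40, spawnable=True),
    Item("g", 30, spawnable=True),
    Item("add", 1, deps={"f", "g"}),
    Item("h", 29, spawnable=True, deps={"add"}),
]
work, span, speedup = estimate(body)
print(work, span, round(speedup, 2))
```

In this made-up example only `f` and `g` overlap, so the estimated bound stays well below the number of spawned calls. Relaxing the sync rule (for instance, joining only the task actually depended upon) would correspond to a more capable parallelization model and could yield a shorter span.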
