Parallel speedup estimates for serial programs

Software engineers now face the difficult task of restructuring serial programs for parallel execution on multicore processors. Parallelization is complex and typically requires considerable time and effort, and even after extensive engineering, the resulting speedup is often fundamentally limited, either by a lack of parallelism in the target program or by the target platform's inability to exploit the parallelism that exists. Unfortunately, little guidance is available on how much benefit parallelization may bring, making it hard for a programmer to answer a critical question: "Should I parallelize my code?" In this dissertation, we examine the design and implementation of Kismet, a tool that produces parallel speedup estimates for unparallelized serial programs. Our approach differs from previous approaches in that it requires neither changes to nor manual analysis of the serial program, enabling quick profitability analysis and helping programmers make informed decisions in the initial stages of parallelization. To derive speedup estimates from serial programs, we developed a dynamic program analysis named hierarchical critical path analysis (HCPA). HCPA extends the classic critical path analysis technique to quantify the localized parallelism of each region in a program. Based on the parallelism information from HCPA, Kismet incorporates key parallelization constraints that can significantly affect parallel speedup, producing realistic speedup upper bounds. The results are compelling: Kismet significantly improves the accuracy of parallel speedup estimates relative to prior work based on critical path analysis.
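The intuition behind critical-path-based speedup estimation can be illustrated with a short sketch. The Python fragment below is a simplified illustration, not Kismet's actual algorithm: it treats a region's ideal parallelism as total work divided by the length of its critical path (the longest chain of dependent operations) and caps the achievable speedup by the core count of the target platform. The function names, region sizes, and the 32-core target are hypothetical.

    # Simplified sketch of critical-path-based speedup bounds (illustrative only).

    def region_parallelism(total_work, critical_path_length):
        # Ideal parallelism: total work divided by the longest dependence chain.
        return total_work / critical_path_length

    def speedup_upper_bound(total_work, critical_path_length, num_cores):
        # Parallel time can be no shorter than the critical path,
        # and no shorter than an even division of the work across the cores.
        parallel_time = max(critical_path_length, total_work / num_cores)
        return total_work / parallel_time

    # Example: a region performing 1,000,000 operations whose longest
    # dependence chain is 2,000 operations, on a hypothetical 32-core target.
    work, cp, cores = 1_000_000, 2_000, 32
    print(f"ideal parallelism: {region_parallelism(work, cp):.1f}")        # 500.0
    print(f"speedup bound on {cores} cores: {speedup_upper_bound(work, cp, cores):.1f}")  # 32.0

Even though the region's inherent parallelism is large (500), the core-count constraint limits the realistic speedup to 32; Kismet applies additional parallelization constraints in the same spirit to tighten its upper bounds.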
