Kismet: parallel speedup estimates for serial programs

Software engineers now face the difficult task of refactoring serial programs for parallel execution on multicore processors. Currently, they are offered little guidance as to how much benefit may come from this task, or how close they are to the best possible parallelization. This paper presents Kismet, a tool that creates parallel speedup estimates for unparallelized serial programs. Unlike previous approaches, Kismet does not require any manual analysis or modification of the program. This allows many programs to be analyzed quickly, avoiding wasted engineering effort on those whose parallelism is fundamentally limited. To accomplish this, Kismet builds upon hierarchical critical path analysis (HCPA), a recently developed dynamic analysis that localizes parallelism to each of the potentially nested regions in the target program. It then uses a parallel execution time model to compute an approximate upper bound on performance, modeling constraints that stem from both hardware parameters and internal program structure. Our evaluation applies Kismet to eight high-parallelism NAS Parallel Benchmarks running on a 32-core AMD multicore system, five low-parallelism SpecInt benchmarks, and six medium-parallelism benchmarks running on the fine-grained MIT Raw processor. Kismet significantly improves the accuracy of parallel speedup estimates relative to prior work based on critical path analysis.
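To make the idea of a region-based upper bound concrete, the following is a minimal sketch and not Kismet's actual model. It assumes a hypothetical profile in which each region carries its own serial work and a parallelism value localized to that region, and it caps each region's speedup by both that parallelism and the core count. The names `Region`, `self_work`, `self_parallelism`, and `speedup_upper_bound` are illustrative assumptions, and real hardware constraints such as synchronization or memory overheads are ignored.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Region:
    """Hypothetical program region (e.g. a loop or function) in a profile."""
    name: str
    self_work: float          # serial time in this region, excluding children
    self_parallelism: float   # assumed parallelism localized to this region (>= 1)
    children: List["Region"] = field(default_factory=list)


def serial_time(region: Region) -> float:
    """Total serial time of a region, including all nested regions."""
    return region.self_work + sum(serial_time(c) for c in region.children)


def parallel_time(region: Region, cores: int) -> float:
    """Optimistic parallel time under the toy model.

    A region's own work is sped up by at most min(self_parallelism, cores);
    nested regions are estimated recursively and their times are summed.
    """
    usable = max(1.0, min(region.self_parallelism, float(cores)))
    own = region.self_work / usable
    return own + sum(parallel_time(c, cores) for c in region.children)


def speedup_upper_bound(root: Region, cores: int) -> float:
    """Approximate upper bound on whole-program speedup for a core count."""
    return serial_time(root) / parallel_time(root, cores)


# Illustrative profile: 10 units of serial setup plus a loop with 90 units
# of work and an assumed localized parallelism of 16.
loop = Region("hot_loop", self_work=90.0, self_parallelism=16.0)
program = Region("main", self_work=10.0, self_parallelism=1.0, children=[loop])

for p in (4, 32):
    print(p, round(speedup_upper_bound(program, p), 3))
```

In this toy profile, moving from 4 to 32 cores raises the bound only until the loop's parallelism of 16 is exhausted, after which the serial portion dominates. This is the kind of fundamental limit the abstract refers to.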
