Predicting Potential Speedup of Serial Code via Lightweight Profiling and Emulations with Memory Performance Model

We present Parallel Prophet, which projects potential parallel speedup from an annotated serial program before actual parallelization. Programmers want to see how much speedup could be obtained prior to investing time and effort to write parallel code. With Parallel Prophet, programmers simply insert annotations that describe the parallel behavior of the serial program. Parallel Prophet then uses lightweight interval profiling and dynamic emulations to predict the potential performance benefit. Parallel Prophet models many realistic features of parallel programs that are hard to model with previous approaches: unbalanced workloads, multiple critical sections, nested and recursive parallelism, and specific thread schedulings and paradigms. Furthermore, Parallel Prophet predicts speedup saturation resulting from memory and caches by monitoring cache hit ratio and bandwidth consumption in a serial program. We achieve very small runtime overhead (approximately a 1.2-10 times slowdown) and moderate memory consumption. We demonstrate the effectiveness of Parallel Prophet on eight benchmarks from the OmpSCR and NAS Parallel benchmark suites by comparing our predictions with actual parallelized code. Our simple memory model also identifies performance limitations resulting from memory system contention.
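To illustrate the general idea of emulating a parallel schedule from serially profiled interval lengths, the following minimal sketch estimates the speedup of a single annotated loop under two scheduling policies. It is not Parallel Prophet's algorithm: the function name `emulate_parallel_loop`, its parameters, and the two policies are assumptions made for illustration, and the sketch ignores critical sections, nesting, scheduling overhead, and memory contention.

```python
import heapq


def emulate_parallel_loop(iter_costs, n_threads, schedule="static"):
    """Estimate loop speedup from per-iteration serial costs (hypothetical helper).

    iter_costs: measured serial duration of each loop iteration.
    n_threads:  number of emulated worker threads (>= 1).
    schedule:   "static" (contiguous blocks) or "dynamic" (next free thread).
    """
    if n_threads < 1:
        raise ValueError("n_threads must be at least 1")
    if not iter_costs:
        return 1.0  # nothing to run, so no speedup to report

    serial_time = sum(iter_costs)

    if schedule == "static":
        # Split iterations into contiguous blocks, one block per thread.
        n = len(iter_costs)
        chunk = -(-n // n_threads)  # ceiling division
        loads = [sum(iter_costs[i:i + chunk]) for i in range(0, n, chunk)]
        makespan = max(loads)
    elif schedule == "dynamic":
        # Hand each iteration to whichever thread becomes free first.
        finish_times = [0.0] * n_threads
        heapq.heapify(finish_times)
        for cost in iter_costs:
            earliest = heapq.heappop(finish_times)
            heapq.heappush(finish_times, earliest + cost)
        makespan = max(finish_times)
    else:
        raise ValueError(f"unknown schedule: {schedule!r}")

    return serial_time / makespan


if __name__ == "__main__":
    costs = [4, 4, 1, 1, 1, 1]  # unbalanced workload: two heavy iterations first
    print(emulate_parallel_loop(costs, 2, "static"))
    print(emulate_parallel_loop(costs, 2, "dynamic"))
```

With this unbalanced workload, the static policy places both heavy iterations on the same thread, so its estimated speedup falls well below the thread count, while the dynamic policy balances the load. This is the kind of effect that a model assuming evenly divided work cannot capture.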
