Analysis of PARSEC workload scalability

PARSEC is a widely used benchmark suite designed to facilitate the study of chip multiprocessors (CMPs). It comprises 13 parallel applications, each with an input set intended for native execution as well as three reduced-size input sets intended for simulation. Each benchmark also demarcates a Region of Interest (ROI) that delimits the parallel code in the application. The PARSEC developers state that only the ROI should be modeled when the simulation inputs are used; otherwise, the native input set should be used to obtain results representative of full-program execution. We analyzed the runtime scalability of PARSEC on real multiprocessor systems and present our results in this paper. For each benchmark, we measured the runtime scalability of both the ROI and the full execution for all input sets. For 6 of the benchmarks, the scalability of the ROI matches that of the full program regardless of the input set used. For the remaining 7 benchmarks, the scalability of the ROI diverges significantly from that of the full program for at least some of the input sets. Three of these benchmarks scale much worse over the full program than over the ROI, even with the native input set. Finally, for most benchmarks, the runtime scalability of the simulation inputs differs significantly from that of the native input set, for both the ROI and the full program.
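The comparison above rests on a simple notion of runtime scalability: the speedup of a region at a given thread count relative to its single-threaded runtime, computed separately for the ROI and for the whole program. A minimal sketch of that calculation is shown below; the timing values are made up for illustration and are not measurements from the study.

```python
# Hedged sketch: comparing ROI vs. full-program scalability.
# Speedup at t threads = runtime at 1 thread / runtime at t threads.

def speedup(times_by_threads):
    """Map {thread_count: runtime_seconds} -> {thread_count: speedup}."""
    base = times_by_threads[1]  # single-threaded baseline
    return {t: base / elapsed for t, elapsed in sorted(times_by_threads.items())}

# Illustrative (invented) timings for one hypothetical benchmark, in seconds.
roi_times  = {1: 100.0, 2: 52.0, 4: 28.0, 8: 16.0}   # ROI only
full_times = {1: 120.0, 2: 70.0, 4: 46.0, 8: 34.0}   # whole program

roi_speedup  = speedup(roi_times)
full_speedup = speedup(full_times)

# Here the ROI scales noticeably better than the full program, because the
# serial setup and teardown outside the ROI cap the overall speedup -- the
# kind of divergence the study reports for 7 of the 13 benchmarks.
for t in sorted(roi_times):
    print(t, round(roi_speedup[t], 2), round(full_speedup[t], 2))
```

With these invented numbers the ROI reaches a speedup of 6.25 at 8 threads while the full program reaches only about 3.5, which is why measuring the ROI alone can overstate how well a benchmark scales end to end.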
