Performance scalability and dynamic behavior of Parsec benchmarks on many-core processors

The Parsec benchmark suite is widely used to evaluate parallel architectures, both existing and novel, the latter through simulation. In particular, it is used to evaluate highly parallel architectures. It is well known that parallelism bottlenecks occur both in the architecture (e.g., shared-resource contention) and in the algorithm (e.g., data dependencies). In this paper we study the latter, i.e., the inherent parallelism scalability and the dynamic behavior of the benchmark programs themselves, independently of the architecture. To this end, we present a new simulator that performs efficient, functionally accurate simulation of a hypothetical ideal parallel architecture with no parallelism bottlenecks, where any measured parallelism limitation is necessarily due to the benchmark itself. By applying this methodology to a continuum of simulated machines, ranging from a few processors to thousands of processors, we characterize the dynamic behavior and scalability of the different benchmarks. We find that only a quarter of the Parsec benchmarks truly scale well to hundreds of processors. Moreover, somewhat surprisingly, we find that Amdahl effects are responsible for the lack of scaling in only about half of the non-scalable benchmarks. The rest are limited by their inability to produce sufficient work for all cores.
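The two scaling limits named in the abstract can be contrasted with a small numerical sketch. It is illustrative only and is not the paper's simulator: the 5% serial fraction, the 64 concurrently available tasks, and both function names are assumptions chosen for the example. The first function is the standard Amdahl's law formula; the second models a program with no serial section that exposes only a fixed number of equal-sized independent tasks at any time.

```python
def amdahl_speedup(serial_fraction: float, cores: int) -> float:
    """Amdahl's law: speedup when a fixed fraction of the work is serial."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)


def work_limited_speedup(available_tasks: int, cores: int) -> float:
    """Speedup of a fully parallel program that exposes only
    `available_tasks` equal-sized independent tasks at a time
    (hypothetical model: cores beyond that number stay idle)."""
    return float(min(available_tasks, cores))


# Assumed, illustrative parameters (not taken from the paper).
SERIAL_FRACTION = 0.05
AVAILABLE_TASKS = 64

for cores in (4, 16, 64, 256, 1024):
    print(
        f"{cores:5d} cores: "
        f"Amdahl-limited {amdahl_speedup(SERIAL_FRACTION, cores):6.2f}x, "
        f"work-limited {work_limited_speedup(AVAILABLE_TASKS, cores):6.2f}x"
    )
```

Under these assumptions both curves flatten as cores are added, but for different reasons: the Amdahl-limited speedup approaches 1 / 0.05 = 20 asymptotically, while the work-limited speedup tracks the core count exactly and then stops abruptly at 64.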
