Speedup stacks: Identifying scaling bottlenecks in multi-threaded applications

Multi-threaded workloads typically show sublinear speedup on multi-core hardware, i.e., the achieved speedup is not proportional to the number of cores and threads. Sublinear scaling may have multiple causes, such as poorly scalable synchronization leading to spinning and/or yielding, and interference in shared resources such as the last-level cache (LLC) as well as the main memory subsystem. It is vital for programmers and processor designers to understand scaling bottlenecks in existing and emerging workloads in order to optimize application performance and design future hardware. In this paper, we propose the speedup stack, which quantifies the impact of the various scaling delimiters on multi-threaded application speedup in a single stack. We describe a mechanism for computing speedup stacks on a multi-core processor, and we find speedup stacks to be accurate within 5.1% on average for sixteen-threaded applications. We present several use cases: we discuss how speedup stacks can be used to identify scaling bottlenecks, classify benchmarks, optimize performance, and understand LLC performance.

[1]  James E. Smith,et al.  Advanced Micro Devices , 2005 .

[2]  Wenguang Chen,et al.  Cache Sharing Management for Performance Fairness in Chip Multiprocessors , 2009, 2009 18th International Conference on Parallel Architectures and Compilation Techniques.

[3]  Somayeh Sardashti,et al.  The gem5 simulator , 2011, CARN.

[4]  Tong Li,et al.  Spin detection hardware for improved management of multithreaded systems , 2006, IEEE Transactions on Parallel and Distributed Systems.

[5]  Rajiv Gupta,et al.  Dynamic recognition of synchronization operations for improved data race detection , 2008, ISSTA '08.

[6]  Kevin Skadron,et al.  Rodinia: A benchmark suite for heterogeneous computing , 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).

[7]  James E. Smith,et al.  A performance counter architecture for computing accurate CPI components , 2006, ASPLOS XII.

[8]  O. Mutlu,et al.  Fairness via source throttling: a configurable and high-performance fairness substrate for multi-core memory systems , 2010, ASPLOS XV.

[9]  Tao Li,et al.  Informed Microarchitecture Design Space Exploration Using Workload Dynamics , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[10]  Stijn Eyerman,et al.  Per-thread cycle accounting in multicore processors , 2013, TACO.

[11]  Francisco J. Cazorla,et al.  ITCA: Inter-task Conflict-Aware CPU Accounting for CMPs , 2009, 2009 18th International Conference on Parallel Architectures and Compilation Techniques.

[12]  Kai Li,et al.  The PARSEC benchmark suite: Characterization and architectural implications , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[13]  Anoop Gupta,et al.  The SPLASH-2 programs: characterization and methodological considerations , 1995, ISCA.

[14]  Onur Mutlu,et al.  Fairness via source throttling: a configurable and high-performance fairness substrate for multi-core memory systems , 2010, ASPLOS 2010.

[15]  Onur Mutlu,et al.  Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[16]  Margaret Martonosi,et al.  Thread criticality predictors for dynamic performance, power, and resource management in chip multiprocessors , 2009, ISCA '09.

[17]  Stijn Eyerman,et al.  Per-thread cycle accounting in SMT processors , 2009, ASPLOS.

[18]  Philip G. Emma,et al.  Understanding some simple processor-performance limits , 1997, IBM J. Res. Dev..