Multithreaded Instruction Sharing

We show that when multi-threaded benchmarks are executed on a Chip Multiprocessor (CMP), the threads typically execute identical instructions at nearly the same time. When multiple threads execute identical instructions (same PC, same source operands, and same source values) at nearly the same time, the computation can be performed by one thread and the result shared with the other threads, saving critical execution resources and bandwidth for other instructions. We study these thread properties and evaluate a hardware implementation that recognizes and exploits instruction similarity. In our experiments, we find that for one thread of a multi-threaded benchmark, about 20% of instructions are identical to nearby instructions in other running threads. Evaluated on a high-throughput, in-order, simultaneous multithreaded (SMT) architecture, our proposed sharing techniques achieve a 10% mean (32% peak) increase in processor throughput. As with most performance-enhancing techniques, the performance is achieved at the expense of additional power. With Instruction Sharing, per-core power increases by about 12%, which is due in part to hardware modifications but also to higher utilization of existing hardware. The core area increases by approximately 15%, of which 13% is due to increasing the number of core contexts; the remaining 2% is due to banking the register file, adding match logic, and adding small associative match tables.
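The matching condition described above (same PC and same source values, seen by a different thread a short time earlier) can be illustrated with a small software model. The sketch below is not the hardware design evaluated in the paper; the `MatchTable` class, its `issue` method, the LRU replacement policy, and the default capacity are all assumptions made for illustration only.

```python
from collections import OrderedDict


class MatchTable:
    """Hypothetical model of a small associative match table.

    Entries are keyed by (PC, source values). Because the PC fixes the
    instruction, an equal PC implies equal source operands as well.
    """

    def __init__(self, capacity=8):
        self.capacity = capacity      # assumed size; not taken from the paper
        self.entries = OrderedDict()  # key -> (producer thread id, result)
        self.executed = 0             # instructions that used an execution unit
        self.shared = 0               # instructions satisfied by another thread

    def issue(self, thread_id, pc, src_values, compute):
        """Return (result, was_shared) for one dynamic instruction."""
        key = (pc, tuple(src_values))
        hit = self.entries.get(key)
        if hit is not None and hit[0] != thread_id:
            # Another thread recently ran the identical instruction: reuse it.
            self.shared += 1
            self.entries.move_to_end(key)
            return hit[1], True

        # No match from another thread: execute and record the result.
        result = compute(*src_values)
        self.executed += 1
        self.entries[key] = (thread_id, result)
        self.entries.move_to_end(key)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used (assumed policy)
        return result, False


def add(a, b):
    return a + b


table = MatchTable(capacity=4)
r0 = table.issue(0, 0x400, (3, 4), add)  # thread 0 executes the add
r1 = table.issue(1, 0x400, (3, 4), add)  # thread 1 matches and reuses the result
r2 = table.issue(1, 0x400, (3, 5), add)  # same PC, different values: no match
```

In this model only the first and third instructions consume an execution unit; the second is satisfied from the table. The limited capacity stands in for the "nearly the same time" requirement, since an entry that is not matched soon is evicted.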
