Improving the performance and power efficiency of shared helpers in CMPs

Technology scaling trends have forced designers to consider alternatives to deeply pipelined, aggressive cores with large amounts of performance-accelerating hardware. One alternative is a small, simple core that can be augmented with latency-tolerant helpers. As the demands placed on the processor core vary between applications, and even between phases of an application, the benefit seen from any set of helpers will vary tremendously. If there is a single core, these auxiliary structures can be turned on and off dynamically to tune the energy/performance of the machine to the needs of the running application. As more of the processor is broken down into helpers, and additional cores that can potentially share helpers are added to a single chip, the decisions made about these structures become increasingly important. In this paper we describe the need for methods that effectively manage these helpers. Our counter-based approach can dynamically turn off three helpers on average while staying within 2% of the performance achieved when running with all helpers. In a multicore environment, our intelligent and flexible sharing of helpers provides an average 24% speedup compared to static sharing in conjoined cores. Furthermore, we show a benefit from constructively sharing helpers among multiple cores running the same application.
