Accelerating Sequential Applications on CMPs Using Core Spilling

Chip multiprocessors (CMPs) provide a scalable means of exploiting thread-level parallelism for multitasking or multithreaded applications. However, single-threaded applications will have difficulty dynamically leveraging the statically partitioned resources in a CMP. Such sequential applications may be difficult to statically decompose into threads or may simply be-a legacy code where recompilation is not possible or cost-effective. We present a novel approach to dynamically accelerate the performance of sequential application(s) on multiple cores. Execution is allowed to spill from one core to another when resources on one core have been exhausted. We propose two techniques to enable low-overhead migration between cores: prespilling and locality-based filtering. We develop and analyze an arbitration mechanism to intelligently allocate cores among a set of sequential applications on a CMP. On average, core spilling on an eight-core CMP can accelerate single-threaded performance by 35 percent. We further explore an eight-core CMP running a multiple application workload composed of the entire SPEC 2000 benchmark suite in various combinations and arrival times. Using core spilling to accelerate the current set of running applications in cases where there are idle cores, we achieve up to a 40 percent improvement in performance.

[1]  Haitham Akkary,et al.  Continual flow pipelines , 2004, ASPLOS XI.

[2]  James E. Smith,et al.  Complexity-Effective Superscalar Processors , 1997, Conference Proceedings. The 24th Annual International Symposium on Computer Architecture.

[3]  Gurindar S. Sohi,et al.  Register traffic analysis for streamlining inter-operation communication in fine-grain parallel processors , 1992, MICRO.

[4]  Onur Mutlu,et al.  Runahead execution: an alternative to very large instruction windows for out-of-order processors , 2003, The Ninth International Symposium on High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings..

[5]  Haitham Akkary,et al.  Checkpoint processing and recovery: towards scalable large instruction window processors , 2003, Proceedings. 36th Annual IEEE/ACM International Symposium on Microarchitecture, 2003. MICRO-36..

[6]  Richard E. Kessler,et al.  The Alpha 21264 microprocessor architecture , 1998, Proceedings International Conference on Computer Design. VLSI in Computers and Processors (Cat. No.98CB36273).

[7]  Vikas Agarwal,et al.  Clock rate versus IPC: the end of the road for conventional microarchitectures , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[8]  Norman P. Jouppi,et al.  The multicluster architecture: reducing cycle time through partitioning , 1997, Proceedings of 30th Annual International Symposium on Microarchitecture.

[9]  Margaret Martonosi,et al.  Multipath execution: opportunities and limits , 1998, ICS '98.

[10]  Saman P. Amarasinghe Multicores from the Compiler's Perspective: A Blessing or a Curse? , 2005, CGO.

[11]  David I. August,et al.  Microarchitectural exploration with Liberty , 2002, MICRO 35.

[12]  Jaehyuk Huh,et al.  Exploring the design space of future CMPs , 2001, Proceedings 2001 International Conference on Parallel Architectures and Compilation Techniques.

[13]  M. Bohr Interconnect scaling-the real limiter to high performance ULSI , 1995, Proceedings of International Electron Devices Meeting.

[14]  Balaram Sinharoy,et al.  IBM Power5 chip: a dual-core multithreaded processor , 2004, IEEE Micro.

[15]  Haitham Akkary,et al.  Checkpoint Processing and Recovery: Towards Scalable Large Instruction Window Processors , 2003, MICRO.

[16]  Luiz André Barroso,et al.  Piranha: a scalable architecture based on single-chip multiprocessing , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[17]  Brad Calder,et al.  Threaded multiple path execution , 1998, Proceedings. 25th Annual International Symposium on Computer Architecture (Cat. No.98CB36235).

[18]  Lixin Zhang,et al.  Adaptive mechanisms and policies for managing cache hierarchies in chip multiprocessors , 2005, 32nd International Symposium on Computer Architecture (ISCA'05).

[19]  Christopher Hughes,et al.  Speculative precomputation: long-range prefetching of delinquent loads , 2001, ISCA 2001.

[20]  Craig Zilles,et al.  Execution-based prediction using speculative slices , 2001, ISCA 2001.

[21]  Krste Asanovic,et al.  Reducing power density through activity migration , 2003, ISLPED '03.

[22]  John Paul Shen,et al.  Dynamic speculative precomputation , 2001, MICRO.

[23]  Todd M. Austin,et al.  The SimpleScalar tool set, version 2.0 , 1997, CARN.

[24]  Gurindar S. Sohi,et al.  Master/Slave Speculative Parallelization , 2002, 35th Annual IEEE/ACM International Symposium on Microarchitecture, 2002. (MICRO-35). Proceedings..

[25]  Kunle Olukotun,et al.  The case for a single-chip multiprocessor , 1996, ASPLOS VII.

[26]  David J. Sager,et al.  The microarchitecture of the Pentium 4 processor , 2001 .

[27]  Dean M. Tullsen,et al.  Multithreaded value prediction , 2005, 11th International Symposium on High-Performance Computer Architecture.

[28]  Mateo Valero,et al.  Toward kilo-instruction processors , 2004, TACO.

[29]  Norman P. Jouppi,et al.  Single-ISA Heterogeneous Multi-Core Architectures: The Potential for Processor Power Reduction , 2003, MICRO.

[30]  Haitham Akkary,et al.  Continual flow pipelines: achieving resource-efficient latency tolerance , 2004, IEEE Micro.

[31]  Yale N. Patt,et al.  Select-free instruction scheduling logic , 2001, MICRO.

[32]  Rajeev Balasubramonian,et al.  Dynamically allocating processor resources between nearby and distant ILP , 2001, ISCA 2001.

[33]  Mateo Valero,et al.  Delaying physical register allocation through virtual-physical registers , 1999, MICRO-32. Proceedings of the 32nd Annual ACM/IEEE International Symposium on Microarchitecture.

[34]  Gurindar S. Sohi,et al.  Speculative data-driven multithreading , 2001, Proceedings HPCA Seventh International Symposium on High-Performance Computer Architecture.

[35]  Gurindar S. Sohi,et al.  Multiscalar processors , 1995, Proceedings 22nd Annual International Symposium on Computer Architecture.

[36]  Dean M. Tullsen,et al.  Interconnections in multi-core architectures: understanding mechanisms, overheads and scaling , 2005, 32nd International Symposium on Computer Architecture (ISCA'05).

[37]  Dean M. Tullsen,et al.  Simultaneous multithreading: Maximizing on-chip parallelism , 1995, Proceedings 22nd Annual International Symposium on Computer Architecture.

[38]  Kunle Olukotun,et al.  Niagara: a 32-way multithreaded Sparc processor , 2005, IEEE Micro.

[39]  Eric Rotenberg,et al.  Slipstream execution mode for CMP-based multiprocessors , 2003, The Ninth International Symposium on High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings..

[40]  Brad Calder,et al.  Phase tracking and prediction , 2003, ISCA '03.

[41]  Todd M. Austin,et al.  Cyclone: a broadcast-free dynamic instruction scheduler with selective replay , 2003, ISCA '03.

[42]  Kunle Olukotun,et al.  A Single-Chip Multiprocessor , 1997, Computer.

[43]  Yunheung Paek,et al.  Parallel Programming with Polaris , 1996, Computer.

[44]  Eric Rotenberg,et al.  A large, fast instruction window for tolerating cache misses , 2002, ISCA.

[45]  José E. Moreira,et al.  Evaluation of a multithreaded architecture for cellular computing , 2002, Proceedings Eighth International Symposium on High Performance Computer Architecture.

[46]  T. N. Vijaykumar,et al.  Heat-and-run: leveraging SMT and CMP to manage power density through the operating system , 2004, ASPLOS XI.

[47]  Dean M. Tullsen,et al.  Clustered multithreaded architectures - pursuing both IPC and cycle time , 2004, 18th International Parallel and Distributed Processing Symposium, 2004. Proceedings..