Speculative Precomputation on Chip Multiprocessors

Previous work on speculative precomputation (SP) on simultaneous multithreaded (SMT) architectures has shown significant benefits. SP techniques improve single-threaded program performance by using otherwise idle thread contexts to run “helper threads”, which prefetch critical data into shared caches and reduce the time the “main thread” stalls waiting for outstanding long-latency loads. This technique effectively exploits the parallel thread contexts and the data cache sharing at all levels of the memory hierarchy that SMT provides. Chip multiprocessor (CMP) architectures also feature parallel thread contexts, but do not share caches near execution resources. In this paper, we first investigate SP on a basic CMP and show that while existing SP techniques can improve the performance of single-threaded applications on such CMP architectures, they fall short of the benefits provided on SMT architectures because of the reduced degree of cache sharing. We then propose and evaluate several simple enhancements to the basic CMP architecture, which can increase the speedup from using SP by an additional 10 to 12%.
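The effect of cache sharing on SP can be illustrated with a toy model. The sketch below is not the simulator or methodology used in the paper. It assumes hypothetical latencies (`L1_HIT`, `L2_HIT`, `MEM`), unbounded caches, and a helper thread that always finishes its prefetch before the main thread issues the matching load. The function name `main_thread_load_cycles` and the mode names are likewise invented for illustration.

```python
# Toy model only. All latencies are illustrative assumptions,
# not measurements or parameters taken from the paper.
L1_HIT = 2     # hypothetical cycles for an L1 hit
L2_HIT = 14    # hypothetical cycles for a shared-L2 hit
MEM = 230      # hypothetical cycles for a miss to main memory


def main_thread_load_cycles(addresses, mode):
    """Cycles the main thread spends on the given loads.

    mode "none": no helper thread runs.
    mode "smt":  the helper shares the L1 with the main thread,
                 so its prefetches land in the main thread's L1.
    mode "cmp":  the helper runs on another core with a private L1,
                 so its prefetches reach only the shared L2.
    """
    if mode not in ("none", "smt", "cmp"):
        raise ValueError("unknown mode: " + mode)

    main_l1 = set()    # lines visible in the main thread's L1
    shared_l2 = set()  # lines visible in the shared L2

    # Helper pass: assumed to run far enough ahead to be timely.
    if mode != "none":
        for addr in addresses:
            shared_l2.add(addr)
            if mode == "smt":
                main_l1.add(addr)

    # Main thread pass: pay the latency of the closest level holding the line.
    cycles = 0
    for addr in addresses:
        if addr in main_l1:
            cycles += L1_HIT
        elif addr in shared_l2:
            cycles += L2_HIT
        else:
            cycles += MEM
            shared_l2.add(addr)
        main_l1.add(addr)
    return cycles


if __name__ == "__main__":
    loads = [i * 64 for i in range(100)]  # 100 distinct cache lines
    for mode in ("none", "smt", "cmp"):
        print(mode, main_thread_load_cycles(loads, mode))
```

Under these assumptions the main thread stalls least when the helper shares its L1, somewhat more when only the L2 is shared, and most with no helper at all. This mirrors the qualitative ordering described above, but the magnitudes depend entirely on the assumed latencies.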
