On the importance of optimizing the configuration of stream prefetchers

This paper provides a detailed analysis of how the parameters of hardware prefetchers affect the memory system performance. In particular, we found the configuration of the frequently used stream prefetcher to have a major impact on the runtime, making parameter optimizations imperative when comparing a stream prefetcher with other prefetching techniques. For example, we show that adjusting the prefetch distance to the optimal value can increase the average speedup over the SPECcpu2000 benchmark suite from 40% to 70%. Moreover, our investigation of the performance of runahead prefetching relative to stream prefetching shows that choosing a non-optimal stream prefetcher as a baseline can distort the results by as much as a factor of two.

[1]  Yonghong Song,et al.  Processor Aware Anticipatory Prefetching in Loops , 2004, 10th International Symposium on High Performance Computer Architecture (HPCA'04).

[2]  Onur Mutlu,et al.  Runahead execution: an alternative to very large instruction windows for out-of-order processors , 2003, The Ninth International Symposium on High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings..

[3]  Brad Calder,et al.  Predictor-directed stream buffers , 2000, MICRO 33.

[4]  Anoop Gupta,et al.  Design and evaluation of a compiler algorithm for prefetching , 1992, ASPLOS V.

[5]  Norman P. Jouppi,et al.  How useful are non-blocking loads, stream buffers and speculative execution in multiple issue processors? , 1995, Proceedings of 1995 1st IEEE Symposium on High Performance Computer Architecture.

[6]  Trevor Mudge,et al.  Improving data cache performance by pre-executing instructions under a cache miss , 1997 .

[7]  Gary S. Tyson,et al.  A prefetch taxonomy , 2004, IEEE Transactions on Computers.

[8]  Janak H. Patel,et al.  Stride directed prefetching in scalar processors , 1992, MICRO 1992.

[9]  Brad Calder,et al.  Automatically characterizing large scale program behavior , 2002, ASPLOS X.

[10]  Stamatis Vassiliadis,et al.  A load-instruction unit for pipelined processors , 1993, IBM J. Res. Dev..

[11]  Todd M. Austin,et al.  MASE: a novel infrastructure for detailed microarchitectural modeling , 2001, 2001 IEEE International Symposium on Performance Analysis of Systems and Software. ISPASS..

[12]  Sally A. McKee,et al.  Hardware-only stream prefetching and dynamic access ordering , 2000, ICS '00.

[13]  P. Chow,et al.  Memory-system Design Considerations For Dynamically-scheduled Processors , 1997, Conference Proceedings. The 24th Annual International Symposium on Computer Architecture.

[14]  Santosh G. Abraham,et al.  Effective stream-based and execution-based data prefetching , 2004, ICS '04.

[15]  Richard E. Kessler,et al.  Evaluating stream buffers as a secondary cache replacement , 1994, Proceedings of 21 International Symposium on Computer Architecture.

[16]  Norman P. Jouppi,et al.  Improving direct-mapped cache performance by the addition of a small fully-associative cache and pre , 1990, ISCA 1990.

[17]  Balaram Sinharoy,et al.  POWER4 system microarchitecture , 2002, IBM J. Res. Dev..

[18]  Michel Dubois,et al.  Sequential Hardware Prefetching in Shared-Memory Multiprocessors , 1995, IEEE Trans. Parallel Distributed Syst..

[19]  José F. Martínez,et al.  Checkpointed early load retirement , 2005, 11th International Symposium on High-Performance Computer Architecture.