Low-Cost Epoch-Based Correlation Prefetching for Commercial Applications

The performance of many important commercial workloads, such as on-line transaction processing, is limited by the frequent stalls due to off-chip instruction and data accesses. These applications are characterized by irregular control flow and complex data access patterns that render many low-cost prefetching schemes, such as stream-based and stride-based prefetching, ineffective. For such applications, correlation-based prefetching, which is capable of capturing complex data access patterns, has been shown to be a more promising approach. However, the large instruction and data working sets of these applications require extremely large correlation tables, making these tables impractical to be implemented on-chip. This paper proposes the epoch-based correlation prefetcher, which cost-effectively stores its correlation table in main memory and exploits the concept of epochs to hide the long latency of its correlation table access, and which attempts to eliminate entire epochs instead of individual instruction and data misses. Experimental results demonstrate that the epoch-based correlation prefetches which requires minimal on-chip real estate to implement, improves the performance of a suite of important commercial benchmarks by 13% to 31% and significantly outperforms previously proposed correlation prefetchers.

[1]  Irith Pomeranz,et al.  Transient-fault recovery for chip multiprocessors , 2003, 30th Annual International Symposium on Computer Architecture, 2003. Proceedings..

[2]  Wei-Fen Lin,et al.  Reducing DRAM latencies with an integrated memory hierarchy design , 2001, Proceedings HPCA Seventh International Symposium on High-Performance Computer Architecture.

[3]  Sanjay J. Patel,et al.  Characterizing the effects of transient faults on a high-performance processor pipeline , 2004, International Conference on Dependable Systems and Networks, 2004.

[4]  Todd M. Austin,et al.  DIVA: a reliable substrate for deep submicron microarchitecture design , 1999, MICRO-32. Proceedings of the 32nd Annual ACM/IEEE International Symposium on Microarchitecture.

[5]  Brian Fahs,et al.  Microarchitecture optimizations for exploiting memory-level parallelism , 2004, Proceedings. 31st Annual International Symposium on Computer Architecture, 2004..

[6]  Santosh G. Abraham,et al.  Effective instruction prefetching in chip multiprocessors for modern commercial applications , 2005, 11th International Symposium on High-Performance Computer Architecture.

[7]  Richard E. Kessler,et al.  Evaluating stream buffers as a secondary cache replacement , 1994, Proceedings of 21 International Symposium on Computer Architecture.

[8]  Haitham Akkary,et al.  Checkpoint processing and recovery: towards scalable large instruction window processors , 2003, Proceedings. 36th Annual IEEE/ACM International Symposium on Microarchitecture, 2003. MICRO-36..

[9]  Thomas F. Wenisch,et al.  Temporal streaming of shared memory , 2005, 32nd International Symposium on Computer Architecture (ISCA'05).

[10]  David A. Patterson,et al.  Performance characterization of a Quad Pentium Pro SMP using OLTP workloads , 1998, ISCA.

[11]  John P. Hayes,et al.  High-level design verification of microprocessors via error modeling , 1998, TODE.

[12]  Martin Burtscher,et al.  Future execution: A prefetching mechanism that uses multiple cores to speed up single threads , 2006, TACO.

[13]  Kathryn S. McKinley,et al.  Guided region prefetching: a cooperative hardware/software approach , 2003, ISCA '03.

[14]  James E. Smith,et al.  A Performance Study of Instruction Cache Prefetching Methods , 1998, IEEE Trans. Computers.

[15]  Sule Ozev,et al.  A mechanism for online diagnosis of hard faults in microprocessors , 2005, 38th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'05).

[16]  Jignesh M. Patel,et al.  Call graph prefetching for database applications , 2003, TOCS.

[17]  Brad Calder,et al.  Predictor-directed stream buffers , 2000, MICRO 33.

[18]  Santosh G. Abraham,et al.  Accurate modeling of aggressive speculation in modern microprocessor architectures , 2005, 13th IEEE International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems.

[19]  James E. Smith,et al.  Data Cache Prefetching Using a Global History Buffer , 2004, 10th International Symposium on High Performance Computer Architecture (HPCA'04).

[20]  Olivier Temam,et al.  MicroLib: A Case for the Quantitative Comparison of Micro-Architecture Mechanisms , 2004, 37th International Symposium on Microarchitecture (MICRO-37'04).

[21]  Janak H. Patel,et al.  Stride directed prefetching in scalar processors , 1992, MICRO.

[22]  Andrew F. Glew MLP yes! ILP no , 1998, ASPLOS 1998.

[23]  Thomas Alexander,et al.  Distributed prefetch-buffer/cache design for high performance memory systems , 1996, Proceedings. Second International Symposium on High-Performance Computer Architecture.

[24]  C. Bazeghi,et al.  /spl mu/Complexity: estimating processor design effort , 2005, 38th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'05).

[25]  J.F. Martinez,et al.  Cherry: Checkpointed early resource recycling in out-of-order microprocessors , 2002, 35th Annual IEEE/ACM International Symposium on Microarchitecture, 2002. (MICRO-35). Proceedings..

[26]  Josep Torrellas,et al.  Using a user-level memory thread for correlation prefetching , 2002, ISCA.

[27]  Richard L. Sites,et al.  Binary translation , 1993, CACM.

[28]  Norman P. Jouppi,et al.  Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers , 1990, [1990] Proceedings. The 17th Annual International Symposium on Computer Architecture.

[29]  Pedro J. Gil,et al.  Fault Injection into VHDL Models: Experimental Validation of a Fault Tolerant Microcomputer System , 1999, EDCC.

[30]  Huiyang Zhou,et al.  Dual-core execution: building a highly scalable single-thread instruction window , 2005, 14th International Conference on Parallel Architectures and Compilation Techniques (PACT'05).

[31]  Babak Falsafi,et al.  Last-Touch Correlated Data Streaming , 2007, 2007 IEEE International Symposium on Performance Analysis of Systems & Software.

[32]  T. N. Vijaykumar,et al.  Reducing Design Complexity of the Load/Store Queue , 2003, MICRO.

[33]  Santosh G. Abraham,et al.  Store memory-level parallelism optimizations for commercial applications , 2005, 38th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'05).

[34]  Susan J. Eggers,et al.  An analysis of database workload performance on simultaneous multithreaded processors , 1998, ISCA.

[35]  Sanjeev Kumar,et al.  Exploiting spatial locality in data caches using spatial footprints , 1998, ISCA.

[36]  John Paul Shen,et al.  Scaling and characterizing database workloads: bridging the gap between research and practice , 2003, Proceedings. 36th Annual IEEE/ACM International Symposium on Microarchitecture, 2003. MICRO-36..

[37]  Douglas J. Joseph,et al.  Prefetching Using Markov Predictors , 1997, Conference Proceedings. The 24th Annual International Symposium on Computer Architecture.

[38]  Todd M. Austin,et al.  A fault tolerant approach to microprocessor design , 2001, 2001 International Conference on Dependable Systems and Networks.

[39]  Eric Rotenberg,et al.  AR-SMT: a microarchitectural approach to fault tolerance in microprocessors , 1999, Digest of Papers. Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing (Cat. No.99CB36352).

[40]  Mary K. Vernon,et al.  Analytic evaluation of shared-memory systems with ILP processors , 1998, ISCA.

[41]  Thomas F. Wenisch,et al.  Spatial Memory Streaming , 2006, 33rd International Symposium on Computer Architecture (ISCA'06).

[42]  Shubhendu S. Mukherjee,et al.  Measuring Architectural Vulnerability Factors , 2003, IEEE Micro.

[43]  Alan Jay Smith,et al.  Sequential Program Prefetching in Memory Hierarchies , 1978, Computer.

[44]  Jean-Loup Baer,et al.  Effective Hardware Based Data Prefetching for High-Performance Processors , 1995, IEEE Trans. Computers.

[45]  Glenn Reinman,et al.  Fetch directed instruction prefetching , 1999, MICRO-32. Proceedings of the 32nd Annual ACM/IEEE International Symposium on Microarchitecture.

[46]  Babak Falsafi,et al.  Reunion: Complexity-Effective Multicore Redundancy , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[47]  Jean-Loup Baer,et al.  Dynamic Improvement of Locality in Virtual Memory Systems , 1976, IEEE Transactions on Software Engineering.

[48]  Luiz André Barroso,et al.  Memory system characterization of commercial workloads , 1998, ISCA.

[49]  Babak Falsafi,et al.  Dead-block prediction & dead-block correlating prefetchers , 2001, ISCA 2001.

[50]  Daniel A. Jiménez,et al.  Neural methods for dynamic branch prediction , 2002, TOCS.

[51]  Brad Calder,et al.  Automatically characterizing large scale program behavior , 2002, ASPLOS X.