Alternative implementations of hybrid branch predictors

Prefetching has been shown to be one of several effective approaches for tolerating large memory latencies. In this paper, we consider a prefetch engine called Hare, which handles prefetches at run time and is built in addition to data pipelining in the on-chip data cache of high-performance processors. Its key design feature is that it is programmable, so that software prefetching techniques can also be employed to exploit the benefits of prefetching. The engine always launches prefetches ahead of current execution, which is controlled by the program counter. We evaluate the proposed scheme by trace-driven simulation and take area and cycle time into account when assessing its cost-effectiveness. Our performance results show that the prefetch engine can significantly reduce the data access penalty with little prefetching overhead.
