Increasing the instruction fetch rate via multiple branch prediction and a branch address cache

"Increasing the Instruction Fetch Rate via Multiple Branch Prediction and a Branch Address Cache" was the first paper to propose a highly accurate hardware mechanism for predicting and fetching multiple non-contiguous basic blocks per cycle, building on the most aggressive branch predictors of the time. Prior to this paper, methods for increasing fetch bandwidth relied on software and compiler techniques that enlarged the basic blocks themselves. The publication of our paper inspired an explosion of research aimed at improving the accuracy of multiple branch prediction, reducing the complexity of fetching multiple basic blocks, or increasing fetch bandwidth in other ways.

Our HPS research group defied conventional wisdom in a 1991 ISCA paper [6], which demonstrated that instruction-level parallelism, measured in instructions per cycle (IPC), can be greater than two even for non-scientific workloads. This flew in the face of numerous "proofs" at the time. We showed that there was enough parallelism for an aggressive superscalar out-of-order execution engine to exploit and dramatically improve performance, provided that the penalties caused by incorrect branch predictions, cache misses, and TLB misses could be reduced.

[1] Yale N. Patt, et al. A Comparison of Dynamic Branch Predictors That Use Two Levels of Branch History, 1993, Proceedings of the 20th Annual International Symposium on Computer Architecture.

[2] Tse-Yu Yeh, et al. A Comprehensive Instruction Fetch Mechanism for a Processor Supporting Speculative Execution, 1992, Proceedings of the 25th Annual International Symposium on Microarchitecture (MICRO 25).

[3] Joseph T. Rahmeh, et al. Improving the Accuracy of Dynamic Branch Prediction Using Branch Correlation, 1992, ASPLOS V.

[4] Yale N. Patt, et al. Alternative Implementations of Two-Level Adaptive Branch Prediction, 1992, ISCA '92.

[5] Y. Patt, et al. Two-Level Adaptive Training Branch Prediction, 1991, MICRO 24.

[6] Y. Patt, et al. Single Instruction Stream Parallelism Is Greater Than Two, 1991, Proceedings of the 18th Annual International Symposium on Computer Architecture.

[7] B. R. Rau, et al. The Cydra 5 Departmental Supercomputer: Design Philosophies, Decisions and Trade-offs, 1989, Proceedings of the Twenty-Second Annual Hawaii International Conference on System Sciences, Volume 1: Architecture Track.

[8] Robert P. Colwell, et al. A VLIW Architecture for a Trace Scheduling Compiler, 1987, ASPLOS.

[9] Alan Jay Smith, et al. Branch Prediction Strategies and Branch Target Buffer Design, 1995, Computer.