Branch Prediction, Instruction-Window Size, and Cache Size: Performance Trade-Offs and Simulation Techniques

Design parameters interact in complex ways in modern processors, especially because out-of-order issue and decoupling buffers allow latencies to be overlapped. Trade-offs among instruction-window size, branch-prediction accuracy, and instruction- and data-cache size can change as these parameters move through different domains. For example, modeling unrealistic caches can under- or overstate the benefits of better prediction or a larger instruction window. Avoiding such pitfalls requires understanding how all these parameters interact. Because such methodological mistakes are common, this paper provides a comprehensive set of SimpleScalar simulation results from SPECint95 programs, showing the interactions among these major structures. In addition to presenting this database of simulation results, the paper describes the major mechanisms driving the observed trade-offs. It also considers appropriate simulation techniques when sampling full-length runs with the SPEC reference inputs. In particular, the results show that branch mispredictions limit the benefits of larger instruction windows, that better branch prediction and better instruction-cache behavior have synergistic effects, and that larger instruction windows and larger data caches trade off against each other because their benefits overlap. In addition, simulations of only 50 million instructions can yield representative results if these short simulation windows are carefully selected.
