Fetching instruction streams

Fetch performance is a very important factor because it effectively limits the overall processor performance. However there is little performance advantage in increasing front-end performance beyond what the back-end can consume. For each processor design, the target is to build the best possible fetch engine for the required performance level. A fetch engine will be better if it provides better performance, but also if it takes fewer resources, requires less chip area, or consumes less power. In this paper we propose a novel fetch architecture based on the execution of long streams of sequential instructions, taking maximum advantage of code layout optimizations. We describe our architecture in detail, and show that it requires less complexity and resources than other high performance fetch architectures like the trace cache, while providing a high fetch performance suitable for wide-issue superscalar processors. Our results show that using our fetch architecture and code layout optimizations obtains 10% higher performance than the EV8 fetch architecture, and 4% higher than the FTB architecture using state-of-the-art branch predictors, while being only 1.5% slower than the trace cache. Even in the absence of code layout optimizations, fetching instruction streams is still 10% faster than the EV8, and only 4% slower than the trace cache. Fetching instruction streams effectively exploits the special characteristics of layout optimized codes to provide a high fetch performance, close to that of a trace cache, but has a much lower cost and complexity, similar to that of a basic block architecture.

[1]  Mateo Valero,et al.  The effect of code reordering on branch prediction , 2000, Proceedings 2000 International Conference on Parallel Architectures and Compilation Techniques (Cat. No.PR00622).

[2]  Quinn Jacobson,et al.  Control flow speculation in multiscalar processors , 1997, Proceedings Third International Symposium on High-Performance Computer Architecture.

[3]  Dirk Grunwald,et al.  Reducing branch costs via branch alignment , 1994, ASPLOS VI.

[4]  Michael D. Smith,et al.  Procedure placement using temporal-ordering information , 1999, TOPL.

[5]  D. Grunwald,et al.  Fast & Accurate Instruction Fetch and Branch Prediction , 1994 .

[6]  David J. Sager,et al.  The microarchitecture of the Pentium 4 processor , 2001 .

[7]  Walid A. Najjar,et al.  Design of storage hierarchy in multithreaded architectures , 1995, MICRO 1995.

[8]  Mateo Valero,et al.  Trace cache redundancy: red and blue traces , 2000, Proceedings Sixth International Symposium on High-Performance Computer Architecture. HPCA-6 (Cat. No.PR00550).

[9]  Yale N. Patt,et al.  A comprehensive instruction fetch mechanism for a processor supporting speculative execution , 1992, MICRO 25.

[10]  James E. Smith,et al.  Path-based next trace prediction , 1997, Proceedings of 30th Annual International Symposium on Microarchitecture.

[11]  John C. Gyllenhaal,et al.  A hardware-driven profiling scheme for identifying program hot spots to support runtime optimization , 1999, ISCA.

[12]  W. W. Hwu,et al.  Achieving high instruction cache performance with an optimizing compiler , 1989, ISCA '89.

[13]  Daniel A. Jiménez,et al.  Dynamic branch prediction with perceptrons , 2001, Proceedings HPCA Seventh International Symposium on High-Performance Computer Architecture.

[14]  Dirk Grunwald,et al.  Fast and accurate instruction fetch and branch prediction , 1994, ISCA '94.

[15]  Pascal Sainrat,et al.  Multiple-block ahead branch predictors , 1996, ASPLOS VII.

[16]  Josep Torrellas,et al.  Optimizing instruction cache performance for operating system intensive workloads , 1995, Proceedings of 1995 1st IEEE Symposium on High Performance Computer Architecture.

[17]  Yale N. Patt,et al.  Putting the fill unit to work: dynamic optimizations for trace cache microprocessors , 1998, Proceedings. 31st Annual ACM/IEEE International Symposium on Microarchitecture.

[18]  Eric Rotenberg,et al.  A Trace Cache Microarchitecture and Evaluation , 1999, IEEE Trans. Computers.

[19]  Manoj Franklin,et al.  Control flow prediction with tree-like subgraphs for superscalar processors , 1995, MICRO 1995.

[20]  Yale N. Patt,et al.  Improving trace cache effectiveness with branch promotion and trace packing , 1998, ISCA.

[21]  Burzin A. Patel,et al.  Optimization of instruction fetch mechanisms for high issue rates , 1995, Proceedings 22nd Annual International Symposium on Computer Architecture.

[22]  Eric Rotenberg,et al.  Trace cache: a low latency approach to high bandwidth instruction fetching , 1996, Proceedings of the 29th Annual IEEE/ACM International Symposium on Microarchitecture. MICRO 29.

[23]  Norman Rubin,et al.  Spike: an optimizer for alpha/NT executables , 1997 .

[24]  Yale N. Patt,et al.  Increasing the instruction fetch rate via multiple branch prediction and a branch address cache , 1993, ICS '93.

[25]  Sanjay J. Patel,et al.  Increasing the size of atomic instruction blocks using control flow assertions , 2000, MICRO 33.

[26]  James R. Larus,et al.  Branch prediction for free , 1993, PLDI '93.

[27]  Glenn Reinman,et al.  A scalable front-end architecture for fast instruction delivery , 1999, ISCA.

[28]  Karl Pettis,et al.  Profile guided code positioning , 1990, PLDI '90.

[29]  Brad Calder,et al.  Basic block distribution analysis to find periodic behavior and simulation points in applications , 2001, Proceedings 2001 International Conference on Parallel Architectures and Compilation Techniques.

[30]  Dirk Grunwald,et al.  Next cache line and set prediction , 1995, Proceedings 22nd Annual International Symposium on Computer Architecture.

[31]  Brad Calder,et al.  Efficient procedure mapping using cache line coloring , 1997, PLDI '97.

[32]  Yale N. Patt,et al.  Alternative fetch and issue policies for the trace cache fetch mechanism , 1997, Proceedings of 30th Annual International Symposium on Microarchitecture.

[33]  Yiannakis Sazeides,et al.  Design tradeoffs for the Alpha EV8 conditional branch predictor , 2002, ISCA.