Scaling the issue window with look-ahead latency prediction

In contemporary out-of-order superscalar design, high IPC is mainly achieved by exposing high instruction-level parallelism (ILP). Scaling the issue window size can certainly provide more ILP; however, future processor scaling demands threaten to limit the size of the issue window.

In this study, we propose a dynamic instruction sorting mechanism that provides more ILP without increasing the size of the issue window. In our approach, early in the pipeline, we predict how long an instruction needs to wait before it can be issued, i.e., the waiting time for its operands to be produced. Using this knowledge, the instructions are placed into a sorting structure, which allows instructions with shorter waiting times to enter the issue window ahead of those with longer waiting times, preventing long-waiting instructions from clogging the issue queue.

The accuracy in predicting instruction waiting times directly determines the effectiveness of our sorting mechanism. While most instructions have deterministic execution latencies, load execution times can vary significantly and are harder to predict because of cache misses and in-flight loads. In this study, we examine techniques to predict load execution time accurately, based on data reference history.
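The sorting idea described above can be illustrated with a small software model. This is only a sketch under stated assumptions, not the hardware structure proposed in the study: the names `LastLatencyPredictor` and `SortingBuffer`, the latency constants, and the register names are all hypothetical. In particular, the load predictor here simply reuses the latency last observed for the same load PC, which is a stand-in for the history-based techniques the abstract refers to.

```python
import heapq

HIT_LATENCY = 2   # assumed cache-hit load latency in cycles (hypothetical value)
ALU_LATENCY = 1   # assumed latency of a simple ALU operation (hypothetical value)


class LastLatencyPredictor:
    """Predicts a load's latency as the latency last observed for the same PC.

    A deliberately simple stand-in for a history-based load latency predictor.
    """

    def __init__(self, default=HIT_LATENCY):
        self.table = {}          # load PC -> last observed latency
        self.default = default   # prediction for loads with no history

    def predict(self, pc):
        return self.table.get(pc, self.default)

    def update(self, pc, observed_latency):
        self.table[pc] = observed_latency


class SortingBuffer:
    """Orders instructions by the predicted cycle at which their operands are ready."""

    def __init__(self):
        self.heap = []       # entries: (predicted ready cycle, program order, name)
        self.seq = 0         # program-order tie breaker
        self.ready_at = {}   # register -> predicted cycle its value becomes available

    def insert(self, name, srcs, dest, latency, now):
        """Place an instruction in the buffer; return its predicted waiting time."""
        # An instruction can issue once its slowest source operand is available.
        ready = max([self.ready_at.get(r, now) for r in srcs] + [now])
        if dest is not None:
            # Consumers of dest must wait for this instruction to finish.
            self.ready_at[dest] = ready + latency
        heapq.heappush(self.heap, (ready, self.seq, name))
        self.seq += 1
        return ready - now

    def release(self, n):
        """Move up to n instructions with the shortest predicted wait to the issue window."""
        out = []
        while self.heap and len(out) < n:
            out.append(heapq.heappop(self.heap)[2])
        return out


predictor = LastLatencyPredictor()
predictor.update(0x40, 20)   # assume the load at PC 0x40 missed the cache last time

buf = SortingBuffer()
buf.insert("load r1", [], "r1", predictor.predict(0x40), now=0)
buf.insert("add r2", ["r1", "r3"], "r2", ALU_LATENCY, now=0)   # depends on the slow load
buf.insert("sub r4", ["r5", "r6"], "r4", ALU_LATENCY, now=0)
buf.insert("mul r7", ["r4"], "r7", ALU_LATENCY, now=0)

print(buf.release(3))   # the add that depends on the slow load is held back
```

In this model the dependent `add r2` stays in the sorting buffer while the independent `sub r4` and `mul r7` pass it, so a three-entry issue window is not occupied by an instruction that is predicted to wait 20 cycles. If the load latency prediction is wrong, the ordering is wrong too, which is why the abstract ties the mechanism's effectiveness to prediction accuracy.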
