Discerning the dominant out-of-order performance advantage: is it speculation or dynamism?

In this paper, we study the performance advantage of an out-of-order (OOO) processor over in-order processors with similar execution resources. In particular, we tease apart the performance contributions of two sources: the improved schedules enabled by OOO hardware speculation support, and the ability to generate different schedules for different dynamic occurrences of the same instructions based on operand and functional-unit availability. We find that the ability to express good static schedules accounts for the bulk of the OOO speedup. Specifically, of the 53% speedup OOO achieves relative to a similarly provisioned in-order machine, 88% can be obtained by using a single "best" static schedule, derived by observing an OOO schedule of the code. We discuss the ISA mechanisms that would be required to express these static schedules. Furthermore, we find that the benefits of dynamism come largely from two kinds of events that influence the application's critical path: load instructions that miss in the cache only part of the time, and branch mispredictions. Much of the benefit of OOO dynamism can therefore be achieved by the potentially simpler task of addressing these two behaviors directly.
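To make the headline numbers concrete (this is a back-of-the-envelope reading of the abstract's figures, not a result stated separately in the paper): if the OOO machine runs 1.53x faster than the in-order baseline, then capturing 88% of that speedup with a single static schedule corresponds to roughly

\[
S_{\text{static}} \approx 1 + 0.88 \times (S_{\text{OOO}} - 1) = 1 + 0.88 \times 0.53 \approx 1.47\times,
\]

leaving only the remaining roughly 6 percentage points of the 53% speedup attributable to true run-to-run dynamism.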
