Compiler-Assisted Dynamic Predicated Execution of Complex Control-Flow Structures

Despite decades of research in branch prediction, branch predictors remain imperfect, which causes significant performance loss in aggressive processors that support large instruction windows and deep pipelines. This paper proposes a new processor architecture for handling hard-to-predict branches, the diverge-merge processor. The goal of this paradigm is to eliminate the mispredictions caused by hard-to-predict dynamic branches by dynamically predicating them. To achieve this without incurring large hardware cost and complexity, the compiler identifies branches that are suitable for dynamic predication, called diverge branches. The compiler also selects a control-flow merge (or reconvergence) point for each diverge branch to aid dynamic predication. If a diverge branch is hard to predict at run time, the microarchitecture dynamically predicates the instructions between the diverge branch and the corresponding merge point: it first executes one path after the branch, then executes the other path, and later merges the data flow produced by the two paths using special select-uop instructions. The control-flow merge point is selected from the frequently executed paths in the program using profile information. The control flow from a diverge branch therefore does not have to merge (although it usually does), which allows dynamic predication of a much larger set of branches than simple hammock (if-else) branches. Our evaluations show that a diverge-merge processor outperforms a baseline with an aggressive branch predictor by 10.8% on average over 15 SPEC CPU2000 benchmarks, through an average reduction of 31% in pipeline flushes due to branch mispredictions. Furthermore, the proposed mechanism outperforms a previously proposed dynamic predication mechanism that can predicate only simple hammock branches by 7.8%.
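The data-flow merge described above can be illustrated with a small software model. The following is only a sketch under stated assumptions, not the paper's hardware design: the names `select_uop` and `dynamic_predicate`, the dictionary-based register file, and the representation of each path as a list of (destination, function) pairs are all hypothetical and chosen for illustration. It shows both paths after a diverge branch executing on private copies of the register state, followed by one select operation per register written on either path.

```python
def select_uop(predicate, taken_value, not_taken_value):
    """Model of a select operation: pick the value from the correct path."""
    return taken_value if predicate else not_taken_value


def dynamic_predicate(regs, taken_path, not_taken_path, predicate):
    """Run both paths up to the merge point, then merge their data flow.

    regs:            dict of register name -> value at the diverge branch
    taken_path:      list of (dest, fn) pairs; fn reads that path's registers
    not_taken_path:  same shape, for the other direction of the branch
    predicate:       the resolved branch outcome (True means taken)
    """
    # Each path works on its own copy, standing in for separate renaming.
    taken_regs = dict(regs)
    for dest, fn in taken_path:
        taken_regs[dest] = fn(taken_regs)

    not_taken_regs = dict(regs)
    for dest, fn in not_taken_path:
        not_taken_regs[dest] = fn(not_taken_regs)

    # One select per register written on either path. A register written on
    # only one path keeps its pre-branch value in the other path's copy.
    written = {d for d, _ in taken_path} | {d for d, _ in not_taken_path}
    merged = dict(regs)
    for dest in sorted(written):
        merged[dest] = select_uop(
            predicate, taken_regs.get(dest), not_taken_regs.get(dest)
        )
    return merged


# Hypothetical example: the taken path writes r2 and r3, the other only r2.
initial = {"r1": 5, "r2": 0, "r3": 7}
taken = [("r2", lambda r: r["r1"] + 1), ("r3", lambda r: r["r2"] * 2)]
not_taken = [("r2", lambda r: r["r1"] - 1)]

result_taken = dynamic_predicate(initial, taken, not_taken, True)
result_not_taken = dynamic_predicate(initial, taken, not_taken, False)
```

In this model the branch outcome is only needed at the merge step, which is why no pipeline flush is modeled: whichever way the branch resolves, the merged register state matches the state that executing only the correct path would have produced.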
