Enabling Branch-Mispredict Level Parallelism by Selectively Flushing Instructions

Conventionally, a branch misprediction is resolved by flushing the wrongly speculated instructions from the reorder buffer and refetching instructions along the correct path. However, a large fraction of these instructions may lie beyond the point where the wrong path reconverges with the correct path, and may already have executed correctly. They are flushed nonetheless to ensure in-order commit. Prior work has recognized this inefficiency and proposes either complex additions to the core that reuse the correctly executed instructions, or less intrusive solutions that reuse only part of the reconverged instructions. We propose a hardware-software cooperative mechanism that recovers correctly executed instructions, avoiding the need to refetch and re-execute them. It combines relatively limited additions to the core architecture with high reuse of reconverged instructions. Adding the software hints that enable our mechanism takes an effort similar to parallelizing an application, which is already necessary to extract high performance from current multicore processors. We evaluate the technique on emerging graph applications and on sorting, workloads that are known to perform poorly on conventional CPUs, and report an average performance increase of 29%.
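The difference between a conventional flush and a selective flush can be illustrated with a toy accounting model. The sketch below is only an illustration under stated assumptions and is not the mechanism proposed in the paper: the `RobEntry` fields, the function names, and the example instruction names are all hypothetical, and the sketch simply assumes that each instruction younger than the mispredicted branch is already labeled as reconverged and as correctly executed.

```python
from dataclasses import dataclass


@dataclass
class RobEntry:
    """Hypothetical reorder-buffer entry younger than a mispredicted branch."""
    name: str
    reconverged: bool         # assumed label: fetched after both paths rejoin
    executed_correctly: bool  # assumed label: result does not depend on wrong-path values


def full_flush(younger):
    """Conventional recovery: discard every instruction younger than the branch."""
    return [], list(younger)


def selective_flush(younger):
    """Toy selective recovery: keep entries that reconverged and executed correctly."""
    kept = [e for e in younger if e.reconverged and e.executed_correctly]
    flushed = [e for e in younger if not (e.reconverged and e.executed_correctly)]
    return kept, flushed


# Illustrative reorder-buffer contents after a mispredicted branch (made-up names).
younger = [
    RobEntry("wrong_path_add", False, False),
    RobEntry("wrong_path_store", False, False),
    RobEntry("next_iter_load", True, True),
    RobEntry("next_iter_cmp", True, True),
    RobEntry("uses_wrong_value", True, False),
]

kept_full, flushed_full = full_flush(younger)
kept_sel, flushed_sel = selective_flush(younger)

print("full flush discards:", len(flushed_full))
print("selective flush discards:", len(flushed_sel))
print("selective flush keeps:", [e.name for e in kept_sel])
```

In this made-up example, the conventional flush discards all five younger instructions, whereas the selective flush discards three and keeps the two reconverged instructions that executed correctly, so only the discarded ones would need to be refetched and re-executed.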
