AVPP

Value prediction improves instruction-level parallelism in superscalar processors by breaking true data dependencies. Although this technique can significantly improve overall performance, most state-of-the-art value prediction approaches incur a high hardware cost, which is the main obstacle to their wide adoption in current processors. To tackle this issue, we revisit load value prediction as an efficient alternative to the classical approaches that predict all instructions. By speculating only on loads, the pressure on shared resources (e.g., the Physical Register File) and the predictor size can be substantially reduced (e.g., by more than 90% compared to recent works). We observe, however, that existing value predictors cannot achieve very high performance when speculating only on load instructions. To solve this problem, we propose a new, accurate, and low-cost mechanism for predicting the values of load instructions: the Address-first Value-next Predictor with Value Prefetching (AVPP). The key idea of our predictor is to predict the load address first (which, we find, is much more predictable than the value) and then to use a small non-speculative Value Table (VT), indexed by the predicted address, to predict the value. To increase the coverage of AVPP, we raise the hit rate of the VT by also predicting the load address of a future instance of the same load instruction and prefetching its value into the VT. We show that AVPP is relatively easy to implement, requiring only 2.5% of the area of a 32KB L1 data cache. We compare our mechanism with five state-of-the-art value prediction techniques, evaluated within the context of load value prediction, in a relatively narrow out-of-order processor. On average, our AVPP predictor achieves an 11.2% speedup and 3.7% energy savings over the baseline processor, outperforming all the state-of-the-art predictors in 16 of the 23 benchmarks we evaluate.
We also evaluate AVPP combined with different prefetching techniques and show additive performance gains (20% average speedup). In addition, we propose a new taxonomy that classifies value predictor policies with respect to predictor update, predictor availability, and in-flight pending updates, and we evaluate these policies in detail.
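
The address-first, value-next idea described in the abstract can be illustrated with a minimal software sketch. This is not the hardware design of the paper: the per-PC stride address predictor, the saturating confidence counter, the unbounded Python dictionaries standing in for fixed-size tables, and all names (`AddressFirstValueNext`, `commit_load`, `commit_store`, `lookahead`) are assumptions made only for illustration. The sketch keeps the two properties stated above: the value is looked up in a Value Table indexed by the predicted address, and the table is filled only with committed (non-speculative) data, including a prefetched value for a future instance of the same load.

```python
from dataclasses import dataclass

MAX_CONFIDENCE = 3  # assumed saturating-counter limit (not from the paper)


@dataclass
class AddrEntry:
    """Per-load state of a simple stride address predictor (an assumption)."""
    last_addr: int
    stride: int = 0
    confidence: int = 0


class AddressFirstValueNext:
    """Toy model: predict the load address first, then look up the value."""

    def __init__(self, memory, conf_threshold=2, lookahead=1):
        self.memory = memory        # dict addr -> value, stands in for the caches
        self.addr_table = {}        # load PC -> AddrEntry
        self.value_table = {}       # address -> value (the VT)
        self.conf_threshold = conf_threshold
        self.lookahead = lookahead  # how many instances ahead to prefetch

    def predict(self, pc):
        """Return a predicted value for the next instance of `pc`, or None."""
        entry = self.addr_table.get(pc)
        if entry is None or entry.confidence < self.conf_threshold:
            return None                         # address not predictable yet
        predicted_addr = entry.last_addr + entry.stride
        return self.value_table.get(predicted_addr)  # None on a VT miss

    def commit_load(self, pc, addr, value):
        """Train with a committed load and prefetch a future value."""
        self.value_table[addr] = value          # VT holds committed data only
        entry = self.addr_table.get(pc)
        if entry is None:
            self.addr_table[pc] = AddrEntry(last_addr=addr)
            return
        stride = addr - entry.last_addr
        if stride == entry.stride:
            entry.confidence = min(entry.confidence + 1, MAX_CONFIDENCE)
        else:
            entry.stride = stride
            entry.confidence = 0
        entry.last_addr = addr
        if entry.confidence >= self.conf_threshold:
            # Value prefetching: fetch the value of a future instance.
            future_addr = addr + entry.stride * self.lookahead
            if future_addr in self.memory:
                self.value_table[future_addr] = self.memory[future_addr]

    def commit_store(self, addr, value):
        """Committed stores keep memory and any matching VT entry coherent."""
        self.memory[addr] = value
        if addr in self.value_table:
            self.value_table[addr] = value
```

In this sketch, a load that walks an array with a constant stride yields no prediction until the stride has been confirmed often enough to reach `conf_threshold`; after that, each prediction hits the value prefetched at the previous commit. With in-flight instances of the same load, a real pipeline would need a larger `lookahead`, which is why it is left as a parameter here.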
