Understanding the backward slices of performance degrading instructions

For many applications, branch mispredictions and cache misses limit a processor's performance to a level well below its peak instruction throughput. A small fraction of static instructions, whose behavior cannot be anticipated using current branch predictors and caches, contribute a large fraction of such performance degrading events. This paper analyzes the dynamic instruction stream leading up to these performance degrading instructions to identify the operations necessary to execute them early. The backward slice (the subset of the program that relates to the instruction) of these performance degrading instructions, if small compared to the whole dynamic instruction stream, can be pre-executed to hide the instruction's latency. To overcome conservative dependence assumptions that result in large slices, speculation can be used, resulting in speculative slices. This paper provides an initial characterization of the backward slices of L2 data cache misses and branch mispredictions, and shows the effectiveness of techniques, including memory dependence prediction and control independence, for reducing the size of these slices. Through the use of these techniques, many slices can be reduced to less than one tenth of the full dynamic instruction stream when considering the 512 instructions before the performance degrading instruction.

[1]  David W. Wall,et al.  Limits of instruction-level parallelism , 1991, ASPLOS IV.

[2]  Todd M. Austin,et al.  Dynamic dependency analysis of ordinary programs , 1992, ISCA '92.

[3]  Monica S. Lam,et al.  Limits of control flow on parallelism , 1992, ISCA '92.

[4]  Predictability of load/store instruction latencies , 1993, Proceedings of the 26th Annual International Symposium on Microarchitecture.

[5]  Frank Tip,et al.  A survey of program slicing techniques , 1994, J. Program. Lang..

[6]  Eric Rotenberg,et al.  Assigning confidence to conditional branch predictions , 1996, Proceedings of the 29th Annual IEEE/ACM International Symposium on Microarchitecture. MICRO 29.

[7]  David W. Binkley,et al.  Program slicing , 2008, 2008 Frontiers of Software Maintenance.

[8]  Todd C. Mowry,et al.  Predicting data cache misses in non-numeric applications through correlation profiling , 1997, Proceedings of 30th Annual International Symposium on Microarchitecture.

[9]  Gary S. Tyson,et al.  Improving the accuracy and performance of memory communication through renaming , 1997, Proceedings of 30th Annual International Symposium on Microarchitecture.

[10]  Andreas Moshovos,et al.  Dynamic Speculation and Synchronization of Data Dependences , 1997, ISCA.

[11]  Steven S. Muchnick,et al.  Advanced Compiler Design and Implementation , 1997 .

[12]  Todd M. Austin,et al.  The SimpleScalar tool set, version 2.0 , 1997, CARN.

[13]  Andreas Moshovos,et al.  Streamlining inter-operation memory communication via data dependence prediction , 1997, Proceedings of 30th Annual International Symposium on Microarchitecture.

[14]  Karel Driesen,et al.  The cascaded predictor: economical and adaptive branch target prediction , 1998, Proceedings. 31st Annual ACM/IEEE International Symposium on Microarchitecture.

[15]  Andreas Moshovos,et al.  Memory dependence prediction , 1998 .

[16]  Andreas Moshovos,et al.  Dependence based prefetching for linked data structures , 1998, ASPLOS VIII.

[17]  M. Dubois,et al.  Assisted Execution , 1998 .

[18]  Joel S. Emer,et al.  Memory dependence prediction using store sets , 1998, Proceedings. 25th Annual International Symposium on Computer Architecture (Cat. No.98CB36235).

[19]  Olivier Temam,et al.  Dataflow analysis of branch mispredictions and its application to early resolution of branch outcomes , 1998, Proceedings. 31st Annual ACM/IEEE International Symposium on Microarchitecture.

[20]  Memory dependence prediction using store sets , 1998, Proceedings. 25th Annual International Symposium on Computer Architecture (Cat. No.98CB36235).

[21]  Trevor N. Mudge,et al.  The YAGS branch prediction scheme , 1998, Proceedings. 31st Annual ACM/IEEE International Symposium on Microarchitecture.

[22]  Yale N. Patt,et al.  Simultaneous subordinate microthreading (SSMT) , 1999, ISCA.

[23]  Gary S. Tyson,et al.  Classifying load and store instructions for memory renaming , 1999, ICS '99.

[24]  Gurindar S. Sohi,et al.  The use of multithreading for exception handling , 1999, MICRO-32. Proceedings of the 32nd Annual ACM/IEEE International Symposium on Microarchitecture.

[25]  Gurindar S. Sohi,et al.  Effective jump-pointer prefetching for linked data structures , 1999, ISCA.

[26]  Michael Gschwind,et al.  Optimizations and oracle parallelism with dynamic translation , 1999, MICRO-32. Proceedings of the 32nd Annual ACM/IEEE International Symposium on Microarchitecture.

[27]  Andreas Moshovos,et al.  Improving virtual function call target prediction via dependence-based pre-computation , 1999, ICS '99.