Evaluating the impact of dynamic binary translation systems on hardware cache performance

Dynamic binary translation systems enable a wide range of applications such as program instrumentation, optimization, and security. DBTs use a software code cache to store previously translated instructions. The code layout in the code cache greatly differs from the code layout of the original program. This paper provides an exhaustive analysis of the performance of the instruction/trace cache and other structures of the micro-architecture while executing DBTs that focus on program instrumentation, such as DynamoRIO and Pin. We performed our evaluation along two axes. First, we directly accessed the hardware performance counters to determine actual cache miss counts. Second, we used simulation to analyze the spatial locality of the translated application. Our results show that when executing an application under the control of Pin or DynamoRIO, the icache miss counts actually increase over 2X. Surprisingly, the L2 cache and the L1 data cache show a much lower performance degradation or even break even with the native application. We also found that overall performance degradations are due to the instructions added by the DBT itself, and that these extra instructions outweigh any possible spatial locality benefits exhibited in the code cache. Our observations held regardless of the trace length, code cache size, or the presence of a hardware trace cache. These results provide a better understanding of the efficiency of current instrumentation tools and their effects on instruction/trace cache performance and other structures of the microarchitecture.

[1]  Sharad Malik,et al.  Precise miss analysis for program transformations with caches of arbitrary associativity , 1998, ASPLOS VIII.

[2]  Michael D. Smith,et al.  Managing bounded code caches in dynamic binary optimization systems , 2006, TACO.

[3]  Laurie J. Hendren,et al.  Dynamic profiling and trace cache generation , 2003, International Symposium on Code Generation and Optimization, 2003. CGO 2003..

[4]  Jack W. Davidson,et al.  Profile guided code positioning , 1990, SIGP.

[5]  Derek Bruening,et al.  An infrastructure for adaptive dynamic optimization , 2003, International Symposium on Code Generation and Optimization, 2003. CGO 2003..

[6]  K. Kavi Cache Memories Cache Memories in Uniprocessors. Reading versus Writing. Improving Performance , 2022 .

[7]  Mary Lou Soffa,et al.  Retargetable and reconfigurable software dynamic translation , 2003, International Symposium on Code Generation and Optimization, 2003. CGO 2003..

[8]  Jack W. Davidson,et al.  Evaluating fragment construction policies for SDT systems , 2006, VEE '06.

[9]  Harish Patil,et al.  Pin: building customized program analysis tools with dynamic instrumentation , 2005, PLDI '05.

[10]  Perry Cheng,et al.  Myths and realities: the performance impact of garbage collection , 2004, SIGMETRICS '04/Performance '04.

[11]  Jack J. Dongarra,et al.  End-user Tools for Application Performance Analysis Using Hardware Counters , 2001, ISCA PDCS.

[12]  Jack W. Davidson,et al.  Secure and practical defense against code-injection attacks using software dynamic translation , 2006, VEE '06.

[13]  Chau-Wen Tseng,et al.  Data transformations for eliminating conflict misses , 1998, PLDI.

[14]  Michael D. Smith,et al.  Improving region selection in dynamic optimization systems , 2005, 38th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'05).

[15]  Duane Merrill,et al.  Trace fragment selection within method-based JVMs , 2008, VEE '08.

[16]  Evelyn Duesterwald,et al.  Design and implementation of a dynamic optimization framework for windows , 2000 .

[17]  E. Duesterwald,et al.  Software profiling for hot path prediction: less is more , 2000, SIGP.

[18]  Wen-mei W. Hwu,et al.  Achieving High Instruction Cache Performance With An Optimizing Compiler , 1989, The 16th Annual International Symposium on Computer Architecture.

[19]  James R. Larus,et al.  Making Pointer-Based Data Structures Cache Conscious , 2000, Computer.

[20]  Lance M. Berc,et al.  Continuous profiling: where have all the cycles gone? , 1997, TOCS.

[21]  Nicholas Nethercote,et al.  Valgrind: a framework for heavyweight dynamic binary instrumentation , 2007, PLDI '07.

[22]  Kim M. Hazelwood,et al.  SuperPin: Parallelizing Dynamic Instrumentation for Real-Time Performance , 2007, International Symposium on Code Generation and Optimization (CGO'07).

[23]  Mary Lou Soffa,et al.  Overhead reduction techniques for software dynamic translation , 2004, 18th International Parallel and Distributed Processing Symposium, 2004. Proceedings..

[24]  James R. Larus,et al.  Cache-conscious structure layout , 1999, PLDI '99.

[25]  Alan Jay Smith,et al.  Second bibliography on Cache memories , 1991, CARN.

[26]  Matthias Hauswirth,et al.  Using Hardware Performance Monitors to Understand the Behavior of Java Applications , 2004, Virtual Machine Research and Technology Symposium.

[27]  Mateo Valero,et al.  Software Trace Cache , 2014, IEEE Transactions on Computers.

[28]  Weng-Fai Wong,et al.  Compiler orchestrated prefetching via speculation and predication , 2004, ASPLOS XI.

[29]  Scott McFarling,et al.  Program optimization for instruction caches , 1989, ASPLOS III.