Fast and Efficient Partial Code Reordering: Taking Advantage of Dynamic Recompilation

Poor instruction cache locality can degrade performance on modern architectures. For example, our simulation results show that eliminating all instruction cache misses improves performance by as much as 16% for a modestly sized instruction cache. In this paper, we show how to take advantage of dynamic code generation in a Java Virtual Machine (VM) to improve instruction locality at run-time. We develop a dynamic code reordering (DCR) system, a low-overhead, online approach for improving instruction locality. DCR has three optimizations: (1) interprocedural method separation; (2) intraprocedural code splitting; and (3) code padding. DCR uses the dynamic call graph and an edge profile, both of which most VMs already collect, to separate hot methods from cold methods and hot code from cold code within a method. It also puts padding between methods to minimize conflict misses between frequent caller/callee pairs. It performs these optimizations incrementally, and only when the VM is optimizing a method at a higher level. We implement DCR in Jikes RVM and show that its overhead is negligible. Extensive simulation and run-time experiments show that a simple code space improves average performance on a Pentium 4 by around 6% on SPEC and DaCapo Java benchmarks. These programs, however, have very small instruction cache footprints, which limits the opportunities for DCR to improve performance. Consequently, DCR optimizations on average show little effect, sometimes degrading performance and occasionally improving it by up to 5%. Our work shows that the VM has the potential to improve instruction locality dynamically and incrementally by simply piggybacking on hotspot recompilation.
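To make method separation and code padding concrete, the following is a minimal sketch and not the DCR implementation: it lays out all methods in one batch pass, whereas DCR works incrementally at recompilation time. The cache geometry (an 8 KiB direct-mapped cache with 64-byte lines), the `hot_threshold` parameter, and the names `Method`, `cache_lines`, and `layout` are all assumptions made for illustration. Hot methods are placed first and cold methods last, and padding is inserted before a hot method whenever it would map to the same cache lines as a frequent caller or callee that has already been placed.

```python
from dataclasses import dataclass

# Hypothetical direct-mapped instruction cache (assumed, not from the paper).
CACHE_SIZE = 8 * 1024
LINE_SIZE = 64
NUM_LINES = CACHE_SIZE // LINE_SIZE


@dataclass
class Method:
    name: str
    size: int   # code size in bytes
    calls: int  # invocation count taken from a profile


def cache_lines(start, size):
    """Cache line indices that the code range [start, start + size) maps to."""
    first = start // LINE_SIZE
    last = (start + size - 1) // LINE_SIZE
    return {line % NUM_LINES for line in range(first, last + 1)}


def layout(methods, hot_threshold, hot_pairs):
    """Return {method name: start address}.

    Hot methods go first (most frequently called first), cold methods last.
    Before placing a hot method, pad one cache line at a time until it no
    longer conflicts with any already placed partner from hot_pairs.
    """
    sizes = {m.name: m.size for m in methods}
    hot = sorted((m for m in methods if m.calls >= hot_threshold),
                 key=lambda m: -m.calls)
    cold = [m for m in methods if m.calls < hot_threshold]

    placed = {}
    addr = 0
    for m in hot:
        partners = [b if a == m.name else a
                    for a, b in hot_pairs if m.name in (a, b)]
        partners = [p for p in partners if p in placed]
        # Give up after one full wrap of the cache: no offset avoids conflict.
        for _ in range(NUM_LINES):
            mine = cache_lines(addr, m.size)
            if all(mine.isdisjoint(cache_lines(placed[p], sizes[p]))
                   for p in partners):
                break
            addr += LINE_SIZE
        placed[m.name] = addr
        addr += m.size

    # Cold code is packed after all hot code, with no padding.
    for m in cold:
        placed[m.name] = addr
        addr += m.size
    return placed


if __name__ == "__main__":
    methods = [
        Method("a", 4096, 1000),
        Method("x", 100, 1),
        Method("b", 4096, 900),
        Method("c", 2048, 800),
    ]
    # "a" and "c" form a frequent caller/callee pair in this made-up profile.
    print(layout(methods, hot_threshold=100, hot_pairs=[("a", "c")]))
```

In this made-up profile, `c` would otherwise start exactly one cache size after `a` and map onto the same lines, so the sketch pads it forward until the two no longer overlap. The padding costs code space, which is one reason a real system would apply it only to the hottest pairs.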
