Continuous program optimization: A case study

Much of the software in everyday operation does not make optimal use of the hardware on which it actually runs. Among the reasons for this discrepancy are hardware/software mismatches, modularization overheads introduced by software engineering considerations, and the inability of systems to adapt to users' behaviors.

A solution to these problems is to delay code generation until load time. This is the earliest point at which a piece of software can be fine-tuned to the actual capabilities of the hardware on which it is about to be executed, and also the earliest point at which modularization overheads can be overcome by global optimization.

A still better match between software and hardware can be achieved by replacing the already executing software at regular intervals with new versions constructed on the fly by a background code re-optimizer. This not only enables the use of live profiling data to guide optimization decisions, but also facilitates adaptation to changing usage patterns and to the late addition of dynamic link libraries.

This paper presents a system that provides code generation at load time and continuous program optimization at run time. First, the architecture of the system is presented. Then, two optimization techniques are discussed that were developed specifically in the context of continuous optimization. The first of these continually adjusts the storage layouts of dynamic data structures to maximize data cache locality, while the second performs profile-driven instruction re-scheduling to increase instruction-level parallelism. These two optimizations have very different cost/benefit ratios, which are presented in a series of benchmarks. The paper concludes with an outlook on future research directions and an enumeration of some remaining research problems.

The empirical results presented in this paper make a case in favor of continuous optimization, but indicate that it needs to be applied judiciously. In many situations, the costs of dynamic optimization outweigh its benefits, so that no break-even point is ever reached. In favorable circumstances, on the other hand, speed-ups of over 120% have been observed. It appears that the main beneficiaries of continuous optimization are shared libraries, which at different times can be optimized in the context of the currently dominant client application.
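
The break-even argument can be made concrete with a small calculation. The sketch below is not the paper's cost model; it assumes a single re-optimization with a fixed cost in seconds and a constant speed-up factor afterwards, and the names `break_even_seconds` and `pays_off` are hypothetical.

```python
def break_even_seconds(optimize_cost_s, speedup):
    """Seconds of original-code work needed before re-optimization pays off.

    Assumed model (not taken from the paper): re-optimization costs
    `optimize_cost_s` seconds once, after which work that used to take t
    seconds takes t / speedup seconds. Returns None when the new code is
    not faster, because then no break-even point exists.
    """
    if speedup <= 1.0:
        return None
    # Time saved on t seconds of work is t * (1 - 1/speedup);
    # break-even is reached when that saving equals the one-time cost.
    return optimize_cost_s / (1.0 - 1.0 / speedup)


def pays_off(optimize_cost_s, speedup, remaining_work_s):
    """True if the remaining work is long enough to amortize the cost."""
    threshold = break_even_seconds(optimize_cost_s, speedup)
    return threshold is not None and remaining_work_s >= threshold
```

Under this model, a re-optimization that costs 2 seconds and doubles execution speed is amortized only after 4 seconds' worth of original-code work. A short-lived program never reaches that point, which is one way to read the observation that dynamic optimization must be applied judiciously.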
