Fast-path loop unrolling of non-counted loops to enable subsequent compiler optimizations

Java programs can contain non-counted loops, that is, loops for which the iteration count can be determined neither at compile time nor at run time. State-of-the-art compilers do not aggressively optimize them, since unrolling such loops typically also requires duplicating the loop's exit condition, which only improves run-time performance if subsequent compiler optimizations can simplify the unrolled code. This paper presents an unrolling approach for non-counted loops that uses simulation at run time to determine whether unrolling such loops enables subsequent compiler optimizations. Simulating loop unrolling allows the compiler to determine the performance and code size effects of each potential transformation before performing it. We implemented our approach on top of the GraalVM, a high-performance virtual machine for Java, and evaluated it with a set of Java and JavaScript benchmarks in terms of peak performance, compilation time, and code size increase. We show that our approach can improve performance by up to 150% while causing a median code size and compile-time increase of no more than 25%. Our results indicate that fast-path unrolling of non-counted loops can be used in practice to increase the performance of Java applications.
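To illustrate the transformation discussed in the abstract, the following hand-written Java sketch contrasts a non-counted loop with a version unrolled by a factor of two. The class and method names (NonCountedLoopExample, sumUntilNegative, sumUntilNegativeUnrolled) are hypothetical, and the unrolled form only approximates at the source level what the compiler would produce internally; the paper's approach operates on the compiler IR at run time, not on source code.

    public final class NonCountedLoopExample {

        // A non-counted loop: how many iterations run depends on the data
        // (the first negative element), so the trip count is unknown both
        // at compile time and at run time.
        static int sumUntilNegative(int[] values) {
            int sum = 0;
            int i = 0;
            while (i < values.length && values[i] >= 0) { // exit condition
                sum += values[i];
                i++;
            }
            return sum;
        }

        // Unrolling by a factor of two duplicates the loop body together
        // with its exit condition. The duplicated check is the cost of the
        // transformation; it only pays off if later optimizations can
        // simplify or exploit the enlarged loop body.
        static int sumUntilNegativeUnrolled(int[] values) {
            int sum = 0;
            int i = 0;
            while (i < values.length && values[i] >= 0) {
                sum += values[i];
                i++;
                if (!(i < values.length && values[i] >= 0)) { // duplicated exit check
                    break;
                }
                sum += values[i];
                i++;
            }
            return sum;
        }

        public static void main(String[] args) {
            int[] data = {3, 1, 4, -1, 5};
            // Both variants compute the same result (8); only the loop
            // structure differs.
            System.out.println(sumUntilNegative(data));
            System.out.println(sumUntilNegativeUnrolled(data));
        }
    }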
