The performance impact analysis of loop unrolling

Loop unrolling is a well-known technique that usually speeds up programs containing loops. The speedup comes from reducing the number of counter increments and branch jumps executed at the end of each loop iteration. This paper analyzes the impact of loop unrolling on various processor types and memory access patterns. The experiments show that the achieved speedup correlates strongly with how the problem size relates to the cache capacity. Loop unrolling yields a higher speedup for smaller problems, while it has no impact for problems larger than the last-level cache, because execution is then dominated by the huge number of cache misses. Another important result is that loop unrolling achieves a greater speedup on Intel CPUs than on AMD CPUs. In this paper we analyze and discuss these behaviors of loop unrolling.
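To make the transformation concrete, the following is a minimal sketch in C of a summation loop and a manually unrolled version of it. It is not taken from the paper's experiments: the function names `sum_rolled` and `sum_unrolled4` and the unroll factor of 4 are assumptions chosen only for illustration.

```c
#include <stddef.h>

/* Baseline loop: one counter increment and one conditional branch
 * are executed for every array element. */
double sum_rolled(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Same loop unrolled by a factor of 4 (illustrative choice):
 * one counter increment and one conditional branch are executed
 * for every four array elements. */
double sum_unrolled4(const double *a, size_t n)
{
    double s = 0.0;
    size_t i = 0;

    /* Main unrolled body: runs while at least four elements remain. */
    for (; i + 3 < n; i += 4) {
        s += a[i];
        s += a[i + 1];
        s += a[i + 2];
        s += a[i + 3];
    }

    /* Remainder loop: handles the last n % 4 elements. */
    for (; i < n; i++)
        s += a[i];

    return s;
}
```

Both functions add the elements in the same order, so they return the same value. The unrolled version only reduces loop overhead; it does not reduce the number of memory accesses, which is consistent with the observation that the benefit disappears once cache misses dominate.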
