Trade-offs in loop transformations

Multimedia systems perform huge numbers of memory accesses and have large memory footprints. To alleviate the impact of these accesses and reduce the footprint, high-level memory exploration and optimization techniques have been proposed that use the memory hierarchy more efficiently. An important step in these techniques is loop transformation (LT), which has a crucial effect on later data memory footprint optimization steps and on code generation. However, state-of-the-art work has focused only on individual objectives. The main objective in the literature is improving the locality of data accesses, and thereby reducing the data memory footprint. Such work does not consider the trade-offs in the LT step in relation to successive optimization steps, and is therefore not globally efficient in mapping the application onto the target platform. In this article we discuss several trade-offs that arise during loop transformations. To our knowledge, we are the first to consider these global trade-offs. Previous work mostly produced a single solution, the one with the best locality and hence the smallest memory footprint, although some research on two-dimensional trade-offs in this area exists as well. We start from this state-of-the-art solution with minimal footprint and show that, by sacrificing some footprint, we can obtain gains in data reuse (crucial for energy reduction) and reduce the control-flow complexity. We demonstrate our approach on a real-life application, the QSDPCM video coder. Considering these trade-offs for this application leads to a 16% energy reduction in a two-layer memory subsystem and a 10% cycle reduction on the ARM platform.
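As a rough illustration of the kind of trade-off discussed here, the following minimal sketch contrasts two orderings of the same toy computation. It is not taken from the article: the function names `unfused` and `fused`, the doubling kernel, and the element-count footprint measure are all assumptions chosen for brevity. The unfused version keeps a full intermediate array but has two simple loops; the fused version keeps only two scalars but needs a guard inside the loop body, so its control flow is more complex.

```python
def unfused(a):
    """Two plain loops: simple control flow, intermediate array of len(a) elements."""
    n = len(a)
    tmp = [0] * n                      # whole intermediate signal is stored
    for i in range(n):
        tmp[i] = 2 * a[i]              # producer loop
    out = [0] * max(n - 1, 0)
    for i in range(1, n):
        out[i - 1] = tmp[i] + tmp[i - 1]   # consumer loop, no conditions
    return out, n                      # n = elements held for the intermediate


def fused(a):
    """One fused loop: only two live intermediate values, but a guard per iteration."""
    n = len(a)
    out = [0] * max(n - 1, 0)
    prev = 0
    for i in range(n):
        cur = 2 * a[i]                 # produce one value
        if i > 0:                      # guard introduced by fusing unequal loop bounds
            out[i - 1] = cur + prev    # consume it immediately
        prev = cur
    return out, 2                      # prev and cur are the only intermediates


if __name__ == "__main__":
    data = [1, 2, 3, 4]
    print(unfused(data))   # ([6, 10, 14], 4)
    print(fused(data))     # ([6, 10, 14], 2)
```

Both versions compute the same result; they differ only in how much intermediate storage they need and in how much conditional logic sits inside the loop. Which one is preferable on a given platform depends on the later optimization steps, which is the point the abstract makes.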
