Source Code Optimization Techniques for Data Flow Dominated Embedded Software

List Of Figures. List Of Tables. Acknowledgments. Foreword.

1. Introduction. 1.1. Why Source Code Optimization? 1.1.1. Abstraction Levels Of Code Optimization. 1.1.2. Survey Of The Traditional Code Optimization Process. 1.1.3. Scopes For Code Optimization. 1.2. Target Application Domain. 1.3. Goals And Contributions. 1.4. Outline Of The Book.

2. Existing Code Optimization Techniques. 2.1. Description Optimization. 2.2. Algorithm Selection. 2.3. Memory Hierarchy Exploitation. 2.4. Processor-Independent Source Code Optimizations. 2.5. Processor-Specific Source Code Optimizations. 2.6. Compiler Optimizations. 2.6.1. Loop Optimizations For High Performance Computing. 2.6.2. Code Generation For Embedded Processors.

3. Fundamental Concepts For Optimization And Evaluation. 3.1. Polyhedral Modeling. 3.2. Optimization Using Genetic Algorithms. 3.3. Benchmarking Methodology. 3.3.1. Profiling Of Pipeline And Cache Performance. 3.3.2. Compilation For Runtime And Code Size Measurement. 3.3.3. Estimation Of Energy Dissipation. 3.4. Summary.

4. Intermediate Representations. 4.1. Low-Level Intermediate Representations. 4.1.1. GNU RTL. 4.1.2. Trimaran ELCOR IR. 4.2. Medium-Level Intermediate Representations. 4.2.1. Sun IR. 4.2.2. IR-C / LANCE. 4.3. High-Level Intermediate Representations. 4.3.1. SUIF. 4.3.2. IMPACT. 4.4. Selection Of An IR For Source Code Optimization. 4.5. Summary.

5. Loop Nest Splitting. 5.1. Introduction. 5.1.1. Control Flow Overhead In Data Dominated Software. 5.1.2. Control Flow Overhead Caused By Data Partitioning. 5.1.3. Splitting Of Loop Nests For Control Flow Optimization. 5.2. Related Work. 5.3. Analysis And Optimization Techniques For Loop Nest Splitting. 5.3.1. Preliminaries. 5.3.2. Condition Satisfiability. 5.3.3. Condition Optimization. 5.3.3.1. Chromosomal Representation. 5.3.3.2. Fitness Function. 5.3.3.3. Polytope Generation. 5.3.4. Global Search Space Construction. 5.3.5. Global Search Space Exploration. 5.3.5.1. Chromosomal Representation. 5.3.5.2. Fitness Function. 5.3.6. Source Code Transformation. 5.3.6.1. Generation Of The Splitting If-Statement. 5.3.6.2. Loop Nest Duplication. 5.4. Extensions For Loops With Non-Constant Bounds. 5.5. Experimental Results. 5.5.1. Stand-Alone Loop Nest Splitting. 5.5.1.1. Pipeline And Cache Performance. 5.5.1.2. Execution Times And Code Sizes. 5.5.1.3. Energy Consumption. 5.5.2. Combined Data Partitioning And Loop Nest Splitting For Energy-Efficient Scratchpad Utilization. 5.5.2.1. Execution Times And Code Sizes. 5.5.2.2. Energy Consumption. 5.6. Summary.

6. Advanced Code Hoisting. 6.1. A Motivating Example. 6.2. Related Work. 6.3. Analysis Techniques For Advanced Code Hoisting. 6.3.1. Common Subexpression Identification. 6.3.1.1. Collection Of Equivalent Expressions. 6.3.1.2. Computation Of Live Ranges Of Expressions. 6.3.2. Determination Of The Outermost Loop For A CSE. 6.3.3. Computation Of Execution Frequencies Using Polytope Models. 6.4. Experimental Results. 6.4.1. Pipeline And Cache Performance. 6.4.2. Execution Times And Code Sizes. 6.4.3. Energy Consumption. 6.5. Summary.

7. Ring Buffer Replacement. 7.1. Motivation. 7.2. Optimization Steps. 7.2.1. Ring Buffer Scalarization. 7.2.2. Loop Unrolling For Ring Buffers. 7.3. Experimental Results. 7.3.1. Pipeline And Cache Performance. 7.3.2. Execution Times And Code Sizes. 7.3.3. Energy Consumption. 7.4. Summary.

8. Summary And Conclusions. 8.1. Summary And Contribution To Research. 8.2. Future Work.

Appendices: Experimental Comparison Of SUIF And IR-C / LANCE. Benchmarking Data For Loop Nest Splitting. B.1. Values Of Performance-Monitoring Counters. B.1.1. Intel Pentium III. B.1.2. Sun UltraSPARC III. B.1.3. MIPS R10000. B.2. Execution Times And Code Sizes. B.3. Energy Consumption Of An ARM7TDMI Core. B.4. Combined Data Partitioning And Loop Nest Split
