Performance and memory space optimizations for embedded systems

Embedded systems share three common design goals: real-time performance, low power consumption, and low price (limited hardware). Embedded computers use chip multiprocessors (CMPs) to meet these goals. However, a major problem is the lack of efficient software support for CMPs; in particular, automated code parallelizers are needed. The aim of this study is to explore ways to increase performance and to reduce resource usage and energy consumption in embedded systems. We use code restructuring, loop scheduling, data transformation, code and data placement, and scratch-pad memory (SPM) management as our tools in different embedded system scenarios. Most of our work focuses on loop scheduling.

The main contributions of our work are as follows:

(1) We propose a memory-saving strategy that exploits the value locality in array data by storing arrays in a compressed form. Based on the compressed forms of the input arrays, our approach automatically determines the compressed forms of the output arrays and automatically restructures the code.

(2) We propose and evaluate a compiler-directed code scheduling scheme that considers both parallelism and data locality. It analyzes the code using a locality-parallelism graph representation and assigns the nodes of this graph to processors. We also introduce an Integer Linear Programming (ILP) based formulation of the scheduling problem.

(3) We propose a compiler-based, SPM-conscious loop scheduling strategy for array/loop-based embedded applications. The method distributes loop iterations across parallel processors in an SPM-conscious manner: the compiler identifies potential SPM hits and misses and distributes loop iterations so that the processors have close execution times.

(4) We present an SPM management technique that uses Markov-chain-based data access prediction for irregular accesses.

(5) We propose a compiler-directed, integrated code and data placement scheme for 2-D mesh-based CMP architectures. Using a Code-Data Affinity Graph (CDAG) to represent the relationship between loop iterations and array data, the scheme assigns sets of loop iterations to processing cores and sets of data blocks to on-chip memories.

(6) We present a memory-bank-aware dynamic loop scheduling scheme for array-intensive applications. The goal is to minimize the number of memory banks needed to execute a group of loop iterations.
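
As an illustration of the SPM-conscious loop scheduling idea described above, the following is a minimal Python sketch, not the scheme proposed in this work. It assumes that the compiler has already estimated, for every loop iteration, how many accesses hit in the SPM and how many miss. The latencies HIT_CYCLES and MISS_CYCLES, the function names, and the greedy "largest cost first, least-loaded processor next" heuristic are all assumptions chosen for illustration; the abstract does not specify how the balancing is actually performed.

```python
import heapq

# Assumed latencies (in cycles); real values depend on the target platform.
HIT_CYCLES = 1     # hypothetical cost of an access served by the SPM
MISS_CYCLES = 20   # hypothetical cost of an access that goes off-chip


def estimate_cost(hits, misses):
    """Estimated execution cost of one iteration from its SPM hit/miss counts."""
    return hits * HIT_CYCLES + misses * MISS_CYCLES


def schedule_iterations(iterations, num_procs):
    """Distribute loop iterations so that processors get similar estimated times.

    iterations: list of (hits, misses) tuples, one per loop iteration.
    Returns (assignment, loads), where assignment[p] lists the iteration
    indices given to processor p and loads[p] is its estimated total cost.
    """
    costs = [estimate_cost(h, m) for h, m in iterations]
    # Visit the most expensive iterations first (greedy heuristic, an assumption).
    order = sorted(range(len(costs)), key=lambda i: costs[i], reverse=True)

    heap = [(0, p) for p in range(num_procs)]  # (current load, processor id)
    assignment = [[] for _ in range(num_procs)]
    for i in order:
        load, p = heapq.heappop(heap)          # least-loaded processor so far
        assignment[p].append(i)
        heapq.heappush(heap, (load + costs[i], p))

    loads = [0] * num_procs
    for load, p in heap:
        loads[p] = load
    return assignment, loads
```

For example, calling schedule_iterations with eight iterations whose miss counts differ widely, and num_procs set to 2, places the miss-heavy iterations on different processors and fills the remaining capacity with cheap, hit-dominated iterations, so that the two estimated loads end up close to each other.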
