Code restructuring for improving execution efficiency, code size and power consumption for embedded

Many embedded systems, such as personal digital assistants (PDAs) and cellular phones, make heavy use of digital signal processing and are therefore based on Digital Signal Processors (DSPs). DSPs such as the TMS320C2x and the DSP5600x have irregular data-paths that are typically the result of application-specific needs (such as chaining multiply-accumulate operations). Efficient code generation for such embedded DSP processors is a challenging problem because of additional constraints such as tight memory and low power consumption demands, which create the need for compact code. In this work, we address the problem of generating compact and efficient code for embedded DSP processors. Most DSP instruction set architectures (ISAs) feature intra-instruction parallelism (IIP), which enables individual operations to be executed in parallel within a complex instruction. Exploiting the parallelism present in such ISAs reduces generated code size and improves performance. We present a code restructuring technique that exploits this parallelism through maximal utilization of the complex instructions present in the DSP instruction set. We formulate this as a maximal benefit code restructuring problem: derive the arrangement of statements in a basic block that maximally exploits IIP without violating data dependencies. This problem is equivalent to the Precedence Constrained Hamiltonian Path Problem for DAGs and, in general, to the precedence constrained Traveling Salesman Problem (PTSP), both of which are NP-hard. A heuristic is then presented to solve the problem. We have implemented this heuristic in the SPAM compiler, targeted to generate code for the TMS320C25 DSP. We tested our framework on a number of benchmarks and found that the performance of the generated code, measured in dynamic instruction cycle counts, improves by as much as 7%, with an average of 3%. The average code size reduction over code compiled without exploiting parallelism is 2.9%. We also studied the effect of loop unrolling on the available IIP within a basic block. The improvement in static code size can be traded off against a reduction in dynamic instruction cycle counts through loop unrolling techniques. Finally, we measure the power efficiency of the generated code and show that not only is code size reduced, but power consumption is also significantly reduced.

1 Introduction

Embedded processors are widely used in a variety of applications such as cellular phones, pagers, printers, copiers, digital cameras, automobiles, flight navigation systems, etc. Unlike general purpose processors, embedded …
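The maximal benefit code restructuring problem stated in the abstract can be made concrete with a small sketch. The code below is not the heuristic implemented in the SPAM compiler, which this excerpt does not describe; it is a simple greedy ordering written for illustration only. The function names (`restructure`, `total_benefit`, `is_valid`), the dependence set, and the benefit table are all assumptions: a benefit of 1 for the pair (0, 2) stands for a hypothetical case in which a multiply immediately followed by its accumulate can be folded into one complex instruction.

```python
def restructure(n, deps, benefit):
    """Greedily order statements 0..n-1 of a basic block (illustrative only).

    deps:    set of (a, b) pairs meaning statement a must precede b.
    benefit: dict mapping (a, b) to the assumed gain obtained when b is
             placed immediately after a (hypothetical values).
    """
    preds = {s: set() for s in range(n)}
    for a, b in deps:
        preds[b].add(a)

    order, placed = [], set()
    while len(order) < n:
        # Statements whose predecessors have all been emitted.
        ready = [s for s in range(n) if s not in placed and preds[s] <= placed]
        last = order[-1] if order else None
        # Pick the ready statement that pairs best with the previous one;
        # ties fall back to the original program order.
        best = max(ready, key=lambda s: (benefit.get((last, s), 0), -s))
        order.append(best)
        placed.add(best)
    return order


def total_benefit(order, benefit):
    """Sum the gains of all adjacent pairs in the ordering."""
    return sum(benefit.get(pair, 0) for pair in zip(order, order[1:]))


def is_valid(order, deps):
    """Check that every dependence a -> b is respected by the ordering."""
    pos = {s: i for i, s in enumerate(order)}
    return all(pos[a] < pos[b] for a, b in deps)


# Hypothetical block: 0 = multiply, 1 = load, 2 = accumulate, 3 = store.
deps = {(0, 2), (1, 3)}
benefit = {(0, 2): 1}
order = restructure(4, deps, benefit)
print(order, total_benefit(order, benefit), is_valid(order, deps))
```

In this toy block the original order 0, 1, 2, 3 has no adjacent beneficial pair, whereas the greedy ordering moves the accumulate next to the multiply while keeping both dependences intact. A greedy choice of this kind carries no optimality guarantee, which is consistent with the NP-hardness noted above.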
