Compact and efficient code generation through program restructuring on limited memory embedded DSPs

Many embedded systems, such as digital cameras, digital radios, high-resolution printers, and cellular phones, make heavy use of signal processing and are therefore built around digital signal processors (DSPs). DSPs such as the TMS320C2x and the DSP5600x have irregular data paths that typically result from application-specific needs (such as chaining multiply-accumulate operations). Efficient code generation for such embedded DSPs is a challenging problem: stringent requirements such as tight memory constraints and fast response times demand code that is both compact and efficient. In this paper, we address the problem of generating such code for embedded DSPs. Most DSP instruction set architectures (ISAs) feature intra-instruction parallelism (IIP), which allows individual operations to execute in parallel when they are combined into a single complex instruction. Exploiting this parallelism reduces generated code size and improves performance. We present a code restructuring technique that fully exploits this parallelism through maximal utilization of the complex instructions present in the instruction set. We formulate this as the maximal benefit code restructuring problem: derive the arrangement of statements that maximally exploits IIP without violating data dependencies. This problem is equivalent to the precedence-constrained Hamiltonian path problem for directed acyclic graphs and to the traveling salesman problem in general, both of which are NP-hard. We present an optimal algorithm for the problem and have implemented it in a compiler that generates code for the TMS320C25 DSP. We tested our framework on a number of benchmarks and found that the performance of the generated code (measured in dynamic instruction cycle counts) improves by as much as 9.9%, with an average of 4%. The average code-size reduction over code compiled without exploiting parallelism is 2.9%. We also studied the effect of loop unrolling on the available IIP. An on-chip instruction cache can be used effectively by unrolling loops so that the generated code fully occupies that memory; the benefit is a reduction in dynamic instruction count due to the larger number of complex instructions generated. We found that unrolling loops four times to fit the available on-chip instruction cache reduces dynamic instruction counts by as much as 9.9%.
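
As an illustration of the maximal benefit code restructuring problem stated above, the following minimal Python sketch searches all dependence-respecting statement orders with a dynamic program over subsets. It is not the algorithm of this paper. The function name `best_order`, the `benefit` table (the gain when one statement immediately follows another), and the simplification that every adjacent pair contributes its gain independently are all assumptions made for the example. The dependence relation is assumed to be acyclic.

```python
from functools import lru_cache


def best_order(n, deps, benefit):
    """Return (total_benefit, order) for a best dependence-respecting order.

    n       : number of statements, labeled 0..n-1 (hypothetical labeling)
    deps    : iterable of (a, b) pairs; statement a must precede statement b
    benefit : dict mapping (i, j) to the assumed gain when j immediately
              follows i (for example, both fit one complex instruction)
    """
    # preds[j] is a bitmask of the statements that must be placed before j.
    preds = [0] * n
    for a, b in deps:
        preds[b] |= 1 << a

    full = (1 << n) - 1

    @lru_cache(maxsize=None)
    def solve(mask, last):
        # Best gain obtainable from the statements not yet in `mask`,
        # given that `last` was placed most recently (-1 if none yet).
        if mask == full:
            return 0, ()
        best = None
        for j in range(n):
            bit = 1 << j
            # Skip statements already placed or with unplaced predecessors.
            if mask & bit or (preds[j] & ~mask):
                continue
            gain = benefit.get((last, j), 0) if last >= 0 else 0
            sub_gain, sub_order = solve(mask | bit, j)
            candidate = (gain + sub_gain, (j,) + sub_order)
            if best is None or candidate[0] > best[0]:
                best = candidate
        return best

    total, order = solve(0, -1)
    return total, list(order)


# Hypothetical four-statement block: 0 must precede 2, and 1 must precede 3.
example_deps = [(0, 2), (1, 3)]
# Assumed gains; (2, 0) looks attractive but can never be used,
# because the dependence (0, 2) forbids placing 2 before 0.
example_benefit = {(0, 2): 1, (2, 1): 1, (1, 3): 1, (2, 0): 5}
example_result = best_order(4, example_deps, example_benefit)
```

The search visits every (subset, last statement) state, so its cost grows exponentially with the number of statements, which is consistent with the NP-hardness noted above; a sketch like this is only practical for short straight-line blocks.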
