A framework for parallelizing load/stores on embedded processors

Many modern embedded processors (esp. DSPs) support partitioned memory banks (also called X-Y memory or dual bank memory) along with parallel load/store instructions to achieve code density and/or performance. In order to effectively utilize the parallel load/store instructions, the compiler must partition the memory resident values into X or Y bank. This paper gives a post-register allocation solution to merge the generated load/store instructions into their parallel counter-parts. Simultaneously, our framework performs allocation of values to X or Y memory banks. We first remove as many load/stores and register-register moves through an excellent iterated coalescing based register allocator by Appel and George. We then attempt to maximally parallelize the generated load/stores using a multi-pass approach with minimal growth in terms of memory requirements. The first phase of our approach attempts the merger of load stores without replication of values in memory. We model this problem in terms of a graph coloring problem in which each value is colored X or Y. We then construct a Motion Scheduling Graph (MSG) based on the range of motion for each load/store instruction. MSG reflects potential instructions which could be merged. We propose a notion of pseudo-fixed boundaries so that the load/store movement is minimally affected by register dependencies. We prove that the coloring problem for MSG is NP-complete. We then propose a heuristic solution, which minimally replicates load/stores on different control flow paths if necessary. Finally, the remaining load/stores are tackled by register rematerialization and local conflicts are eliminated Registers are re-assigned to create motion ranges if opportunities are found for merger which are hindered by local assignment of registers. We show that our framework results in parallelization of a large number of load/stores without much growth in data and code segments.

[1]  Sharad Malik,et al.  Simultaneous reference allocation in code generation for dual data memory bank ASIPs , 2000, TODE.

[2]  Marvin V. Zelkowitz,et al.  Programming Languages: Design and Implementation , 1975 .

[3]  Steven S. Muchnick,et al.  Advanced Compiler Design and Implementation , 1997 .

[4]  Ronald L. Rivest,et al.  Introduction to Algorithms , 1990 .

[5]  Robert Azencott,et al.  Simulated annealing : parallelization techniques , 1992 .

[6]  Yunheung Paek,et al.  Efficient register and memory assignment for non-orthogonal architectures via graph coloring and MST algorithms , 2002, LCTES/SCOPES '02.

[7]  Kenneth Steiglitz,et al.  Combinatorial Optimization: Algorithms and Complexity , 1981 .

[8]  Jack W. Davidson,et al.  Memory access coalescing: a technique for eliminating redundant memory accesses , 1994, PLDI '94.

[9]  Ehl Emile Aarts,et al.  Simulated annealing and Boltzmann machines , 2003 .

[10]  Alfred V. Aho,et al.  Compilers: Principles, Techniques, and Tools , 1986, Addison-Wesley series in computer science / World student series edition.

[11]  Rainer Leupers,et al.  Variable partitioning for dual memory bank DSPs , 2001, 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221).

[12]  Paul Chow,et al.  Exploiting dual data-memory banks in digital signal processors , 1996, ASPLOS VII.

[13]  Andrew W. Appel,et al.  Iterated register coalescing , 1996, POPL '96.

[14]  Keith D. Cooper,et al.  Compiler-controlled memory , 1998, ASPLOS VIII.

[15]  Keith D. Cooper,et al.  Enhanced code compression for embedded RISC processors , 1999, PLDI '99.

[16]  Emile H. L. Aarts,et al.  Simulated annealing and Boltzmann machines - a stochastic approach to combinatorial optimization and neural computing , 1990, Wiley-Interscience series in discrete mathematics and optimization.

[17]  Andrew W. Appel,et al.  Optimal spilling for CISC machines with few registers , 2001, PLDI '01.

[18]  Edward A. Lee,et al.  Direct synthesis of optimized DSP assembly code from signal flow block diagrams , 1992, [Proceedings] ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[19]  Ashish Zumbarlal Bhalgat INSTRUCTION SCHEDULING TO HIDE LOAN/STORE LATENCY IN IRREGULAR ARCHITECTURE EMBEDDED PROCESSORS , 2001 .

[20]  Sharad Malik,et al.  Memory bank and register allocation in software synthesis for ASIPs , 1995, ICCAD.

[21]  Dsp Division,et al.  DSP56600 16-bit Digital Signal Processor Family Manual , 1996 .