Array scalarization in high-level synthesis

Parallelism across loop iterations in behavioral specifications can typically be exposed and exploited using well-known techniques such as loop unrolling. However, because behavioral arrays are usually mapped to memories (SRAM) during synthesis, the limited number of memory ports becomes a performance bottleneck. We study array scalarization, the transformation of an array into a group of scalar variables. We propose a technique that selectively scalarizes arrays to improve the performance of synthesized designs, weighing the latency benefit against the area overhead of storing array elements in discrete registers instead of denser SRAM. Our experiments on several benchmark examples indicate promising speedups of more than 10x for several designs due to scalarization.
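To make the transformation concrete, the following is a minimal sketch in C of what scalarizing a small array might look like at the source level. The function names, the array size N = 4, and the doubling computation are hypothetical and chosen only for illustration; this is not the selection technique proposed here, which additionally decides which arrays are worth scalarizing given the area cost.

```c
#define N 4  /* hypothetical array size, for illustration only */

/* Array form: a[] would typically be mapped to an SRAM. Even if both
   loops are fully unrolled, the accesses to a[] compete for the
   memory's ports and so tend to be serialized by the scheduler. */
int sum_with_array(const int in[N]) {
    int a[N];
    int s = 0;
    for (int i = 0; i < N; i++)
        a[i] = in[i] * 2;
    for (int i = 0; i < N; i++)
        s += a[i];
    return s;
}

/* Scalarized form: each element of a[] becomes its own scalar, which a
   synthesis tool can place in a discrete register. The accesses no
   longer share memory ports, so they can be scheduled in parallel, at
   the cost of register area. */
int sum_scalarized(const int in[N]) {
    int a0 = in[0] * 2;
    int a1 = in[1] * 2;
    int a2 = in[2] * 2;
    int a3 = in[3] * 2;
    return (a0 + a1) + (a2 + a3);
}
```

Both functions compute the same result for any input; only the storage implied for the intermediate values differs. Note that this rewriting assumes every index into the array is known at compile time, which is the case here once the loops are unrolled.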
