Custom data layout for memory parallelism

We describe a generalized approach for deriving a custom data layout across multiple memory banks for array-based computations. The goal is to enable high-bandwidth parallel memory accesses in modern architectures, where multiple memory banks can simultaneously feed one or more functional units. Rather than using a fixed data layout, we select application-specific layouts according to the access patterns in the code. A unique feature of this approach is its flexibility in the presence of code reordering transformations, such as the loop nest transformations commonly applied to array-based computations. We have implemented this algorithm in the DEFACTO system, a design environment for automatically mapping C programs to hardware implementations for FPGA-based systems. We present experimental results for five multimedia kernels that demonstrate the benefits of this approach. Our results show that custom data layout performs as well as, or better than, naive or fixed cyclic layouts, and is significantly better for certain access patterns and in the presence of code reordering transformations. When custom data layout is combined with unrolling the loops in a nest to expose instruction-level parallelism, we observe a reduction of more than 75% in the number of memory access cycles and speedups ranging from 3.96 to 46.7 for 8 memories, compared with using a single memory with no unrolling.
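To make the contrast between a fixed cyclic layout and an access-pattern-specific layout concrete, the following is a minimal sketch. It is not the DEFACTO algorithm. The bank count, the stride-2 access pattern, the unroll factor, and every function name are assumptions chosen for illustration. The sketch assumes each bank serves one access per cycle, so a group of simultaneous accesses costs as many cycles as the busiest bank.

```python
from collections import Counter

NUM_BANKS = 4  # assumed number of memory banks (illustrative)


def cyclic_bank(index):
    """Fixed cyclic layout: element i is stored in bank i mod NUM_BANKS."""
    return index % NUM_BANKS


def stride_aware_bank(index, stride=2):
    """Hypothetical custom layout for a stride-2 access pattern:
    consecutive *accessed* elements go to consecutive banks."""
    return (index // stride) % NUM_BANKS


def access_cycles(indices, bank_of):
    """Cycles needed to serve one group of parallel accesses, assuming
    each bank serves a single access per cycle."""
    counts = Counter(bank_of(i) for i in indices)
    return max(counts.values())


def total_cycles(n, unroll, stride, bank_of):
    """Total memory access cycles for a loop reading A[i], A[i + stride], ...
    over indices below n, unrolled by `unroll`."""
    total = 0
    for base in range(0, n, unroll * stride):
        group = [base + k * stride for k in range(unroll)]
        total += access_cycles(group, bank_of)
    return total


cyclic = total_cycles(64, unroll=4, stride=2, bank_of=cyclic_bank)
custom = total_cycles(64, unroll=4, stride=2, bank_of=stride_aware_bank)
print(cyclic, custom)
```

Under these assumptions, the cyclic layout places each group of four stride-2 accesses in only two banks, so every group takes two cycles, while the stride-aware layout spreads the same accesses over all four banks and serves each group in one cycle.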
