SAMS multi-layout memory: providing multiple views of data to boost SIMD performance

We propose to bridge the discrepancy between how data is represented in memory and the representation favored by the SIMD processor by customizing the low-level address mapping. To achieve this, we employ the extended Single-Affiliation Multiple-Stride (SAMS) parallel memory scheme at an appropriate level in the memory hierarchy. This level of memory provides the processor with both Array of Structures (AoS) and Structure of Arrays (SoA) views of structured data, so that it appears to maintain multiple layouts for the same data. With such a multi-layout memory, optimal SIMDization can be achieved. Our synthesis results using TSMC 90 nm CMOS technology indicate that the SAMS Multi-Layout Memory system can be implemented efficiently in hardware, with a critical path delay of less than 1 ns and moderate hardware overhead. Experimental evaluation based on a modified IBM Cell processor model suggests that our approach can decrease the dynamic instruction count by up to 49% for a selection of real applications and kernels. Under the same conditions, the total execution time can be reduced by up to 37%.
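To make the two views concrete, the following is a minimal software sketch of reading the same stored data either record by record (AoS) or field by field (SoA). It only illustrates the idea of two views over one copy of the data; it is not the SAMS hardware address mapping, and the names `aos_index`, `aos_view`, and `soa_view` are hypothetical helpers introduced here for illustration.

```python
from typing import List


def aos_index(elem: int, field: int, num_fields: int) -> int:
    """Linear address of one field of one record when records are stored back to back (AoS)."""
    return elem * num_fields + field


def aos_view(memory: List[int], elem: int, num_fields: int) -> List[int]:
    """AoS view: all fields of a single record, read as one unit-stride access."""
    base = aos_index(elem, 0, num_fields)
    return memory[base:base + num_fields]


def soa_view(memory: List[int], field: int, num_fields: int,
             start: int, lanes: int) -> List[int]:
    """SoA view: the same field from `lanes` consecutive records.

    Over AoS storage this is a strided access with stride `num_fields`.
    """
    return [memory[aos_index(start + k, field, num_fields)] for k in range(lanes)]


# Four records with three fields each (x, y, z), stored once in AoS order.
memory = [10, 20, 30,
          11, 21, 31,
          12, 22, 32,
          13, 23, 33]

record_1 = aos_view(memory, elem=1, num_fields=3)                 # one whole record
all_x = soa_view(memory, field=0, num_fields=3, start=0, lanes=4)  # x of every record
```

In software, the strided gather in `soa_view` costs extra instructions per SIMD lane; the abstract's premise is that a memory level able to serve such strided accesses directly removes that rearrangement work from the processor.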
