Automatic Prefetch and Modulo Scheduling Transformations for the Cell BE Architecture

Ease of programming is one of the main requirements for the broad acceptance of multicore systems that lack hardware support for transparent data transfer between local and global memories. A software cache is a robust way to give the user a transparent view of the memory architecture, but it can suffer from poor performance. In this paper, we propose a hierarchical, hybrid software-cache architecture designed to enable prefetch techniques. Memory accesses are classified at compile time into two classes: high locality and irregular. Our approach then steers each memory reference toward one of two cache structures, each optimized for its access pattern. These structures allow high-level compiler optimizations to aggressively unroll loops, reorder cache references, and/or transform surrounding loops so as to practically eliminate the software-cache overhead in the innermost loop. The cache design also enables automatic prefetch and modulo scheduling transformations. Performance evaluation indicates that the optimized software-cache structures, combined with the proposed prefetch techniques, translate into speedups of between 10 and 20 percent. As a result, for a set of parallel NAS applications, performance on the Cell BE processor is similar to that of a modern server-class multicore such as the IBM PowerPC 970MP processor.
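To make the idea of removing software-cache overhead from the innermost loop concrete, the following is a minimal sketch, not the paper's implementation. It uses a single toy direct-mapped cache with two access strategies, whereas the paper describes two distinct cache structures. All names (`SoftwareCache`, `LINE_WORDS`, `sum_high_locality`, `sum_irregular`) and the line size are hypothetical, and a Python list stands in for global memory.

```python
LINE_WORDS = 16  # assumed cache-line size in elements (hypothetical value)


class SoftwareCache:
    """Toy direct-mapped software cache over a list acting as global memory."""

    def __init__(self, memory, num_lines=8):
        self.memory = memory
        self.num_lines = num_lines
        self.tags = [None] * num_lines   # which line each slot currently holds
        self.lines = [None] * num_lines  # local copies of the cached lines
        self.lookups = 0                 # number of tag checks performed
        self.transfers = 0               # number of simulated line transfers

    def get_line(self, line_no):
        """Tag check for one line; copy it from global memory on a miss."""
        self.lookups += 1
        slot = line_no % self.num_lines
        if self.tags[slot] != line_no:
            start = line_no * LINE_WORDS
            self.lines[slot] = self.memory[start:start + LINE_WORDS]
            self.tags[slot] = line_no
            self.transfers += 1
        return self.lines[slot]

    def read(self, index):
        """Per-element access: one tag check for every reference."""
        return self.get_line(index // LINE_WORDS)[index % LINE_WORDS]


def sum_irregular(cache, indices):
    """Irregular references: the lookup stays inside the loop body."""
    return sum(cache.read(i) for i in indices)


def sum_high_locality(cache, start, stop):
    """Strided references: one lookup per line, then a check-free inner loop."""
    total = 0
    i = start
    while i < stop:
        line_no = i // LINE_WORDS
        line = cache.get_line(line_no)                  # hoisted lookup
        offset = i % LINE_WORDS
        chunk_end = min(stop, (line_no + 1) * LINE_WORDS)
        for value in line[offset:offset + (chunk_end - i)]:
            total += value                              # no cache check here
        i = chunk_end
    return total
```

In this sketch, a sequential sweep over 64 elements performs only 4 tag checks through `sum_high_locality`, while the same sweep through `sum_irregular` performs 64. That difference is the kind of per-access overhead that compile-time classification of references aims to remove.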
