Optimizing explicit data transfers for data parallel applications on the Cell architecture

In this paper we investigate a general approach to automating some deployment decisions for a class of applications on multi-core computers. We consider data-parallelizable programs that use the well-known double-buffering technique to bring data from slow off-chip memory into the local memory of the cores via a DMA (direct memory access) mechanism. Based on the computation time and size of elementary data items, as well as the DMA characteristics, we derive optimal and near-optimal values for the number of blocks that should be clustered in a single DMA command. We then extend the results to the case where the computation for one data item needs some data from its neighborhood. In this setting we characterize the performance of several alternative mechanisms for data sharing. Our models are validated experimentally using a cycle-accurate simulator of the Cell Broadband Engine architecture.
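The abstract does not reproduce the paper's cost model, so the following is only a minimal sketch of the kind of trade-off it describes, under assumptions that are ours and not the authors': a DMA command that moves `s` items costs a fixed setup time plus a per-byte time, computation takes a constant time per item, and double buffering requires two buffers of `s` items to fit in local memory. All names and parameter values are hypothetical. The sketch treats every transfer as full-sized, so it slightly overestimates the total time when `s` does not divide `n`.

```python
from math import ceil

def dma_time(s, item_bytes, setup, per_byte):
    """Assumed linear DMA cost: fixed setup plus a per-byte term."""
    return setup + per_byte * item_bytes * s

def total_time(n, s, compute, item_bytes, setup, per_byte):
    """Estimated time to process n items with double buffering,
    fetching s items per DMA command (assumed model, not the paper's)."""
    m = ceil(n / s)                      # number of DMA commands
    t = dma_time(s, item_bytes, setup, per_byte)
    c = compute * s                      # compute time for one buffer
    # First fetch cannot be overlapped; in steady state each step costs
    # the slower of (fetch next buffer, compute current buffer);
    # the last buffer is computed with nothing left to fetch.
    return t + (m - 1) * max(t, c) + c

def best_granularity(n, compute, item_bytes, setup, per_byte, local_bytes):
    """Exhaustively pick the s that minimises total_time, subject to
    two buffers of s items fitting in local memory."""
    s_max = min(n, local_bytes // (2 * item_bytes))
    if s_max < 1:
        raise ValueError("local memory cannot hold two single-item buffers")
    return min(
        range(1, s_max + 1),
        key=lambda s: total_time(n, s, compute, item_bytes, setup, per_byte),
    )

if __name__ == "__main__":
    # Hypothetical numbers, in cycles and bytes.
    s = best_granularity(n=1024, compute=10, item_bytes=16,
                         setup=200, per_byte=0.25, local_bytes=4096)
    print(s, total_time(1024, s, 10, 16, 200, 0.25))
```

With small `s` the fixed setup cost dominates and the pipeline is transfer-bound; with large `s` the unoverlapped first fetch grows and the buffers may no longer fit in local memory. A brute-force search is used here only because the range of `s` is small; it is not meant to stand in for the closed-form values the paper derives.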
