Optimal 2D Data Partitioning for DMA Transfers on MPSoCs

Reducing the effects of off-chip memory access latency is a key factor in exploiting efficiently embedded multicore platforms. We consider architectures that admit a multi-core computation fabric, having its own fast and small memory to which the data blocks to be processed are fetched from external memory using a DMA (direct memory access) engine, employing a double- or multiple-buffering scheme to avoid processor idling. In this paper we focus on application programs that process two dimensional data arrays and we determine automatically the size and shape of the portions of the data array which are subject to a single DMA call, based on hardware and applications parameters. When the computation on different array elements are completely independent, the asymmetry of memory structure leads always to prefer one-dimensional horizontal pieces of memory, while when the computation of a data element shares some data with its neighbors, there is a pressure for more "square" shapes to reduce the amount of redundant data transfers. We provide an analytic model for this optimization problem and validate our results by running a mean filter application on the CELL simulator.

[1]  P. Hanrahan,et al.  Sequoia: Programming the Memory Hierarchy , 2006, ACM/IEEE SC 2006 Conference (SC'06).

[2]  Anant Agarwal,et al.  Automatic Partitioning of Parallel Loops and Data Arrays for Distributed Shared-Memory Multiprocessors , 1995, IEEE Trans. Parallel Distributed Syst..

[3]  Selma Saidi,et al.  Optimizing explicit data transfers for data parallel applications on the cell architecture , 2012, TACO.

[4]  Mounir Hamdi,et al.  Parallel Image Processing Applications on a Network of Workstations , 1995, Parallel Comput..

[5]  Benjamin Rose,et al.  A comparison of programming models for multiprocessors with explicitly managed memory hierarchies , 2009, PPoPP '09.

[6]  H. Peter Hofstee,et al.  Introduction to the Cell multiprocessor , 2005, IBM J. Res. Dev..

[7]  Michael Gschwind The Cell Broadband Engine: Exploiting Multiple Levels of Parallelism in a Chip Multiprocessor , 2007, International Journal of Parallel Programming.

[8]  Darren J. Kerbyson,et al.  Analysis of double buffering on two different multicore architectures: Quad-core Opteron and the Cell-BE , 2008, 2008 IEEE International Symposium on Parallel and Distributed Processing.

[9]  Hermann Kopetz,et al.  The time-triggered architecture , 1998, Proceedings First International Symposium on Object-Oriented Real-Time Distributed Computing (ISORC '98).

[10]  John D. Owens,et al.  GPU Computing , 2008, Proceedings of the IEEE.

[11]  Fabrizio Petrini,et al.  Cell Multiprocessor Communication Network: Built for Speed , 2006, IEEE Micro.

[12]  J.C. Sancho,et al.  Quantifying the Potential Benefit of Overlapping Communication and Computation in Large-Scale Scientific Applications , 2006, ACM/IEEE SC 2006 Conference (SC'06).

[13]  Dimitrios S. Nikolopoulos,et al.  Programming Multiprocessors with Explicitly Managed Memory Hierarchies , 2009, Computer.

[14]  D. Altilar,et al.  Minimum Overhead Data Partitioning Algorithms for Parallel Video Processing , 2001 .