Optimizing explicit data transfers for data parallel applications on the cell architecture
[1] Lian Li,et al. Barrier synchronization for CELL multi-processor architecture , 2008, 2008 First IEEE International Conference on Ubi-Media Computing.
[2] Fabrizio Petrini,et al. Cell Multiprocessor Communication Network: Built for Speed , 2006, IEEE Micro.
[3] Ashok Srinivasan,et al. Optimizing assignment of threads to SPEs on the cell BE processor , 2009, 2009 IEEE International Symposium on Parallel & Distributed Processing.
[4] P. Hanrahan,et al. Sequoia: Programming the Memory Hierarchy , 2006, ACM/IEEE SC 2006 Conference (SC'06).
[5] Michael Wolfe,et al. More iteration space tiling , 1989, Proceedings of the 1989 ACM/IEEE Conference on Supercomputing (Supercomputing '89).
[6] Kathryn S. McKinley,et al. Tile size selection using cache organization and data layout , 1995, PLDI '95.
[7] Anant Agarwal,et al. Automatic Partitioning of Parallel Loops and Data Arrays for Distributed Shared-Memory Multiprocessors , 1995, IEEE Trans. Parallel Distributed Syst..
[8] Paul M. Carpenter,et al. Buffer Sizing for Self-timed Stream Programs on Heterogeneous Distributed Memory Multiprocessors , 2010, HiPEAC.
[9] Jean-Loup Baer,et al. A performance study of software and hardware data prefetching schemes , 1994, ISCA '94.
[10] Sanguthevar Rajasekaran,et al. SPENK: adding another level of parallelism on the cell broadband engine , 2008, IFMT '08.
[11] Benjamin Rose,et al. A comparison of programming models for multiprocessors with explicitly managed memory hierarchies , 2009, PPoPP '09.
[12] Monica S. Lam,et al. The cache performance and optimizations of blocked algorithms , 1991, ASPLOS IV.
[13] Alexander V. Veidenbaum,et al. An Integrated Hardware/Software Data Prefetching Scheme for Shared-Memory Multiprocessors , 1994, 1994 International Conference on Parallel Processing Vol. 2.
[14] Kathryn M. O'Brien,et al. Optimizing the Use of Static Buffers for DMA on a CELL Chip , 2006, LCPC.
[15] Christian Zinner,et al. ROS-DMA: A DMA Double Buffering Method for Embedded Image Processing with Resource Optimized Slicing , 2006, 12th IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS'06).
[16] Kathryn S. McKinley,et al. Guided region prefetching: a cooperative hardware/software approach , 2003, ISCA '03.
[17] Jason Fritts. Multi-level memory prefetching for media and stream processing , 2002, Proceedings. IEEE International Conference on Multimedia and Expo.
[18] Jean-Loup Baer,et al. Effective Hardware Based Data Prefetching for High-Performance Processors , 1995, IEEE Trans. Computers.
[19] Dimitrios S. Nikolopoulos,et al. Programming Multiprocessors with Explicitly Managed Memory Hierarchies , 2009, Computer.
[20] H. Nussbaumer. Fast Fourier transform and convolution algorithms , 1981.
[21] Darren J. Kerbyson,et al. Analysis of double buffering on two different multicore architectures: Quad-core Opteron and the Cell-BE , 2008, 2008 IEEE International Symposium on Parallel and Distributed Processing.
[22] Leslie G. Valiant,et al. A bridging model for parallel computation , 1990, CACM.
[23] Karim Esseghir. Improving data locality for caches , 1993.
[24] Jordi Torres,et al. CellMT: A cooperative multithreading library for the Cell/B.E. , 2009, 2009 International Conference on High Performance Computing (HiPC).
[25] Fabrizio Petrini,et al. Multicore Surprises: Lessons Learned from Optimizing Sweep3D on the Cell Broadband Engine , 2007, 2007 IEEE International Parallel and Distributed Processing Symposium.
[26] Michael Gschwind. The Cell Broadband Engine: Exploiting Multiple Levels of Parallelism in a Chip Multiprocessor , 2007, International Journal of Parallel Programming.
[27] Xizhou Feng,et al. Modeling Multigrain Parallelism on Heterogeneous Multi-core Processors: A Case Study of the Cell BE , 2008, HiPEAC.
[28] Ken Kennedy,et al. Software prefetching , 1991, ASPLOS IV.
[29] Anoop Gupta,et al. Tolerating Latency Through Software-Controlled Prefetching in Shared-Memory Multiprocessors , 1991, J. Parallel Distributed Comput..
[30] J.C. Sancho,et al. Quantifying the Potential Benefit of Overlapping Communication and Computation in Large-Scale Scientific Applications , 2006, ACM/IEEE SC 2006 Conference (SC'06).
[31] Alexander V. Veidenbaum,et al. An Integrated Hardware/Software Data Prefetching Scheme for Shared-Memory Multiprocessors , 2004, International Journal of Parallel Programming.
[32] Ramesh Subramonian,et al. LogP: towards a realistic model of parallel computation , 1993, PPOPP '93.
[33] H. Peter Hofstee,et al. Key features of the design methodology enabling a multi-core SoC implementation of a first-generation CELL processor , 2006, Asia and South Pacific Conference on Design Automation, 2006.
[34] Bowen Alpern,et al. A model for hierarchical memory , 1987, STOC.
[35] Michel Dubois,et al. Sequential Hardware Prefetching in Shared-Memory Multiprocessors , 1995, IEEE Trans. Parallel Distributed Syst..