Exploring a Novel Gathering Method for Finite Element Codes on the Cell/B.E. Architecture

Indirect addressing is known to be slow on conventional architectures because data must be gathered before any computation can begin. Many methods have been proposed for optimizing indirect addressing. However, almost all of them merely change the order in which data is accessed so as to make better use of the cache. Furthermore, vector instructions cannot be used because the data is not accessed contiguously, so valuable processing power goes unexploited. The Cell/B.E. architecture has multiple powerful DMA engines that are suitable for gathering scattered data. Unfortunately, at fine data granularity they are subject to several constraints that make them inefficient. This paper explores a novel solution called DMA List Interlacing (DLI), which overcomes the DMA constraints and enables the use of vector instructions without any extra effort. It is shown that DLI can achieve speedups of several orders of magnitude compared to conventional processors.
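
To make the cost of indirect addressing concrete, the following minimal C sketch contrasts a loop that reads its operands through an index array with the conventional gather-then-compute pattern. It is not the DLI method, whose details are not given in this abstract, and it uses no Cell/B.E. DMA calls. All function and parameter names (sum_indirect, gather, scale_contiguous, values, index) are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Indirect addressing: every operand is fetched through an index array,
   so consecutive iterations may touch non-contiguous memory. */
double sum_indirect(const double *values, const int *index, size_t n)
{
    double acc = 0.0;
    for (size_t i = 0; i < n; ++i)
        acc += values[index[i]];
    return acc;
}

/* Gather step: copy the scattered operands into a contiguous buffer.
   dst must have room for n elements. */
void gather(double *dst, const double *values, const int *index, size_t n)
{
    assert(dst != NULL && values != NULL && index != NULL);
    for (size_t i = 0; i < n; ++i)
        dst[i] = values[index[i]];
}

/* Compute step on the gathered data: a unit-stride loop, which is the
   access pattern that vector instructions require. */
void scale_contiguous(double *buf, size_t n, double factor)
{
    for (size_t i = 0; i < n; ++i)
        buf[i] *= factor;
}
```

In this sketch the gather loop is the extra step the abstract refers to: it runs on the processor itself before scale_contiguous can operate on contiguous data.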
