Data Locality Exploitation in the Decomposition of Regular Domain Problems

This paper studies how exploitation of the local memory hierarchy and of the communication network affects message sending, and how that effect in turn influences the decomposition of regular applications. We consider two parallel computers, a Cray T3E-900 and an SGI Origin 2000. On both systems, the bandwidth reduction caused by non-unit-stride memory access is significant and can outweigh the reduction caused by network contention. These findings affect the choice of optimal decomposition for regular domain problems. Although traditional 3D decompositions give lower inherent communication-to-computation ratios and could exploit the interconnection network more efficiently, lower-dimensional decompositions are found to be more efficient because of the effect of the data decomposition on the spatial locality of the messages to be communicated. The growing importance of local optimisations is also shown with a well-known communication-computation overlapping technique, which increases execution time rather than reducing it as might be expected, because of poor cache memory exploitation.
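The trade-off between communication-to-computation ratio and message locality can be illustrated with a minimal sketch. It is not the paper's method or measurement: the function name `decomposition_stats`, the cubic grid, the row-major layout (z varies fastest), the absence of ghost-cell padding, and the one-point-deep boundary exchange with two neighbours per cut axis are all assumptions made for illustration. For each decomposition it reports the boundary-to-interior point ratio and, for each face that must be sent, the length of the longest contiguous run of elements in memory (a run of 1 means a fully strided access).

```python
def decomposition_stats(n, procs):
    """Estimate communication cost and message locality for one interior process.

    n     -- grid points per side of the global cubic grid (assumed)
    procs -- (px, py, pz), number of processes along each axis (assumed)

    The local block is assumed to be stored in row-major order, indexed
    [x][y][z] with z varying fastest, and without ghost-cell padding.
    """
    local = [n // p for p in procs]            # local block shape (lx, ly, lz)
    volume = local[0] * local[1] * local[2]    # points computed per process

    boundary_points = 0
    runs = {}
    for axis, p in enumerate(procs):
        if p == 1:
            continue                           # axis not cut: nothing to exchange
        area = volume // local[axis]           # points on one face normal to this axis
        boundary_points += 2 * area            # two neighbours along each cut axis

        # Contiguous run length inside a face normal to `axis`:
        # the product of the extents of all faster-varying axes.
        run = 1
        for inner in range(axis + 1, 3):
            run *= local[inner]
        runs["xyz"[axis]] = run

    return {"ratio": boundary_points / volume, "runs": runs}


if __name__ == "__main__":
    # Hypothetical example: a 256^3 grid distributed over 64 processes.
    for name, procs in [("1D", (64, 1, 1)), ("2D", (8, 8, 1)), ("3D", (4, 4, 4))]:
        stats = decomposition_stats(256, procs)
        print(name, "ratio =", stats["ratio"], "contiguous runs =", stats["runs"])
```

Under these assumptions the 3D decomposition has the smallest ratio, but one of its faces is gathered element by element with a stride equal to the local z extent, whereas the 1D decomposition sends a single contiguous block. The sketch only counts points and run lengths; whether the strided faces actually dominate the cost depends on the machine, which is the question the paper examines on the T3E and the Origin 2000.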
