Classifying and alleviating the communication overheads in matrix computations on large-scale NUMA multiprocessors

Abstract Large-scale shared-memory multiprocessors have non-uniform memory access (NUMA) costs, and high communication cost dominates the execution time of matrix computations. Memory contention and remote memory access are the two major communication overheads on large-scale NUMA multiprocessors. However, previous experiments and discussions have focused either on reducing the number of remote memory accesses or on alleviating memory contention, but not both. In this paper, we propose a simple but effective processor allocation policy, called rectangular processor allocation, that alleviates both overheads at the same time. The policy divides the matrix elements into a number of rectangular blocks and assigns each processor to compute the results of one rectangular block. This partitioning can eliminate many unnecessary accesses to remote memory modules. Experiments with many matrix computations on a realistic memory-system simulator confirm that at least one-fourth of the communication overhead can be removed. We therefore conclude that the rectangular processor allocation policy performs better than other popular policies, and that combining it with a software-interleaved data allocation policy is a better choice for alleviating communication overhead.
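The block partition described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `rectangular_blocks` and the even-split formula are assumptions, and the sketch only shows how an n × n result matrix is divided into a pr × pc grid of rectangular blocks, one per processor.

```python
# Hypothetical sketch of the rectangular processor allocation policy:
# an n x n result matrix is split into a pr x pc grid of rectangular
# blocks, and each of the pr * pc processors computes one block.

def rectangular_blocks(n, pr, pc):
    """Return, for each processor, the (row_range, col_range) of its block,
    where each range is a half-open (start, end) pair of matrix indices."""
    # Edge index i*n//pr spreads rows as evenly as possible across pr groups.
    row_edges = [(i * n) // pr for i in range(pr + 1)]
    col_edges = [(j * n) // pc for j in range(pc + 1)]
    blocks = []
    for i in range(pr):
        for j in range(pc):
            blocks.append(((row_edges[i], row_edges[i + 1]),
                           (col_edges[j], col_edges[j + 1])))
    return blocks
```

Under this layout, a processor computing the block with rows R and columns C of the result of a matrix multiply only reads rows R of one operand and columns C of the other, so it touches fewer memory modules than a row-striped partition would, which is the intuition behind reducing both remote accesses and contention.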
