Performance Benefits and Limitations of Large NUMA Multiprocessors

Abstract: In scalable multiprocessor architectures, the time required for a processor to access memory varies with the portion of memory being accessed. In this paper, we consider how this characteristic affects performance by comparing it to the ideal but unrealizable case in which access times to all memory modules remain constant even as the number of processors grows. We examine several application kernels to investigate how well they execute on various NUMA systems with a hierarchical memory structure. The results of our analytic model show that access locality is much more important in NUMA architectures than in UMA architectures. The extent of the performance penalty for non-local memory accesses depends on the variability in access times to different parts of shared memory, as well as on the amount of congestion in the interconnection network that provides access to remote memory modules. For the applications we examined, we found that both the data and the computation can be partitioned and placed so that reasonable speedups are achieved on NUMA systems.
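The penalty the abstract describes can be illustrated with a toy calculation (a hypothetical sketch, not the paper's actual analytic model): if a fixed fraction of a program's memory accesses go to remote modules with higher latency, the average access time, and hence run time, grows linearly with that fraction. The latencies, the 30% memory-bound share, and the function names below are all illustrative assumptions.

```python
def effective_access_time(local_ns, remote_ns, remote_fraction):
    """Average memory access time for a mix of local and remote accesses."""
    return (1 - remote_fraction) * local_ns + remote_fraction * remote_ns

def numa_slowdown(local_ns, remote_ns, remote_fraction, memory_share=0.3):
    """Run time relative to an ideal UMA machine in which every access
    costs local_ns; memory_share is the fraction of UMA run time spent
    waiting on memory (a simple two-component model)."""
    avg = effective_access_time(local_ns, remote_ns, remote_fraction)
    # Compute portion is unchanged; memory portion scales with latency.
    return (1 - memory_share) + memory_share * (avg / local_ns)

# All accesses local: the NUMA machine matches the UMA ideal.
print(numa_slowdown(100, 500, 0.0))          # 1.0
# Half the accesses remote at 5x latency: run time grows by ~60%.
print(round(numa_slowdown(100, 500, 0.5), 2))
```

This ignores network congestion, which the paper identifies as a second source of penalty; congestion would make `remote_ns` itself grow with load, steepening the curve.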
