Performance Evaluation of Clusters with ccNUMA Nodes - A Case Study

In the quest for higher performance, and with the increasing availability of multicore chips, many systems now pack more processors per node. Adopting a ccNUMA node architecture in these cases promises a balance between cost and performance. In this paper, a 2312-core Opteron system based on Sun Fire servers is considered as a case study to examine the performance issues associated with such architectures. We characterize the performance behavior of the system, focusing on the node level under different configurations. We show that the benefits of larger nodes can be severely limited for several reasons. These causes were isolated and the associated performance losses assessed. The results reveal that the problems stem mainly from topological imbalances, limitations of the cache coherency protocol used, the distribution of operating system services, and the lack of intelligent memory-affinity management.
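Of the causes listed above, the lack of intelligent memory-affinity management is the one most directly visible to application programmers on a ccNUMA node. The sketch below is ours, not from the paper: it uses the standard Linux libnuma API to place a buffer on each NUMA node in turn and sweep it from a thread pinned to node 0, making the local-versus-remote access asymmetry observable. The buffer size, the `sweep` helper, and the one-byte-per-cache-line access pattern are illustrative assumptions.

```c
/* Minimal sketch (assumed setup, not from the paper): explicit memory
 * placement with libnuma on a ccNUMA node.  Build: gcc -O2 sweep.c -lnuma */
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <numa.h>

#define N (64UL * 1024 * 1024)   /* 64 MiB working set (illustrative) */

/* Touch one byte per 64-byte cache line and return the elapsed time.
 * 'volatile' keeps the optimizer from removing the sweep. */
static double sweep(volatile char *buf, size_t n)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < n; i += 64)
        buf[i]++;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
}

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }
    numa_run_on_node(0);           /* pin execution to node 0, so node 0's
                                      memory is local and the rest remote */
    int max_node = numa_max_node();
    for (int node = 0; node <= max_node; node++) {
        char *buf = numa_alloc_onnode(N, node);  /* bind pages to 'node' */
        if (buf == NULL) {
            perror("numa_alloc_onnode");
            continue;
        }
        memset(buf, 0, N);         /* populate the pages on that node */
        double t = sweep(buf, N);
        printf("memory on node %d: %.1f MiB/s\n",
               node, (N / (1024.0 * 1024.0)) / t);
        numa_free(buf, N);
    }
    return 0;
}
```

On a multi-socket Opteron node of the kind studied here, the sweep over node 0's own memory would be expected to run measurably faster than sweeps over buffers bound to remote nodes, with the gap growing with hop count; this is the kind of topology-dependent asymmetry the study attributes to placement that is left unmanaged by default.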
