Analyzing the Performance Bottlenecks of the POWER7-IH Network

In this work we provide an early performance analysis of the communication network in a small-scale POWER7-IH (P7-IH) processing system from IBM. Using a set of communication micro-benchmarks, we quantify the achievable bandwidth of the system's communication links, which differ in their peak performance characteristics. We also identify the bottlenecks within the communication network and show that the bandwidth a single node can inject into the network is considerably less than the bandwidth available to the IBM hub chip, which acts as a NIC to the node and is an integral part of the P7-IH network. Using a communication pattern representative of many scientific applications with regular communication patterns, we show that the default task-to-core assignment on the P7-IH achieves sub-optimal performance in most cases. We also show that a diagonal-cyclic assignment, developed in this work to take into account both the network topology and the routing strategy, can improve communication performance by up to 75%. We expect even greater improvements in communication performance on larger P7-IH systems.
