A fast and resource-conscious MPI message queue mechanism for large-scale jobs

Message Passing Interface (MPI) message queues have been shown to grow proportionally to the job size for many applications. Given this behaviour, and knowing that message queues are accessed very frequently, ensuring fast queue operations at large scale is of paramount importance as we move toward exascale computing. Scalability, however, is two-fold. With the growing processor core density per node, and the expected smaller amount of memory per core at larger scales, a queue mechanism that is blind to memory requirements poses another scalability issue even if it solves the speed-of-operation problem. In this work we propose a multidimensional queue management mechanism whose operation time and memory overhead grow sub-linearly with the job size. We show why a novel approach is justified despite the existence of well-known fast data structures such as binary search trees. We compare our proposal with a linked-list-based approach, which is not scalable in terms of speed of operation, and with an array-based method, which is not scalable in terms of memory consumption. Our proposed multidimensional approach yields queue operation speedups that translate to up to a 4-fold execution time improvement over the linked-list design for the applications studied in this work. It also shows a consistently lower memory footprint than the array-based design. Finally, compared to the linked-list-based queue, our proposed design yields cache miss rate improvements that are on average on par with those of the array-based design.

Highlights:
- A new MPI message queue design tailored for very large-scale jobs.
- A design based on a 4-D data container that exploits process rank decomposition.
- The effect of job size on message queue operations is mitigated.
- Fruitless message queue searches are optimized via early detection.
- Scalability is provided for both execution speed and memory consumption.
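To make the 4-D container idea concrete, the following C code is a minimal, hypothetical sketch rather than the authors' implementation: an MPI source rank is decomposed into four mixed-radix digits, and the queue becomes a four-level structure whose inner levels are allocated lazily. A search then touches at most four pointers, and a missing level at any depth detects an unsuccessful search early. All identifiers (queue4_t, queue4_search, the radix choice of ceil(N^(1/4)), and so on) are assumptions made for illustration.

```c
/*
 * Hypothetical sketch of a 4-D message queue container based on
 * rank decomposition. Not the paper's actual design; matching is
 * simplified to (source rank, tag).
 */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define K 4                        /* number of dimensions */

typedef struct msg {               /* pending message envelope */
    int tag;
    struct msg *next;
} msg_t;

typedef struct {
    int radix;                     /* each rank digit lies in [0, radix) */
    void **root;                   /* level-0 array; inner levels are    */
} queue4_t;                        /* void**, leaves hold msg_t* heads   */

/* Decompose a rank into K mixed-radix digits, least significant first. */
static void rank_digits(const queue4_t *q, int rank, int d[K])
{
    for (int i = 0; i < K; i++) {
        d[i] = rank % q->radix;
        rank /= q->radix;
    }
}

queue4_t *queue4_create(int job_size)
{
    queue4_t *q = malloc(sizeof *q);
    /* radix = ceil(N^(1/4)) so that radix^4 covers all ranks */
    q->radix = (int)ceil(pow((double)job_size, 1.0 / K));
    q->root  = calloc(q->radix, sizeof(void *));
    return q;
}

/* Walk to the leaf bucket for `rank`. With `create` set, allocate
 * missing levels on demand (memory grows only with populated levels);
 * otherwise a NULL level means an early, guaranteed miss. */
static msg_t **queue4_bucket(queue4_t *q, int rank, int create)
{
    int d[K];
    rank_digits(q, rank, d);
    void **level = q->root;
    for (int i = 0; i < K - 1; i++) {
        if (!level[d[i]]) {
            if (!create)
                return NULL;       /* early detection of a fruitless search */
            level[d[i]] = calloc(q->radix, sizeof(void *));
        }
        level = (void **)level[d[i]];
    }
    return (msg_t **)&level[d[K - 1]];
}

void queue4_enqueue(queue4_t *q, int rank, int tag)
{
    msg_t **head = queue4_bucket(q, rank, 1);
    msg_t *m = malloc(sizeof *m);
    m->tag  = tag;
    m->next = *head;
    *head   = m;
}

int queue4_search(queue4_t *q, int rank, int tag)
{
    msg_t **head = queue4_bucket(q, rank, 0);
    if (!head)
        return 0;                  /* no level allocated: cannot match */
    for (msg_t *m = *head; m; m = m->next)
        if (m->tag == tag)
            return 1;
    return 0;
}

int main(void)
{
    queue4_t *q = queue4_create(65536);    /* e.g., a 65,536-rank job */
    queue4_enqueue(q, 40123, 7);
    printf("hit:  %d\n", queue4_search(q, 40123, 7));   /* prints 1 */
    printf("miss: %d\n", queue4_search(q, 51234, 7));   /* prints 0 */
    return 0;
}
```

Under this decomposition, the per-source lookup cost is bounded by the fixed depth of four levels rather than by the number of active sources, and memory grows only with the levels actually populated, which is consistent with the sub-linear operation time and memory growth claimed in the abstract.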
