A hardware acceleration unit for MPI queue processing

With the heavy reliance of modern scientific applications upon the MPI Standard, it has become critical for the implementation of MPI to be as capable and as fast as possible. This has led some of the fastest modern networks to introduce the capability to offload aspects of MPI processing to an embedded processor on the network interface. With this important capability has come significant performance implications. Most notably, the time to process long queues of posted receives or unexpected messages is substantially longer on embedded processors. This paper presents an associative list matching structure to accelerate the processing of moderate length queues in MPI. Simulations are used to compare the performance of an embedded processor augmented with this capability to a baseline implementation. The proposed enhancement significantly reduces latency for moderate length queues while adding virtually no overhead for extremely short queues.

[1]  Keith D. Underwood,et al.  The impact of MPI queue usage on message latency , 2004, International Conference on Parallel Processing, 2004. ICPP 2004..

[2]  Ramesh Subramonian,et al.  LogP: towards a realistic model of parallel computation , 1993, PPOPP '93.

[3]  Chris J. Scheiman,et al.  LogGP: incorporating long messages into the LogP model—one step closer towards a realistic model for parallel computation , 1995, SPAA '95.

[4]  Dhabaleswar K. Panda,et al.  Fast NIC-based barrier over Myrinet/GM , 2001, Proceedings 15th International Parallel and Distributed Processing Symposium. IPDPS 2001.

[5]  Todd M. Austin,et al.  The SimpleScalar tool set, version 2.0 , 1997, CARN.

[6]  Rolf Riesen,et al.  Design, Implementation, and Performance of MPI on Portals 3.0 , 2003, Int. J. High Perform. Comput. Appl..

[7]  Message P Forum,et al.  MPI: A Message-Passing Interface Standard , 1994 .

[8]  Keith D. Underwood,et al.  Enhancing NIC performance for MPI using processing-in-memory , 2005, 19th IEEE International Parallel and Distributed Processing Symposium.

[9]  Keith D. Underwood,et al.  An analysis of NIC resource usage for offloading MPI , 2004, 18th International Parallel and Distributed Processing Symposium, 2004. Proceedings..

[10]  SkjellumAnthony,et al.  A high-performance, portable implementation of the MPI message passing interface standard , 1996 .

[11]  Rolf Riesen,et al.  Portals 3.0: protocol building blocks for low overhead communication , 2002, Proceedings 16th International Parallel and Distributed Processing Symposium.

[12]  Ron Brightwell,et al.  The Portals 3.0 Message Passing Interface Revision 1.0 , 1999 .

[13]  P. Wyckoff,et al.  EMP: Zero-Copy OS-Bypass NIC-Driven Gigabit Ethernet Message Passing , 2001, ACM/IEEE SC 2001 Conference (SC'01).

[14]  Greg Burns,et al.  LAM: An Open Cluster Environment for MPI , 2002 .

[15]  Keith D. Underwood,et al.  A preliminary analysis of the MPI queue characterisitics of several applications , 2005, 2005 International Conference on Parallel Processing (ICPP'05).

[16]  D.E. Culler,et al.  Effects Of Communication Latency, Overhead, And Bandwidth In A Cluster Architecture , 1997, Conference Proceedings. The 24th Annual International Symposium on Computer Architecture.

[17]  Message Passing Interface Forum MPI: A message - passing interface standard , 1994 .

[18]  Ronald Minnich,et al.  A Network-Failure-Tolerant Message-Passing System for Terascale Clusters , 2002, ICS '02.

[19]  D.K. Panda,et al.  Scalable NIC-based Reduction on Large-scale Clusters , 2003, ACM/IEEE SC 2003 Conference (SC'03).

[20]  Bernhard Plattner,et al.  Scalable high-speed prefix matching , 2001, TOCS.

[21]  William Gropp,et al.  MPICH2: A New Start for MPI Implementations , 2002, PVM/MPI.

[22]  Karl S. Hemmert,et al.  A CAD suite for high-performance FPGA design , 1999, Seventh Annual IEEE Symposium on Field-Programmable Custom Computing Machines (Cat. No.PR00375).

[23]  Wu-chun Feng,et al.  The Quadrics Network: High-Performance Clustering Technology , 2002, IEEE Micro.

[24]  D. Panda,et al.  NIC-Based Reduction in Myrinet Clusters: Is It Beneficial? , 2003 .

[25]  Paul D. Gader,et al.  Image algebra techniques for parallel image processing , 1987 .