Network Interface Architecture for Scalable Message Queue Processing

Most scientists, apart from computer scientists, are reluctant to rewrite their MPI applications for performance tuning. At the same time, the number of processing elements available to them increases year by year. On large-scale parallel systems, the number of messages accumulated in the message buffer tends to grow in some of their applications, and because searching the MPI message queue is time-consuming, scalable acceleration on the system side is needed. In this paper, a support function named LHS (Limited-length Head Separation) is proposed, and its message-buffer search performance and hardware cost are evaluated. LHS accelerates message-buffer searching by switching where the limited-length heads of messages are stored; this exploits effects such as a higher cache hit rate on the host combined with partial off-loading to hardware. When the order of message reception differs from the receiver's expectation, message-buffer searching is accelerated 14.3 times with LHS on the FPGA-based network interface card (NIC) DIMMnet-2. This absolute performance is 38.5 times higher than that of IBM BlueGene/P, even though the clock frequency is 8.5 times lower than that of BlueGene/P. The hardware cost of LHS is significantly lower than that of ALPU, a hardware accelerator for message-buffer searching, and LHS scales better than ALPU in performance per unit frequency. Therefore, LHS is more suitable for larger parallel systems.
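
To make the idea concrete, the sketch below shows, in generic C, why separating the short match-relevant "heads" of queued messages from the full descriptors makes queue searching more cache-friendly: only a dense array of small keys is scanned, and the cold, full-sized entry is touched only on a hit. This is an illustrative sketch under assumed structures and names (msg_head_t, msg_entry_t, queue_match), not the LHS implementation or DIMMnet-2 code; wildcards are modeled with the value -1.

    /* Illustrative sketch only; not the paper's LHS implementation. */
    #include <stddef.h>
    #include <stdint.h>

    /* Full descriptor of a buffered (unexpected) message: large, cold data. */
    typedef struct {
        int32_t  src;
        int32_t  tag;
        int32_t  context;     /* communicator context id */
        size_t   length;
        void    *payload;     /* buffered message body   */
    } msg_entry_t;

    /* Limited-length "head": only the fields needed for matching. */
    typedef struct {
        int32_t src;
        int32_t tag;
        int32_t context;
    } msg_head_t;

    /* Queue with heads stored separately from the full entries. */
    typedef struct {
        msg_head_t  *heads;   /* dense array scanned during matching   */
        msg_entry_t *entries; /* parallel array, touched only on a hit */
        size_t       count;
    } msg_queue_t;

    /* Linear match over the compact head array (-1 acts as a wildcard). */
    static msg_entry_t *queue_match(msg_queue_t *q,
                                    int32_t src, int32_t tag, int32_t context)
    {
        for (size_t i = 0; i < q->count; i++) {
            const msg_head_t *h = &q->heads[i];
            if (h->context != context)
                continue;
            if (src != -1 && h->src != src)
                continue;
            if (tag != -1 && h->tag != tag)
                continue;
            return &q->entries[i];  /* fetch the cold entry only on a match */
        }
        return NULL;                /* no buffered message matched */
    }

In this layout the scanned working set is a few integers per message rather than a full descriptor, which is the kind of host-cache effect the abstract alludes to; deciding where the head array lives (host memory versus NIC hardware) is the switching aspect that LHS additionally provides.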
