VELO: A Novel Communication Engine for Ultra-Low Latency Message Transfers

This paper presents a novel stateless, virtualized communication engine for sub-microsecond latency. Using a field-programmable gate array (FPGA) based prototype, we show a latency of 970 ns between two machines with our Virtualized Engine for Low Overhead (VELO). The FPGA device is connected directly to the CPUs by a HyperTransport link. The hardware architecture is optimized for small messages and avoids the overhead typically found with transfers controlled by direct memory access (DMA). Because the engine is stateless, many threads and processes can use the hardware unit directly and simultaneously. It provides secure user-level communication with a highly optimized start-up phase. Microbenchmark results are reported both for a proprietary API and for Open MPI.
