Hiding message delivery latency using Direct-to-Cache-Transfer techniques in message passing environments

Communication overhead is the key obstacle to reaching hardware performance limits. Most of this overhead originates in software, and a significant portion of it is attributed to message copying. To reduce this copying overhead, we have devised techniques that do not require copying a received message in order to bind it to its final destination. Instead, a late-binding mechanism, which involves address translation and a dedicated cache, provides the consuming process/thread with fast access to received messages. We introduce two policies, namely Direct-to-Cache Transfer (DTCT) and lazy DTCT, that determine whether a message, once bound, needs to be transferred into the data cache. We have studied the proposed methods in simulation and shown their effectiveness in reducing the consuming process's access times to message payloads.
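
The abstract does not describe an interface, so the following is only a minimal sketch, in C, of how the two policies might differ: under DTCT the payload is pushed into the data cache when the message is bound to its final address, whereas under lazy DTCT the transfer is deferred until the consuming thread first touches the payload. All types and names here (message_t, bind_message, consume_message) are hypothetical, invented purely for illustration and not taken from the paper.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Illustrative sketch only: models the policy decision described in the
 * abstract, not the paper's actual hardware interface. The NIC is assumed
 * to have deposited the payload into a dedicated message cache; late
 * binding maps the payload to the receive buffer address without a copy. */

typedef enum { POLICY_DTCT, POLICY_LAZY_DTCT } dtct_policy_t;

typedef struct {
    uint64_t bound_addr;    /* final destination address after late binding    */
    size_t   length;        /* payload length in bytes                          */
    bool     in_msg_cache;  /* payload resident in the dedicated message cache  */
    bool     in_data_cache; /* payload already transferred into the data cache  */
} message_t;

/* Called when the receive is matched and the message is bound to its final
 * address. Eager DTCT transfers the payload to the data cache immediately;
 * lazy DTCT leaves it in the message cache for now. */
static void bind_message(message_t *m, uint64_t dest_addr, dtct_policy_t policy)
{
    m->bound_addr = dest_addr;          /* address translation, no memcpy */
    if (policy == POLICY_DTCT && m->in_msg_cache)
        m->in_data_cache = true;        /* push payload into the data cache now */
}

/* Called on the consuming thread's first access. Under lazy DTCT this is
 * the point where the payload moves into the data cache. */
static void consume_message(const char *label, message_t *m)
{
    const char *source = m->in_data_cache ? "data cache (transferred at bind time)"
                       : m->in_msg_cache  ? "message cache (transferred on first access)"
                                          : "main memory";
    if (!m->in_data_cache && m->in_msg_cache)
        m->in_data_cache = true;        /* lazy DTCT: transfer on first access */
    printf("%s: access at 0x%llx, %zu bytes, served from %s\n",
           label, (unsigned long long)m->bound_addr, m->length, source);
}

int main(void)
{
    message_t eager = { 0, 256, true, false };
    message_t lazy  = { 0, 256, true, false };

    bind_message(&eager, 0x1000, POLICY_DTCT);      /* transferred at bind time */
    bind_message(&lazy,  0x2000, POLICY_LAZY_DTCT); /* transfer deferred        */

    consume_message("DTCT     ", &eager);
    consume_message("lazy DTCT", &lazy);
    return 0;
}
```

In this toy model both policies avoid the intermediate copy; they differ only in when the payload is installed in the consumer's data cache, which is the trade-off the abstract's simulation study evaluates.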
