TCP offload through connection handoff

This paper presents a connection handoff interface between the operating system and the network interface. Using this interface, the operating system can offload a subset of the system's TCP connections to the network interface, while the remaining connections are processed on the host CPU. Offloading can reduce the computation and memory bandwidth required for packet processing on the host CPU. However, full TCP offloading may degrade system performance, because the finite processing and memory resources on the network interface limit both the amount of packet processing it can perform and the number of connections it can hold. Using handoff, the operating system controls the number of offloaded connections in order to fully utilize the network interface without overloading it. Handoff is transparent to the application, and the operating system may offload connections to the network interface or reclaim them from it at any time. A prototype system based on a modified FreeBSD operating system shows that handoff reduces the number of instructions and cache misses on the host CPU. As a result, the number of CPU cycles spent processing each packet decreases by 16-84%. Simulation results show that handoff can improve web server throughput (SPECweb99) by 15%, despite short-lived connections.
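
The bookkeeping implied by handoff can be illustrated with a minimal sketch. It is not the paper's interface: the class `HandoffManager`, its methods, and the `nic_capacity` limit are hypothetical names chosen for illustration, and the policy shown (offload only while the interface has room) is an assumed simplification of "fully utilize the network interface without overloading it".

```python
from dataclasses import dataclass, field


@dataclass
class HandoffManager:
    """Hypothetical OS-side record of where each TCP connection is processed."""
    nic_capacity: int                              # assumed limit on connections the NIC can hold
    on_host: set = field(default_factory=set)      # connections processed on the host CPU
    offloaded: set = field(default_factory=set)    # connections handed off to the NIC

    def open(self, conn_id):
        # In this sketch every connection starts on the host;
        # handing it off is a separate, later decision.
        self.on_host.add(conn_id)

    def offload(self, conn_id):
        # Hand a host connection to the NIC only if the NIC still has room.
        if conn_id not in self.on_host or len(self.offloaded) >= self.nic_capacity:
            return False
        self.on_host.remove(conn_id)
        self.offloaded.add(conn_id)
        return True

    def reclaim(self, conn_id):
        # Move an offloaded connection back to the host at any time.
        if conn_id not in self.offloaded:
            return False
        self.offloaded.remove(conn_id)
        self.on_host.add(conn_id)
        return True


mgr = HandoffManager(nic_capacity=2)
for conn in ("a", "b", "c"):
    mgr.open(conn)

first_round = [mgr.offload(conn) for conn in ("a", "b", "c")]  # "c" is refused: NIC is full
reclaimed = mgr.reclaim("a")                                   # frees one slot on the NIC
second_try = mgr.offload("c")                                  # now succeeds
```

The application never sees which set a connection belongs to, which is the sense in which handoff is transparent; only the operating system consults and updates this state.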
