Network Interface Design for Low Latency Request-Response Protocols

Ethernet network interfaces in commodity systems are designed to achieve high bandwidth at low CPU utilization, often at the expense of latency. This trade-off is viable only while software request-processing time overwhelmingly dominates the interface latency. However, recent efforts to lower software latency in request-response systems such as memcached and RAMCloud have made the network interface a significant contributor to overall latency. We present a low-latency network interface design suitable for request-response applications. Evaluation of a prototype FPGA implementation shows that our design improves latency by more than a factor of two without a meaningful negative impact on either bandwidth or CPU power. We also investigate the latency-power tradeoffs between interrupts and polling, as well as the effects of the processor's low-power states.
