TCP Stalls at the Server Side: Measurement and Mitigation

TCP is an important factor in the user-perceived performance of Internet applications. Diagnosing the causes of TCP performance issues in the wild is essential to understanding TCP's current shortcomings. This paper presents a TCP flow performance analysis framework that classifies the causes of TCP stalls. The framework forms the basis of a tool that we use to analyze packet-level traces of three services (cloud storage, software download, and web search) deployed by a popular service provider. We find that as many as 20% of the flows are stalled for half of their lifetime. Network-related causes, especially timeout retransmissions, dominate the stalls. A breakdown of the causes of timeout retransmission stalls reveals that double retransmission and tail retransmission are among the top contributors. The importance of these causes, however, depends on the specific service. Based on these observations, we propose smart retransmission timeout (S-RTO), a mechanism that mitigates timeout retransmission stalls through careful and gentle aggression in retransmission. S-RTO is evaluated both in a controlled network and in a production network. The results consistently show that it is effective at improving TCP performance, especially for short flows.
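To make the "stalled for half of their lifetime" metric concrete, the following minimal sketch computes the fraction of a flow's lifetime spent in stalls from packet timestamps. It is an illustration only, not the paper's tool: the function name `stalled_fraction`, the `gap_threshold` parameter, and the rule that a stall is any inter-packet gap longer than that threshold are assumptions made here, since the abstract does not define a stall.

```python
from typing import Sequence


def stalled_fraction(timestamps: Sequence[float], gap_threshold: float) -> float:
    """Return the fraction of a flow's lifetime spent in stalls.

    Assumption (not from the abstract): a stall is any gap between
    consecutive packets that is longer than gap_threshold seconds.
    """
    if len(timestamps) < 2:
        return 0.0  # a single packet has no measurable lifetime

    ts = sorted(timestamps)
    lifetime = ts[-1] - ts[0]
    if lifetime <= 0:
        return 0.0

    # Sum only the gaps that exceed the threshold.
    stalled = sum(b - a for a, b in zip(ts, ts[1:]) if b - a > gap_threshold)
    return stalled / lifetime


# Hypothetical flow: one 4-second gap in an 8-second lifetime.
flow = [0.0, 1.0, 2.0, 6.0, 7.0, 8.0]
print(stalled_fraction(flow, gap_threshold=2.0))
```

In practice the threshold would likely be tied to per-flow round-trip time or retransmission timeout estimates rather than a fixed constant; a fixed value is used here only to keep the sketch self-contained.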
