Riptide: Jump-Starting Back-Office Connections in Cloud Systems

Large-scale cloud networks are constantly driven by the need for better performance in communication between datacenters. Indeed, such back-office communication makes up a large fraction of traffic in many cloud environments: it occurs frequently, and it carries control messages, coordination and load-balancing information, and customer data. Delivering this inter-datacenter traffic efficiently, however, requires optimizing connections over large physical distances, which is non-trivial. Worse still, many large cloud networks are subject to complex configuration and administrative restrictions that limit the types of solutions that can be deployed. In this paper, we propose improving the efficiency of datacenter-to-datacenter communication by learning the congestion level of the links between datacenters and using this knowledge to inform new connections between them, eliminating the slow-start overhead traditionally incurred by new connections. We present Riptide, a tool that implements this approach, describing its design and implementation and showing that it runs readily on modern Linux servers deployed in the real world. We demonstrate that Riptide reduces total transfer times in a production global-scale content delivery network (CDN), providing up to a 30% decrease in tail latency, and that it is simple to deploy and easy to maintain within a complex existing network.
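To make the idea concrete, below is a minimal sketch of congestion-informed connection seeding on a Linux host. It is an illustration of the general technique only, not Riptide's actual implementation: it assumes the standard ss and ip route utilities, root privileges, and a placeholder destination prefix (203.0.113.0/24), and the policy of reusing the largest congestion window currently observed toward that prefix is a simplifying assumption.

#!/usr/bin/env python3
"""Sketch: cache the congestion window observed on established inter-datacenter
connections and pre-seed new connections on the same route by raising the
kernel's initial congestion window via `ip route`. Prefixes and the
window-selection policy are illustrative assumptions, not taken from the paper."""

import json
import subprocess

# Hypothetical remote-datacenter prefix; a real deployment would learn this.
REMOTE_PREFIX = "203.0.113.0/24"

def observed_cwnd(prefix: str) -> int:
    """Return the largest congestion window (in segments) currently observed on
    established connections toward `prefix`, parsed from `ss -ti` output."""
    out = subprocess.run(["ss", "-ti", "dst", prefix],
                         capture_output=True, text=True, check=True).stdout
    cwnds = []
    for line in out.splitlines():
        for token in line.split():
            if token.startswith("cwnd:"):
                cwnds.append(int(token.split(":")[1]))
    return max(cwnds, default=10)  # fall back to the Linux default of 10 segments

def seed_initial_cwnd(prefix: str, cwnd: int) -> None:
    """Pin the initial congestion and receive windows for the route to `prefix`
    so that new connections skip most of slow start. Requires root."""
    # Look up the existing next hop so its gateway/device attributes are preserved.
    route = subprocess.run(["ip", "-j", "route", "get", prefix.split("/")[0]],
                           capture_output=True, text=True, check=True).stdout
    hop = json.loads(route)[0]
    cmd = ["ip", "route", "replace", prefix]
    if "gateway" in hop:
        cmd += ["via", hop["gateway"]]
    cmd += ["dev", hop["dev"], "initcwnd", str(cwnd), "initrwnd", str(cwnd)]
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    cwnd = observed_cwnd(REMOTE_PREFIX)
    seed_initial_cwnd(REMOTE_PREFIX, cwnd)
    print(f"seeded initcwnd={cwnd} for {REMOTE_PREFIX}")

In this sketch the learned congestion state lives in the kernel's per-route initcwnd/initrwnd settings, so every subsequent connection to the prefix starts with the seeded window without any application changes; how Riptide itself stores and applies this state is described in the paper.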
