Pingmesh: A Large-Scale System for Data Center Network Latency Measurement and Analysis

Can we get network latency between any two servers at any time in large-scale data center networks? The collected latency data can then be used to address a series of challenges: telling whether an application-perceived latency issue is caused by the network, defining and tracking network service level agreements (SLAs), and automatic network troubleshooting. We have developed the Pingmesh system for large-scale data center network latency measurement and analysis to answer the above question affirmatively. Pingmesh has been running in Microsoft data centers for more than four years, and it collects tens of terabytes of latency data per day. Pingmesh is widely used not only by network software developers and engineers, but also by application and service developers and operators.

CCS Concepts: • Networks → Network measurement; Cloud computing; Network monitoring; • Computer systems organization → Cloud computing;
