MimicNet: fast performance estimates for data center networks with machine learning

At-scale evaluation of new data center network innovations is becoming increasingly intractable. This is true for testbeds, where few, if any, can afford a dedicated, full-scale replica of a data center. It is also true for simulations, which, while originally designed for precisely this purpose, have struggled to cope with the size of today's networks. This paper presents an approach for quickly obtaining accurate performance estimates for large data center networks. Our system, MimicNet, provides users with the familiar abstraction of a packet-level simulation for a portion of the network while leveraging redundancy and recent advances in machine learning to quickly and accurately approximate portions of the network that are not directly visible. MimicNet can provide over two orders of magnitude speedup compared to regular simulation for a data center with thousands of servers. Even at this scale, MimicNet's estimates of the tail FCT, throughput, and RTT are within 5% of the true results.
