Cloud-Scale Application Performance Monitoring with SDN and NFV

In cloud data centers, a growing number of services are deployed across multiple tiers to increase flexibility and scalability. However, this makes it difficult for the cloud provider to identify which tier of an application is the bottleneck and how to resolve performance problems. Existing solutions address this problem by continuously monitoring either end hosts or physical switches. Host-based monitoring usually requires instrumenting application code, which makes it less practical, while monitoring based on network hardware is expensive and requires special features in each physical switch. Instead, we believe network-wide monitoring should be flexible and easy to deploy in a non-intrusive way by exploiting recent advances in software-based network services. To this end, we are developing a distributed, software-based network monitoring framework for cloud data centers. Our system uses topology and routing information to infer the relationships among the tiers of an application, and it detects and locates performance bottlenecks by monitoring network traffic inside software switches.
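The idea of relating tiers and locating a bottleneck from traffic observed at software switches can be illustrated with a minimal sketch. It is not the framework described above: the flow record format, the `TIER_OF` mapping, the addresses, and all function names are hypothetical, and it assumes a simple linear chain of tiers in which each request triggers one downstream call.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical mapping from VM address to application tier,
# assumed to be derived from topology information.
TIER_OF = {"10.0.0.1": "web", "10.0.1.1": "app", "10.0.2.1": "db"}

def tier_edges(flows):
    """Group observed response times (ms) by (client tier, server tier)."""
    edges = defaultdict(list)
    for src, dst, rt_ms in flows:
        edges[(TIER_OF[src], TIER_OF[dst])].append(rt_ms)
    return edges

def exclusive_time(edges):
    """Estimate the time spent inside each server tier.

    For a linear chain, this is the tier's mean response time minus the
    mean response time of the calls it makes to the next tier.
    """
    means = {edge: mean(times) for edge, times in edges.items()}
    result = {}
    for (_client, server), rt in means.items():
        downstream = sum(m for (c, _s), m in means.items() if c == server)
        result[server] = rt - downstream
    return result

def bottleneck(flows):
    """Return the tier with the largest estimated exclusive time."""
    times = exclusive_time(tier_edges(flows))
    return max(times, key=times.get)

# Illustrative flow records: (source address, destination address, response time in ms).
flows = [
    ("10.0.0.1", "10.0.1.1", 50.0),
    ("10.0.0.1", "10.0.1.1", 54.0),
    ("10.0.1.1", "10.0.2.1", 40.0),
    ("10.0.1.1", "10.0.2.1", 44.0),
]
print(bottleneck(flows))
```

With these made-up records, most of the time observed at the app tier is spent waiting on the db tier, so the sketch reports the db tier as the likely bottleneck.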
