YTrace: End-to-end Performance Diagnosis in Large Cloud and Content Providers

Content providers build serving stacks to deliver content to users. An important goal of a content provider is to ensure a good user experience, since user experience affects revenue. In this paper, we describe YTrace, a system at Yahoo that diagnoses bad user experience in near real time. We present the components of YTrace for end-to-end multi-layer diagnosis (instrumentation, methods and backend system), and the system architecture for delivering diagnosis in near real time across all user sessions at Yahoo. YTrace diagnoses problems across the service and network layers of the end-to-end path spanning the user host, Internet, CDN and datacenters. It has three diagnosis goals: detection, localization and root cause analysis (including cascading problems) of performance problems in user sessions with the cloud. The key component of the methods in YTrace is capturing and discovering causality, which we design based on a mix of instrumentation API, domain knowledge and blackbox methods. We show three case studies from production that span a large-scale distributed storage system, a datacenter-wide network, and an end-to-end video serving stack at Yahoo. We end by listing a number of open directions for performance diagnosis in cloud and content providers.
