Understanding latency variations of black box services

Data centers run many services that impact millions of users daily. In reality, the latency of each service varies from one request to another. Existing tools allow to monitor services for performance glitches or service disruptions, but typically they do not help understanding the variations in latency. We propose a general framework for understanding performance of arbitrary black box services. We consider a stream of requests to a given service with their monitored attributes, as well as latencies of serving each request. We propose what we call the multi-dimensional f-measure, that helps for a given interval to identify the subset of monitored attributes that explains it. We design algorithms that use this measure not only for a fixed latency interval, but also to explain the entire range of latencies of the service by segmenting it into smaller intervals. We perform a detailed experimental study with synthetic data, as well as real data from a large search engine. Our experiments show that our methods automatically identify significant latency intervals together with request attributes that explain them, and are robust.

[1]  Josef Kittler,et al.  Pattern recognition : a statistical approach , 1982 .

[2]  Richard Mortier,et al.  Magpie: Online Modelling and Performance-aware Systems , 2003, HotOS.

[3]  Donald Beaver,et al.  Dapper, a Large-Scale Distributed Systems Tracing Infrastructure , 2010 .

[4]  Ron Kohavi,et al.  Wrappers for Feature Subset Selection , 1997, Artif. Intell..

[5]  Ramakrishnan Srikant,et al.  Fast Algorithms for Mining Association Rules in Large Databases , 1994, VLDB.

[6]  Gideon S. Mann,et al.  Diagnosing Latency in Multi-Tier Black-Box Services , 2011 .

[7]  George Forman,et al.  An Extensive Empirical Study of Feature Selection Metrics for Text Classification , 2003, J. Mach. Learn. Res..

[8]  Ailsa H. Land,et al.  An Automatic Method of Solving Discrete Programming Problems , 1960 .

[9]  Randy H. Katz,et al.  X-Trace: A Pervasive Network Tracing Framework , 2007, NSDI.

[10]  Keinosuke Fukunaga,et al.  A Branch and Bound Algorithm for Feature Subset Selection , 1977, IEEE Transactions on Computers.

[11]  Francesca Odone,et al.  Feature selection for high-dimensional data , 2009, Comput. Manag. Sci..

[12]  Michael J. Freedman,et al.  Experiences with Tracing Causality in Networked Services , 2010, INM/WREN.

[13]  Petros Drineas,et al.  CUR matrix decompositions for improved data analysis , 2009, Proceedings of the National Academy of Sciences.

[14]  Rakesh Agarwal,et al.  Fast Algorithms for Mining Association Rules , 1994, VLDB 1994.

[15]  Isabelle Guyon,et al.  Practical Feature Selection: from Correlation to Causality , 2007, NATO ASI Mining Massive Data Sets for Security.

[16]  Gregory R. Ganger,et al.  Diagnosing Performance Changes by Comparing Request Flows , 2011, NSDI.

[17]  Hanoch Levy,et al.  Evaluating web user perceived latency using server side measurements , 2003, Comput. Commun..

[18]  James R. Larus,et al.  Efficient path profiling , 1996, Proceedings of the 29th Annual IEEE/ACM International Symposium on Microarchitecture. MICRO 29.

[19]  Vipin Kumar,et al.  Optimizing F-Measure with Support Vector Machines , 2003, FLAIRS Conference.

[20]  Sudipto Guha,et al.  Modeling the Parallel Execution of Black-Box Services , 2011, HotCloud.

[21]  Nan Ye,et al.  Optimizing F-measure: A Tale of Two Approaches , 2012, ICML.

[22]  A. Atkinson Subset Selection in Regression , 1992 .