Modeling the Parallel Execution of Black-Box Services

Services running in a data center frequently rely on RPCs to child services (e.g. storage, cache, authentication), and their latency depends crucially on latencies of those RPCs. However, even though service latency often comes exclusively from the time spent inside remote calls, it is difficult to determine parent latency since multithreading and asynchronous RPCs lead to complex and non-linear dependencies between service and RPC latencies. In this paper, we present a model that can be used to estimate parent latency given RPC latencies, where the parallel dependencies among of child services are modeled by an "execution flow", a direct acyclic graph. The model is learned from samples collected by a distributed tracing tool. Experiments demonstrate that these models are better able to predict top-level parent latency from child latency than state-of-the-art baselines such as linear regression and critical path analysis.

[1]  Alex Aiken,et al.  Cooperative Bug Isolation , 2007 .

[2]  S. Ranjan,et al.  QoS-driven server migration for Internet data centers , 2002, IEEE 2002 Tenth IEEE International Workshop on Quality of Service (Cat. No.02EX564).

[3]  Gregory R. Ganger,et al.  Diagnosing Performance Changes by Comparing Request Flows , 2011, NSDI.

[4]  Richard Mortier,et al.  Magpie: Online Modelling and Performance-aware Systems , 2003, HotOS.

[5]  Randy H. Katz,et al.  X-Trace: A Pervasive Network Tracing Framework , 2007, NSDI.

[6]  Marcos K. Aguilera,et al.  Performance debugging for distributed systems of black boxes , 2003, SOSP '03.

[7]  Benny Rochwerger,et al.  Oceano-SLA based management of a computing utility , 2001, 2001 IEEE/IFIP International Symposium on Integrated Network Management Proceedings. Integrated Network Management VII. Integrated Management Strategies for the New Millennium (Cat. No.01EX470).

[8]  Guillaume Pierre,et al.  Autonomous resource provisioning for multi-service web applications , 2010, WWW '10.

[9]  Richard Wolski,et al.  Dynamically forecasting network performance using the Network Weather Service , 1998, Cluster Computing.

[10]  Amin Vahdat,et al.  Pip: Detecting the Unexpected in Distributed Systems , 2006, NSDI.

[11]  Donald Beaver,et al.  Dapper, a Large-Scale Distributed Systems Tracing Infrastructure , 2010 .

[12]  Asser N. Tantawi,et al.  An analytical model for multi-tier internet services and its applications , 2005, SIGMETRICS '05.

[13]  Julio César López-Hernández,et al.  Stardust: tracking activity in a distributed storage system , 2006, SIGMETRICS '06/Performance '06.

[14]  James R. Larus,et al.  Efficient path profiling , 1996, Proceedings of the 29th Annual IEEE/ACM International Symposium on Microarchitecture. MICRO 29.