Modeling performance of internet-based services using causal reasoning

The performance of Internet-based services depends on many server-side, client-side, and network-related factors. Often, the interactions among these factors and their effects on service performance are not known or not well understood. The complexity of these services makes it difficult to develop analytical models. The lack of models impedes network management tasks, such as predicting performance while planning changes to service infrastructure, or diagnosing causes of poor performance. We posit that statistical causal methods can be used to model the performance of Internet-based services and to facilitate performance-related network management tasks. Internet-based services are well suited to statistical learning because the inherent variability in many of the factors that affect performance allows us to collect comprehensive datasets that cover service performance under a wide variety of conditions. The conditional distributions learned from these datasets represent the functions that govern service performance and the dependencies that are inherent in the service infrastructure. These functions and dependencies are accurate and can be used in lieu of analytical models to reason about system performance, such as predicting the performance of a service when some factors change, finding causes of poor performance, or isolating the contribution of individual factors to observed performance. We present three systems, What-if Scenario Evaluator (WISE), How to Improve Performance (HIP), and Network Access Neutrality Observatory (NANO), that use statistical causal methods to facilitate network management tasks. WISE predicts performance for what-if configuration and deployment questions for content distribution networks. To do this, WISE learns the causal dependency structure among the latency-causing factors, and when one or more factors are changed, WISE estimates the effect on the other factors using the dependency structure.
HIP extends WISE and uses the causal dependency structure to invert the performance function, find causes of poor performance, and help answer questions about how to improve performance or achieve performance goals. NANO uses causal inference to quantify the impact of ISPs' discrimination policies on service performance. NANO is the only tool to date for detecting destination-based discrimination techniques that ISPs may use. We have evaluated these tools by applying them to large-scale Internet-based services and by running experiments on the wide-area Internet. WISE is actively used at Google for predicting network-level and browser-level response time for Web search for new datacenter deployments. We have used HIP to find causes of high-latency Web search transactions at Google, and identified many cases where high-latency transactions can be significantly mitigated with simple infrastructure changes. We have evaluated NANO using experiments on the wide-area Internet and have also made the tool publicly available to recruit users and deploy NANO at a global scale.
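
To illustrate the kind of causal inference the abstract attributes to NANO, the following is a minimal sketch of one standard technique, stratification on a confounder. It is not NANO's implementation. The function name `stratified_effect`, the field names (`isp_x`, `client`, `latency_ms`), and the toy measurements are all hypothetical and chosen only for illustration. The sketch compares a naive difference in mean latency between an ISP and a baseline with an estimate that compares only samples sharing the same confounder value.

```python
from collections import defaultdict
from statistics import mean


def stratified_effect(rows, treatment, outcome, confounders):
    """Estimate the effect of `treatment` on `outcome` by comparing treated
    and untreated samples only within strata that share confounder values,
    then averaging the per-stratum differences weighted by stratum size."""
    strata = defaultdict(lambda: {True: [], False: []})
    for row in rows:
        key = tuple(row[c] for c in confounders)
        strata[key][bool(row[treatment])].append(row[outcome])

    weighted_sum = 0.0
    total = 0
    for groups in strata.values():
        treated, control = groups[True], groups[False]
        if not treated or not control:
            continue  # no comparison is possible in this stratum
        n = len(treated) + len(control)
        weighted_sum += n * (mean(treated) - mean(control))
        total += n
    if total == 0:
        raise ValueError("no stratum has both treated and control samples")
    return weighted_sum / total


# Hypothetical toy data: ISP X happens to serve mostly slow clients.
rows = [
    {"isp_x": True, "client": "slow", "latency_ms": 310},
    {"isp_x": True, "client": "slow", "latency_ms": 290},
    {"isp_x": True, "client": "slow", "latency_ms": 300},
    {"isp_x": True, "client": "fast", "latency_ms": 100},
    {"isp_x": False, "client": "slow", "latency_ms": 300},
    {"isp_x": False, "client": "fast", "latency_ms": 90},
    {"isp_x": False, "client": "fast", "latency_ms": 110},
    {"isp_x": False, "client": "fast", "latency_ms": 100},
]

# Naive comparison that ignores the confounder.
naive = (mean(r["latency_ms"] for r in rows if r["isp_x"])
         - mean(r["latency_ms"] for r in rows if not r["isp_x"]))
# Comparison within strata of equal client type.
adjusted = stratified_effect(rows, "isp_x", "latency_ms", ["client"])

print(f"naive difference:    {naive} ms")
print(f"adjusted difference: {adjusted} ms")
```

In this toy dataset the naive comparison suggests that ISP X adds latency, while the stratified estimate shows no difference once client type is held fixed. The gap between the two is the reason a causal treatment, rather than a plain correlation, is needed before attributing poor performance to an ISP.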
