Analysis of Web Workloads Using the Bootstrap Methodology

Modeling the performance of caches requires examining their behavior under realistic workloads, which are created by extracting the essential characteristics of actual Web traffic. We have used the Bootstrap methodology to analyze response size and file popularity distributions obtained from proxy log files. The Bootstrap is a resampling technique that enables one to estimate both population parameters and their standard errors when no accurate mathematical representations are available for the underlying distributions. For this investigation we used real-world workloads from NLANR and Verizon Laboratories together with a synthetic workload generated by Web Polygraph. The analysis of response size included averages and percentiles, processing and bandwidth costs, and a null hypothesis test for ascertaining the likelihood that different datasets have the same underlying distribution. The file popularity distribution was investigated to assess the degree of Zipf-like behavior. We conclude that the Bootstrap is an effective tool for the analysis of nonparametric distributions, one that enables one to determine confidence levels and make stronger statements than is possible using conventional statistics. It is also valuable for determining how accurately a synthetic workload represents actual traffic.
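The resampling idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `bootstrap`, the number of resamples, the use of a percentile interval, and the sample response sizes are all assumptions chosen for illustration. The sketch draws resamples with replacement from an observed sample, recomputes a statistic on each, and uses the spread of those replicates to estimate a standard error and an approximate 95% interval.

```python
import random
import statistics


def bootstrap(sample, statistic, n_resamples=2000, seed=0):
    """Return (point estimate, standard error, 95% percentile interval).

    Hypothetical helper for illustration; parameter defaults are assumptions.
    """
    rng = random.Random(seed)
    n = len(sample)
    # Each replicate is the statistic computed on n draws with replacement.
    replicates = sorted(
        statistic(rng.choices(sample, k=n)) for _ in range(n_resamples)
    )
    estimate = statistic(sample)
    # The standard deviation of the replicates estimates the standard error.
    std_error = statistics.stdev(replicates)
    # Percentile interval: the 2.5th and 97.5th percentiles of the replicates.
    lower = replicates[int(0.025 * n_resamples)]
    upper = replicates[int(0.975 * n_resamples) - 1]
    return estimate, std_error, (lower, upper)


# Made-up response sizes in bytes (heavy-tailed, illustrative only).
sizes = [312, 540, 1024, 1800, 2200, 4096, 5100, 8192, 15000, 120000]
est, se, (lo, hi) = bootstrap(sizes, statistics.median)
print(f"median={est} bytes, standard error={se:.1f}, 95% interval=({lo}, {hi})")
```

The median is used here because response sizes are typically heavy-tailed and no closed-form standard error is assumed; any other statistic, such as a mean or a high percentile, could be passed in the same way.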
