Investigating the Distributional Property of the Session Workload

Companies now rely on the World Wide Web to communicate with their customers. As reliance on web servers grows, so does the need for companies to understand the workload placed on those servers. The session is a popular unit of workload measurement for analyzing information recorded in server logs. Many web applications, from shopping carts to online banking systems, require session information to function correctly, and web data mining also depends on session workload information. However, the distributional properties of the session workload are not well understood. Whether the session workload follows a short-tailed or a heavy-tailed distribution is a fundamental question for any investigation of the session workload unit. This paper empirically explores claims that the session workload can be described by a heavy-tailed distribution. It concludes that, for the samples used in this paper, no method exists that can accurately determine whether the session workload is drawn from a heavy-tailed distribution. Hence, it cannot be concluded that the session workload is drawn from such a distribution.
