Contrasting Web Robot and Human Behaviors with Network Models

The web graph is a commonly used network representation of the hyperlink structure of a website. A network of similar structure, which we call the session graph, has properties that reflect the browsing habits of the agents recorded in the web server logs. In this paper, we use session graphs to compare the activity of humans against that of web robots, or crawlers. Understanding these properties will enable us to improve models of HTTP traffic, which can be used to predict and generate realistic traffic for testing and improving web server efficiency, as well as for devising new caching algorithms. We examine large-scale network properties, such as the connectivity and degree distribution, of human and web robot session graphs in order to identify characteristics of the traffic that would be useful for modeling web traffic and improving cache performance. We find that the empirical degree distributions of session graphs for human and robot requests on one web server are best fit by different theoretical distributions, indicating a difference in the processes that generate the traffic.
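As a rough illustration of the idea, the sketch below builds a session graph and its empirical degree distribution from a few toy sessions. It rests on assumptions that the abstract does not state: a session is taken to be an ordered list of requested resources, nodes are resources, and a directed edge links each resource to the one requested immediately after it in the same session. The function names and sample paths are hypothetical, and the step of fitting theoretical distributions to the degrees is not shown.

```python
from collections import Counter


def build_session_graph(sessions):
    """Build a directed graph from sessions (assumed construction).

    Each session is an ordered list of requested resources. Nodes are
    resources; an edge (src, dst) is added when dst is requested
    immediately after src within one session. Self-loops are skipped.
    """
    nodes = set()
    edges = set()
    for requests in sessions:
        nodes.update(requests)
        for src, dst in zip(requests, requests[1:]):
            if src != dst:
                edges.add((src, dst))
    return nodes, edges


def degree_distribution(nodes, edges):
    """Return a Counter mapping total degree -> number of nodes."""
    degree = {node: 0 for node in nodes}
    for src, dst in edges:
        degree[src] += 1  # out-degree contribution
        degree[dst] += 1  # in-degree contribution
    return Counter(degree.values())


if __name__ == "__main__":
    # Hypothetical toy sessions, not data from the paper.
    human_sessions = [["/", "/about", "/contact"], ["/", "/about"]]
    robot_sessions = [["/robots.txt", "/", "/a", "/b", "/c"]]

    for label, sessions in (("human", human_sessions), ("robot", robot_sessions)):
        nodes, edges = build_session_graph(sessions)
        print(label, dict(degree_distribution(nodes, edges)))
```

With real server logs, the per-node degrees computed this way could be passed to a distribution-fitting tool to compare candidate heavy-tailed models for human and robot traffic separately.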
