Remembering what we like: Toward an agent-based model of Web traffic

Analysis of aggregate Web traffic has shown that PageRank is a poor model of how people actually navigate the Web. Using the empirical traffic patterns generated by a thousand users over the course of two months, we characterize the properties of Web traffic that cannot be reproduced by Markovian models, in which destinations are independent of past decisions. In particular, we show that the diversity of sites visited by individual users is smaller and more broadly distributed than predicted by the PageRank model; that link traffic is more broadly distributed than predicted; and that the time between consecutive visits to the same site by a user is less broadly distributed than predicted. To account for these discrepancies, we introduce a more realistic navigation model in which agents maintain individual lists of bookmarks that are used as teleportation targets. The model can also account for branching, a traffic property caused by browser features such as tabs and the back button. It reproduces aggregate traffic patterns such as site popularity, while also generating more accurate predictions of diversity, link traffic, and return time distributions. For the first time, this model allows us to capture the extreme heterogeneity of aggregate traffic measurements while also explaining the more narrowly focused browsing patterns of individual users.
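As a rough illustration of the contrast drawn above, the following Python sketch compares a memoryless, PageRank-style surfer with an agent that teleports only to its own bookmarks. It is a minimal sketch under stated assumptions, not the model as specified in the paper: the function names, the toy ring graph, the teleportation probability of 0.15, and the uniform choice among bookmarks are all hypothetical choices made here for illustration, and branching is not modeled.

```python
import random


def simulate_agent(graph, steps, teleport_prob=0.15, use_bookmarks=True, seed=0):
    """Simulate one surfer on `graph` (dict: site -> list of linked sites).

    Assumptions for illustration only: bookmarks are chosen uniformly at
    random, and every newly reached site is added to the bookmark list.
    """
    rng = random.Random(seed)
    sites = sorted(graph)
    current = rng.choice(sites)
    bookmarks = [current]          # the agent's individual memory
    visits = [current]
    for _ in range(steps):
        out_links = graph[current]
        if rng.random() < teleport_prob or not out_links:
            if use_bookmarks:
                # Teleport to a site this agent has already seen.
                current = rng.choice(bookmarks)
            else:
                # Memoryless baseline: teleport to any site at all.
                current = rng.choice(sites)
        else:
            # Follow a random outgoing link and remember new sites.
            current = rng.choice(out_links)
            if current not in bookmarks:
                bookmarks.append(current)
        visits.append(current)
    return visits


def ring_graph(n):
    """Toy graph (hypothetical): site i links only to site (i + 1) mod n."""
    return {i: [(i + 1) % n] for i in range(n)}


if __name__ == "__main__":
    graph = ring_graph(1000)
    with_memory = simulate_agent(graph, 500, use_bookmarks=True)
    memoryless = simulate_agent(graph, 500, use_bookmarks=False)
    print("distinct sites, bookmark agent:", len(set(with_memory)))
    print("distinct sites, memoryless agent:", len(set(memoryless)))
```

In the bookmark variant, every teleportation lands on a previously visited site, so the walk is no longer Markovian in the current site alone: where the agent goes next depends on its own history. Comparing the number of distinct sites visited by the two variants gives a crude view of the diversity property discussed in the abstract.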
