Structural mining of large-scale behavioral data from the internet

As the Internet becomes ever more pervasive in the lives of hundreds of millions of people, our understanding of its physical structure has outpaced our understanding of the dynamic patterns of traffic generated by its users. This work aims to develop a better understanding of the structure of Internet traffic in a manner consistent with individual privacy and computational constraints. I first examine network flow data from the Internet2 network, using it to form “behavioral networks” based on the flows attributable to specific network applications. The heavy-tailed distributions associated with these networks suggest unbounded variance and poorly defined means in distributions of user behavior. However, a novel application of hierarchical clustering to similarity data derived from these networks makes it possible to classify network applications robustly based on their observed behavior. I then focus on Web traffic, using a large collection of HTTP request data to build a weighted subset of the Web graph. Analysis of this weighted graph reveals more heavy-tailed distributions and the presence of a large body of stationary traffic. The traffic data are also shown to contradict key assumptions of the random surfer model used by PageRank. I conclude with the development of ABC, a behaviorally plausible agent-based model of Web traffic that incorporates backtracking, bookmarks, and a sense of topical locality. The ABC model is shown to approximate real user activity more accurately than PageRank on both artificial and empirically generated graphs.
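For readers unfamiliar with the baseline that ABC is compared against, the following is a minimal sketch of the standard random surfer model behind PageRank, computed by power iteration. It is an illustration only and not code from this work: the function name `pagerank`, the damping value of 0.85, the fixed iteration count, and the three-page toy graph are all assumptions chosen for the example. The surfer in this model is memoryless: at each step it either follows an outgoing link chosen uniformly at random or jumps to a page chosen uniformly at random. It has no back button, no bookmarks, and no notion of topic.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Approximate PageRank by power iteration.

    links: dict mapping each page to a list of pages it links to.
    damping: probability of following a link rather than jumping
             to a uniformly random page (0.85 is an assumed value).
    """
    # Collect every page that appears as a source or as a target.
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        # Every page receives an equal share of the random-jump traffic.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # Link-following traffic is split uniformly over outgoing links.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its traffic over all pages.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical three-page graph, used only to exercise the function.
toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(toy_graph)
```

In this toy graph, page `c` collects traffic from both `a` and `b`, so it ends up with the largest share. The uniform link choice and uniform random jumps are the defining simplifications of this baseline; a model such as ABC replaces the memoryless surfer with agents that can backtrack, reuse bookmarks, and prefer topically related pages.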
