What's new on the web?: the evolution of the web from a search engine perspective

We seek to gain improved insight into how Web search engines should cope with the evolving Web, in an attempt to provide users with the most up-to-date results possible. For this purpose we collected weekly snapshots of some 150 Web sites over the course of one year, and measured the evolution of content and link structure. Our measurements focus on aspects of potential interest to search engine designers: the evolution of link structure over time, the rate of creation of new pages and new distinct content on the Web, and the rate of change of the content of existing pages under search-centric measures of degree of change. Our findings indicate a rapid turnover rate of Web pages, i.e., high rates of birth and death, coupled with an even higher rate of turnover in the hyperlinks that connect them. For pages that persist over time we found that, perhaps surprisingly, the degree of content shift as measured using TF.IDF cosine distance does not appear to be consistently correlated with the frequency of content updating. Despite this apparent non-correlation, the rate of content shift of a given page is likely to remain consistent over time. That is, pages that change a great deal in one week will likely change by a similarly large degree in the following week. Conversely, pages that experience little change will continue to experience little change. We conclude the paper with a discussion of the potential implications of our results for the design of effective Web search engines.
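As an illustration of the content-shift measure named above, the following is a minimal sketch of TF.IDF cosine distance between weekly snapshots of a page. The abstract does not specify the weighting scheme, so the smoothed IDF used here, the whitespace tokenization, and the names `tfidf_vectors` and `cosine_distance` are assumptions made for illustration, not the authors' implementation.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF.IDF vector (dict: term -> weight) per tokenized document.

    Assumption: IDF is smoothed as log((1 + n) / (1 + df)) + 1, so terms that
    occur in every snapshot still carry a nonzero weight.
    """
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # document frequency: count each term once per document
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # raw term frequency within this document
        vectors.append(
            {t: c * (math.log((1 + n) / (1 + df[t])) + 1.0) for t, c in tf.items()}
        )
    return vectors


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two sparse vectors, clamped to [0, 1]."""
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 and norm_v == 0.0:
        return 0.0  # two empty documents are treated as identical
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0  # an empty document shares nothing with a non-empty one
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    return min(1.0, max(0.0, 1.0 - dot / (norm_u * norm_v)))


# Hypothetical weekly snapshots of one page, tokenized on whitespace.
week1 = "search engines crawl the web every week".split()
week2 = "search engines crawl the web every day".split()
week3 = "entirely different text appears here now".split()

v1, v2, v3 = tfidf_vectors([week1, week2, week3])
small_shift = cosine_distance(v1, v2)  # one word changed
large_shift = cosine_distance(v1, v3)  # no terms in common
```

Under this sketch a page can be updated every week yet show only a small distance between consecutive snapshots, which is the distinction the abstract draws between update frequency and degree of content shift.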
