On URL and content persistence

This report presents a study of URL and content persistence among 51 million pages from a national web, harvested 8 times over almost 3 years. It differs from previous studies in that it describes the evolution of a large set of web pages over several years and examines in depth the characteristics of persistent data. We found that the persistence of URLs and contents follows a logarithmic distribution. We characterized persistent URLs and contents, and identified reasons for URL death. We found that lasting contents tend to be referenced by different URLs during their lifetime. On the other hand, half of the contents referenced by persistent URLs did not change.
