Persistence of information on the web: analyzing citations contained in research articles

We analyze the persistence of information on the web by measuring the percentage of invalid URLs in academic articles indexed by the CiteSeer (ResearchIndex) database. The average number of URLs per paper increased from 0.06 in 1993 to 1.6 in 1999. A significant percentage of these URLs are now invalid, ranging from 23% for 1999 articles to 53% for 1994 articles. For almost all of the invalid URLs, it was possible to locate the information (or highly related information) at an alternate location, primarily by using search engines. However, the ability to relocate missing information varied with search experience and effort expended. Current citation practices suggest that more information may be lost in the future unless these practices improve. We discuss persistent URL standards and their usage, and give recommendations for citing URLs in research articles and for finding the new location of invalid URLs.
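
The per-year measurement described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the URL pattern, and the injected `is_valid` checker are all assumptions made for illustration. A real study would replace `is_valid` with an actual network request and would need to decide how to treat redirects and timeouts.

```python
import re
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical, deliberately simple URL pattern (an assumption, not from the paper).
URL_PATTERN = re.compile(r"https?://[^\s<>]+")


def extract_urls(text: str) -> List[str]:
    """Return URLs found in text, trimming trailing sentence punctuation."""
    return [match.rstrip(".,;:)") for match in URL_PATTERN.findall(text)]


def invalid_percentage_by_year(
    articles: Iterable[Tuple[int, str]],
    is_valid: Callable[[str], bool],
) -> Dict[int, float]:
    """Compute the percentage of invalid URLs per publication year.

    `articles` yields (year, full_text) pairs. `is_valid` is a caller-supplied
    checker; here it is a placeholder for a real HTTP request.
    """
    totals: Dict[int, int] = defaultdict(int)
    invalid: Dict[int, int] = defaultdict(int)
    for year, text in articles:
        for url in extract_urls(text):
            totals[year] += 1
            if not is_valid(url):
                invalid[year] += 1
    # Years with no URLs never enter `totals`, so no division by zero occurs.
    return {year: 100.0 * invalid[year] / totals[year] for year in totals}


if __name__ == "__main__":
    # Toy data with made-up example.com URLs; no network access is performed.
    sample = [
        (1994, "See http://a.example.com/x and http://b.example.com/y."),
        (1999, "Data at http://c.example.com/z"),
    ]
    still_reachable = {"http://a.example.com/x", "http://c.example.com/z"}
    print(invalid_percentage_by_year(sample, lambda u: u in still_reachable))
```

Passing the checker in as a parameter keeps the counting logic separate from network behavior, so the same function can be run against a live checker or a recorded snapshot.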
