Digital libraries and World Wide Web sites and page persistence

Some argue that Web pages and Web sites can be collected as elements of digital or hybrid libraries; others hold that the WWW is itself a library. We begin with the assumption that Web pages and Web sites can be collected and categorized, and we explore the proposition that the WWW constitutes a library. We conclude that the Web is not a digital library. However, its component parts can be aggregated and included in digital library collections. These, in turn, can be incorporated into "hybrid libraries," that is, libraries with both traditional and digital collections. Material on the Web can be organized and managed. Native documents can be collected in situ, disseminated, distributed, catalogued, indexed, and controlled in traditional library fashion. The Web therefore is not a library, but material for library collections can be selected from the Web. That said, the Web and its component parts are dynamic. Web documents undergo two kinds of change. The first type, the type addressed in this paper, is "persistence": the continued existence or disappearance of Web pages and sites, or in a word the lifecycle of Web documents. "Intermittence" is a variant of persistence, defined as the disappearance and later reappearance of Web documents. At any given time, about five percent of Web pages are intermittent, which is to say they are gone but will return. Over time a Web collection erodes. Based on a 120-week longitudinal study of a sample of Web documents, it appears that the half-life of a Web page is somewhat less than two years and the half-life of a Web site is somewhat more than two years. That is to say, an unweeded Web document collection created two years ago would contain the same number of URLs, but only about half of those URLs would still point to content.
The second type of change Web documents experience is change in Web page or Web site content. Again based on the Web document samples, very nearly all Web pages and sites undergo some form of content change within the period of a year. Some change content very rapidly while others do so infrequently (Koehler, 1999a). This paper examines how Web documents can be efficiently and effectively incorporated into library collections. It focuses on Web document lifecycles: persistence, attrition, and intermittence. While the frequency of content change has been reported (Koehler, 1999a), the degree to which those changes affect meaning, and therefore the integrity of bibliographic representation, is not yet fully understood. The dynamics of change set Web libraries apart from the traditional library as well as from many digital libraries. This paper seeks, then, to further our understanding of the Web page and Web site lifecycle. These patterns challenge the integrity and the usefulness of libraries with Web content. However, if these dynamics are understood, they can be controlled for or managed.
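
The half-life figure can be made concrete with a small calculation. The sketch below is illustrative only: it assumes that attrition follows simple exponential decay, which the abstract does not state, and it uses a round half-life of 2.0 years as a stand-in for the reported "somewhat less than" and "somewhat more than" two years. The function names and the collection size are hypothetical.

```python
def surviving_fraction(age_years: float, half_life_years: float) -> float:
    """Fraction of URLs expected to still point to content.

    Assumes exponential decay (an assumption of this sketch, not a
    result reported in the paper).
    """
    if half_life_years <= 0:
        raise ValueError("half_life_years must be positive")
    if age_years < 0:
        raise ValueError("age_years must be non-negative")
    return 0.5 ** (age_years / half_life_years)


def expected_live_urls(collection_size: int, age_years: float,
                       half_life_years: float) -> float:
    """Expected number of URLs in an unweeded collection that still resolve."""
    return collection_size * surviving_fraction(age_years, half_life_years)


if __name__ == "__main__":
    # Hypothetical collection of 1000 URLs with an assumed 2.0-year half-life.
    for age in (0.0, 1.0, 2.0, 4.0):
        live = expected_live_urls(1000, age, 2.0)
        print(f"age {age:.1f} years: about {live:.0f} of 1000 URLs still live")
```

Under these assumptions the collection keeps all 1000 URLs on record, but only about half of them still resolve at the two-year mark. The sketch ignores intermittence, so a share of the URLs counted as lost at any moment may later return.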

[1]  José-Marie Griffiths Why the Web is not a Library , 1999 .

[2]  Heting Chu Hyperlinks: How Well Do They Represent the Intellectual Content of Digital Collections? , 1997 .

[3]  Allison Woodruff,et al.  An Investigation of Documents from the World Wide Web , 1996, Comput. Networks.

[4]  Nathaniel S. Borenstein Programming as if People Mattered: Friendly Programs, Software Engineering, and Other Noble Delusions , 1991 .

[5]  Stephen Pinfield,et al.  Realizing the Hybrid Library , 1998, D Lib Mag..

[6]  Wallace Koehler An Analysis of Web Page and Web Site Constancy and Permanence , 1999 .

[7]  Bob Travica,et al.  Web as Global Virtual Library: Usability of Business Sites in East and Central Europe , 1998 .

[8]  David Ellis,et al.  In search of the unknown user: indexing, hypertext and the world wide web , 1998, J. Documentation.

[9]  Laurie S. Linsley Electronic resources: Selection and bibliographic control , 1997 .

[10]  Nicholas G. Tomaiuolo,et al.  An analysis of Internet search engines: assessment of over 200 search queries , 1996 .

[11]  Robert J. Sandusky Practical digital libraries: Books, bytes, and bucks , 1998 .

[12]  Kenneth Arnold The Electronic Librarian Is a Verb / The Electronic Library Is Not a Sentence , 1995 .

[13]  Kyle Banerjee Describing Remote Electronic Documents in the Online Catalog: Current Issues , 1998 .

[14]  Martin Dillon,et al.  Assessing Information on the Internet , 2001 .

[15]  Amanda Spink,et al.  Searching heterogeneous collections on the Web: behaviour of Excite users , 1998, Inf. Res..

[16]  Giles,et al.  Searching the world wide Web , 1998, Science.

[17]  Steven J. DeRose,et al.  Expanding the notion of links , 1989, Hypertext.

[18]  Tom Carey,et al.  Labeled, typed links as cues when reading hypertext documents , 1996 .

[19]  Wallace Koehler,et al.  An Analysis of Web Page and Web Site Constancy and Permanence , 1999, J. Am. Soc. Inf. Sci..

[20]  Fred Douglis,et al.  WebGUIDE: Querying and Navigating Changes in Web Repositories , 1996, Comput. Networks.

[21]  Wallace Koehler,et al.  Cataloging challenges in an Area Studies Virtual Library Catalog (ASVLC): Results of a case study , 1999 .

[22]  Wallace Koehler,et al.  FirstSearch and NetFirst--Web and Dial-up Access: Plus Ca Change, Plus C'est la Meme Chose? , 1996 .

[23]  Erik Jul Now That We Know the Answer, What Are the Questions? , 1998 .

[24]  Peter Ingwersen,et al.  Informetric analyses on the world wide web: methodological approaches to 'webometrics' , 1997, J. Documentation.

[25]  Anja Feldmann,et al.  Rate of Change and other Metrics: a Live Study of the World Wide Web , 1997, USENIX Symposium on Internet Technologies and Systems.

[26]  Colin Johnston,et al.  Electronic technology and its impact on libraries , 1998 .

[27]  NEWS: New Search Strategy Untangles the Web , 1998, Science.

[28]  Stephanie W. Haas,et al.  A Link Taxonomy for Web Pages , 1998 .

[29]  Priscilla Caplan Controlling E-Journals: The Internet Resources Project, Cataloging Guidelines, and USMARC , 1994 .

[30]  Saul Greenberg,et al.  Revisitation patterns in World Wide Web navigation , 1997, CHI.

[31]  Wallace Koehler,et al.  Automating the Dynamic Development and Maintenance of a Distributed Digital Collection: The Area Studies Digital Library (ASDL) , 1997 .

[32]  Tim Bray,et al.  Measuring the Web , 1996, World Wide Web J..

[33]  K. Ladizesky Libraries and associations in the Transient World : new technologies and new forms of cooperation , 1997 .

[34]  M. Forrester,et al.  Indexing in hypertext environments: the role of user models , 1995, The Indexer: The International Journal of Indexing: Volume 19, Issue 4.

[35]  Nicole Auer Bibliography on Evaluating Internet Resources , 1998 .

[36]  Craig Locatis,et al.  Searching through cyberspace: the effects of link display and link density on information retrieval from hypertext on the World Wide Web , 1998 .

[37]  Bipin C. Desai Supporting Discovery in Virtual Libraries , 1997, J. Am. Soc. Inf. Sci..