Using the web infrastructure to preserve web pages

To date, most of the focus in digital preservation has been on replicating copies of resources from the “living web” and placing them in an archive for controlled curation. Once inside an archive, the resources are subject to careful processes of refreshing (making additional copies on new media) and migrating (converting to new formats and applications). For small numbers of resources of known value, this is a practical and worthwhile approach. However, the infrastructure costs (storage, networks, machines) and, more importantly, the human management costs make this approach unsuitable for web-scale preservation, forcing difficult decisions about what is saved and what is not. We provide an overview of our ongoing research projects that use the “web infrastructure” to provide preservation capabilities for web pages, and we examine the overlap these approaches have with the field of information retrieval. The common characteristic of the projects is that they creatively employ the web infrastructure to provide shallow but broad preservation capability for all web pages. These approaches are not intended to replace conventional archiving; rather, they aim to provide at least some archival capability for the mass of web pages that may prove to have value in the future. We characterize the approaches by the level of effort required of the web administrator: web sites are reconstructed from the caches of search engines (“lazy preservation”); lexical signatures are used to find the same or similar pages elsewhere on the web (“just-in-time preservation”); resources are pushed to other sites using NNTP newsgroups and SMTP email attachments (“shared infrastructure preservation”); and an Apache module provides OAI-PMH access to MPEG-21 DIDL representations of web pages (“web server enhanced preservation”).
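
To illustrate how a lexical signature of the kind used for “just-in-time preservation” might be derived, the sketch below scores a page's terms by TF-IDF and keeps the top few as a search-engine query for relocating a lost or moved page. This is a minimal sketch under stated assumptions: the tokenizer, the small example corpus used for document frequencies, and the five-term signature length are illustrative choices, not the authors' implementation.

```python
# Minimal sketch of lexical-signature generation for relocating a lost page.
# Assumptions (not from the paper): simple regex tokenization, a tiny in-memory
# corpus for document frequencies, smoothed IDF, and a five-term signature.
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokens; a real system would also strip markup and stop words."""
    return re.findall(r"[a-z]+", text.lower())


def lexical_signature(page_text, corpus_texts, k=5):
    """Return the k terms of page_text with the highest TF-IDF scores."""
    docs = [set(tokenize(t)) for t in corpus_texts]
    n_docs = len(docs)
    page_terms = tokenize(page_text)
    tf = Counter(page_terms)

    def idf(term):
        df = sum(1 for d in docs if term in d)
        return math.log((1 + n_docs) / (1 + df)) + 1  # smoothed inverse document frequency

    scored = {t: (count / len(page_terms)) * idf(t) for t, count in tf.items()}
    return [t for t, _ in sorted(scored.items(), key=lambda x: -x[1])[:k]]


if __name__ == "__main__":
    corpus = [
        "digital preservation of web pages using search engine caches",
        "crawling policies for incremental web crawlers",
        "peer to peer storage systems for archival data",
    ]
    lost_page = ("lazy preservation reconstructs lost websites by crawling "
                 "the caches of commercial search engines")
    print(lexical_signature(lost_page, corpus))
```

In practice, the selected terms would be submitted as a query to a web search engine, and the results compared against the missing page to decide whether the same page, or a sufficiently similar one, has been found elsewhere on the web.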
