Just-in-time recovery of missing web pages

We present Opal, a lightweight framework for interactively locating missing web pages (HTTP status code 404). Opal is an example of "in vivo" preservation: harnessing the collective behavior of web archives, commercial search engines, and research projects for the purpose of preservation. Opal servers learn from their experiences and share their knowledge with other Opal servers through mutual harvesting using the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). From cached copies found on the web, Opal creates lexical signatures, which are then used to search for similar versions of the missing page. We present the architecture of the Opal framework, discuss a reference implementation, and report a quantitative analysis indicating that Opal could be effectively deployed.
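As an illustration of the lexical-signature idea, the following is a minimal sketch rather than Opal's actual implementation. It assumes a signature is the k terms of a cached page with the highest TF-IDF weight, computed against a small reference corpus. The function names `tokenize` and `lexical_signature`, the default of k = 5, and the smoothed IDF formula are all assumptions made for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def lexical_signature(page_text, corpus, k=5):
    """Return the k terms of page_text with the highest TF-IDF weight.

    corpus is a list of reference documents (plain strings) used only to
    estimate document frequencies. The IDF smoothing is an assumption of
    this sketch, not a formula taken from the paper.
    """
    tf = Counter(tokenize(page_text))
    n_docs = len(corpus)
    doc_terms = [set(tokenize(doc)) for doc in corpus]
    scores = {}
    for term, count in tf.items():
        # Number of reference documents that contain the term.
        df = sum(1 for terms in doc_terms if term in terms)
        # Smoothed IDF: terms present in every document score zero.
        idf = math.log((n_docs + 1) / (df + 1))
        scores[term] = count * idf
    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k]
```

The resulting terms could then be joined into a query string and submitted to a search engine to look for pages similar to the cached copy; how Opal weights terms and issues queries is described in the paper itself, not in this sketch.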
