Integrating preservation functions into the web server

Digital preservation of the World Wide Web poses unique challenges, different from the preservation issues facing professional digital libraries. The complete list of a website's resources cannot be cited with confidence, and the HyperText Transfer Protocol (HTTP) provides only a bare minimum of metadata with each resource transfer; HTTP is optimized for access today rather than tomorrow. In short, the Web suffers from a Counting Problem and a Representation Problem. Refreshing the bits, migrating from an obsolete file format to a newer one, and other classic digital preservation problems also affect the Web, and as digital collections devise solutions to these problems, the Web will benefit as well. But the core World Wide Web problems of Counting and Representation need a targeted solution.

As the host of web content, the web server is uniquely positioned to assist in the preservation of the resources it serves: it recognizes the resources it has and knows what kind of resources they are. This dissertation presents research in which preservation functions have been integrated into the web server itself to produce archive-ready versions of the website's resources. The proposed approach addresses the Counting Problem through the use of Sitemaps, created from a combination of crawling, Sitemap tools, and log analysis. The Representation Problem is addressed by a preservation-preparation module installed on the web server. The module enables each resource to be packaged together with the output from a variety of relevant metadata utilities, creating the archive-ready version of the resource. The CRATE Model defines a simple XML structure for the creation and delivery of such resources. A series of experiments evaluating CRATE, Sitemaps, and extemporaneous metadata analysis of resources is presented, along with a technical review of the MODOAI web server module, which acts as the preservation agent. The feasibility of this approach is demonstrated by a quantitative analysis of its use in a commercial web testing environment.
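To make the packaging idea concrete, the following is a minimal Python sketch of bundling one resource with the output of a few metadata utilities into a single XML document. It is an illustration only, not the CRATE schema or the MODOAI implementation: the element names (`package`, `identifier`, `metadata`, `utility`, `content`), the helper functions, and the three stand-in "utilities" (a SHA-256 digest, a MIME type guess, and a byte count) are all assumptions chosen for the example.

```python
import base64
import hashlib
import mimetypes
import xml.etree.ElementTree as ET


def run_metadata_utilities(data: bytes, name: str) -> dict:
    """Hypothetical stand-ins for metadata utilities; each yields a text report."""
    guessed_type, _ = mimetypes.guess_type(name)
    return {
        "sha256": hashlib.sha256(data).hexdigest(),
        "mime-guess": guessed_type or "application/octet-stream",
        "size-bytes": str(len(data)),
    }


def build_package(url: str, data: bytes) -> str:
    """Wrap a resource and its utility output in one XML document.

    The element names below are illustrative, not taken from the CRATE Model.
    """
    root = ET.Element("package")
    ET.SubElement(root, "identifier").text = url

    # One child element per utility, holding that utility's raw output.
    metadata = ET.SubElement(root, "metadata")
    for tool, output in run_metadata_utilities(data, url).items():
        ET.SubElement(metadata, "utility", name=tool).text = output

    # Base64 keeps arbitrary bytes safe inside the XML text node.
    content = ET.SubElement(root, "content", encoding="base64")
    content.text = base64.b64encode(data).decode("ascii")
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_package("http://example.org/index.html", b"<html>hi</html>"))
```

In this sketch the utility output is stored verbatim next to the encoded resource, so an archive receiving the package can interpret the metadata later without the server having to validate or reconcile it at the time of delivery.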

[61]  Clay Shirky,et al.  AIHT: Conceptual Issues from Practical Tests , 2005, D Lib Mag..

[62]  David Bearman Reality and Chimeras in the Preservation of Electronic Records , 1999, D Lib Mag..

[63]  Mo Chen,et al.  A practical system of keyphrase extraction for web pages , 2005, CIKM '05.

[64]  Rebecca S. Guenther,et al.  MODS: The Metadata Object Description Schema , 2003 .

[65]  Steven Pemberton,et al.  RDFa in XHTML: Syntax and Processing , 2008 .

[66]  James A. Hendler,et al.  The Semantic Web" in Scientific American , 2001 .

[67]  M. Hauben,et al.  Netizens: On the History and Impact of Usenet and the Internet , 1998, First Monday.

[68]  Andrew Tomkins,et al.  The volume and evolution of web page templates , 2005, WWW '05.

[69]  Martin Bergman,et al.  The deep web:surfacing the hidden value , 2000 .

[70]  Gary D. Scudder,et al.  On the selection of efficient record segmentations and backup strategies for large shared databases , 1984, TODS.

[71]  Lois Mai Chan,et al.  Inter-Indexer Consistency in Subject Cataloging. , 1989 .

[72]  Betty Furrie,et al.  Understanding Marc Bibliographic: Machine-Readable Cataloging , 2003 .

[73]  Kevin Hemenway,et al.  Spidering Hacks , 2003 .

[74]  Mohammad Zubair,et al.  Search engine coverage of the OAI-PMH corpus , 2006, IEEE Internet Computing.

[75]  Roy T. Fielding,et al.  Hypertext Transfer Protocol - HTTP/1.0 , 1996, RFC.

[76]  Vivian Cothey,et al.  Web-crawling reliability , 2004, J. Assoc. Inf. Sci. Technol..

[77]  Herbert Van de Sompel,et al.  The Santa Fe Convention of the Open Archives Initiative , 2000, D Lib Mag..

[78]  Filippo Menczer,et al.  Crawling the Web , 2004, Web Dynamics.

[79]  Jerome McDonough,et al.  METS: standardized encoding for digital library objects , 2006, International Journal on Digital Libraries.

[80]  Hector Garcia-Molina,et al.  Effective page refresh policies for Web crawlers , 2003, TODS.

[81]  Clifford A. Lynch,et al.  When documents deceive: Trust and provenance as new factors for information retrieval in a tangled web , 2001, J. Assoc. Inf. Sci. Technol..

[82]  Michael L. Nelson,et al.  How much preservation do I get if I do absolutely nothing? Using the Web Infrastructure for Digital Preservation , 2007 .

[83]  Rick Bennett,et al.  Trends in the Evolution of the Public Web: 1998 - 2002 , 2003, D Lib Mag..

[84]  Herbert Van de Sompel,et al.  Object Re-Use & Exchange: A Resource-Centric Approach , 2008, ArXiv.

[85]  Darren R. Hardy,et al.  Customized information extraction as a basis for resource discovery , 1996, TOCS.

[86]  Luke Rodgers What is RSS , 2008 .

[87]  Philip R. Zimmermann,et al.  The official PGP user's guide , 1996 .

[88]  B. Huberman,et al.  The Deep Web : Surfacing Hidden Value , 2000 .

[89]  William J. Broad US Web Archive Is Said to Reveal a Nuclear Primer , 2006 .

[90]  Geoffrey M. Voelker,et al.  Characterization of a Large Web Site Population with Implications for Content Delivery , 2004, WWW '04.

[91]  Hector Garcia-Molina,et al.  Estimating frequency of change , 2003, TOIT.

[92]  Charles F. Thomas,et al.  Who Will Create The Metadata For The Internet? , 1998, First Monday.

[93]  Ricardo A. Baeza-Yates,et al.  Characterization of national Web domains , 2007, TOIT.

[94]  Michael L. Nelson,et al.  Factors affecting website reconstruction from the web infrastructure , 2007, JCDL '07.

[95]  Michael L. Nelson,et al.  Using the web infrastructure to preserve web pages , 2007, International Journal on Digital Libraries.

[96]  Michael L. Nelson,et al.  Generating best-effort preservation metadata for web resources at time of dissemination , 2007, JCDL '07.

[97]  David M. Levy,et al.  Heroic measures: reflections on the possibility and purpose of digital preservation , 1998, DL '98.

[98]  Stuart Weibel Metadata: the foundations of resource description , 1995, D Lib Mag..

[99]  Michael L. Nelson,et al.  Repository Replication Using NNTP and SMTP , 2006, ECDL.

[100]  Michael L. Nelson,et al.  Brass: A queueing manager for Warrick , 2007 .

[101]  Sriram Raghavan,et al.  WebBase: a repository of Web pages , 2000, Comput. Networks.

[102]  Roy T. Fielding,et al.  Hypertext Transfer Protocol - HTTP/1.0 , 1996, RFC.

[103]  Michael L. Nelson,et al.  Observed Web Robot Behavior on Decaying Web Subsites , 2006, D Lib Mag..

[104]  Herbert Van de Sompel,et al.  Representing digital assets usingMPEG-21 Digital Item Declaration , 2005, International Journal on Digital Libraries.

[105]  Andrew H. Mutz,et al.  Transparent Content Negotiation in HTTP , 1998, RFC.

[106]  Marc Najork,et al.  A large‐scale study of the evolution of Web pages , 2003, WWW '03.

[107]  Hector Garcia-Molina,et al.  Efficient Crawling Through URL Ordering , 1998, Comput. Networks.

[108]  Herbert Van de Sompel,et al.  A Standards-based Solution for the Accurate Transfer of Digital Assets , 2005, D Lib Mag..

[109]  Herbert Van de Sompel,et al.  IJDL special issue on complex digital objects: Guest editors' introduction , 2005, International Journal on Digital Libraries.

[110]  Michael L. Nelson,et al.  A Quantitative Evaluation of Dissemination-Time Preservation Metadata , 2008, ECDL.

[111]  Herbert Van de Sompel,et al.  Open Archives Initiative - Protocol for Metadata Harvesting - Guidelines for Repository Implementers , 2005 .

[112]  Simon Josefsson,et al.  The Base16, Base32, and Base64 Data Encodings , 2003, RFC.

[113]  Clifford A. Lynch,et al.  Canonicalization: A Fundamental Tool to Facilitate Preservation and Management of Digital Information , 1999, D-Lib Magazine.

[114]  Petros Zerfos,et al.  Downloading textual hidden web content through keyword queries , 2005, Proceedings of the 5th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL '05).

[115]  Hector Garcia-Molina,et al.  Finding replicated Web collections , 2000, SIGMOD '00.

[116]  Herbert Van de Sompel,et al.  Resource Harvesting within the OAI-PMH Framework , 2004, D Lib Mag..

[117]  William Y. Arms Preservation of Scientific Serials: Three Current Examples , 1999 .

[118]  Tim Berners-Lee,et al.  Uniform Resource Locators (URL) , 1994, RFC.

[119]  Robert Wilensky,et al.  A framework for distributed digital object services , 2006, International Journal on Digital Libraries.

[120]  Nathaniel S. Borenstein,et al.  MIME (Multipurpose Internet Mail Extensions) Part One: Mechanisms for Specifying and Describing the Format of Internet Message Bodies , 1992, RFC.

[121]  Eric Miller,et al.  An Introduction to the Resource Description Framework , 1998, D Lib Mag..

[122]  Ling Liu,et al.  Probe, cluster, and discover: focused extraction of QA-Pagelets from the deep Web , 2004, Proceedings. 20th International Conference on Data Engineering.

[123]  Clay Shirky Library of Congress Archive Ingest and Handling Test (AIHT) Final Report , 2006 .

[124]  Yuval Shavitt,et al.  Constrained mirror placement on the Internet , 2001, Proceedings IEEE INFOCOM 2001. Conference on Computer Communications. Twentieth Annual Joint Conference of the IEEE Computer and Communications Society (Cat. No.01CH37213).

[125]  Marc Najork,et al.  Mercator: A scalable, extensible Web crawler , 1999, World Wide Web.