Analysis of lexical signatures for improving information persistence on the World Wide Web

A <i>lexical signature</i> (LS) consisting of several key words from a Web document is often sufficient information for finding the document later, even if its URL has changed. We conduct a large-scale empirical study of nine methods for generating lexical signatures, including Phelps and Wilensky's original proposal (PW), seven of our own static variations, and one new dynamic method. We examine their performance on the Web over a 10-month period, and on a TREC data set, evaluating their ability to both (1) uniquely identify the original (possibly modified) document, and (2) locate other relevant documents if the original is lost. Lexical signatures chosen to minimize document frequency (DF) are good at unique identification but poor at finding relevant documents. PW works well on the relatively small TREC data set, but behaves almost identically to DF on the Web, which contains billions of documents. Term-frequency (TF) based lexical signatures are very easy to compute and often perform well, but are highly dependent on the ranking system of the search engine used. The term-frequency inverse-document-frequency (TFIDF) based method and hybrid methods (which combine DF with TF or TFIDF) seem to be the most promising candidates among static methods for generating effective lexical signatures. We propose a dynamic LS generator called <i>Test & Select</i> (TS) to mitigate LS conflict. TS outperforms all eight static methods in terms of both extracting the desired document and finding relevant information, over three different search engines. All LS methods show significant performance degradation as documents in the corpus are edited.
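
As a concrete illustration of the static weighting schemes compared above, the following sketch (ours, not the authors' implementation) ranks a document's terms by TF, DF, or TFIDF and keeps the top few as its lexical signature. The corpus statistics `doc_freq` and `num_docs`, the signature length `k`, and the toy corpus are illustrative assumptions only; the five-term signature length follows Phelps and Wilensky's proposal.

import math
import re
from collections import Counter

def lexical_signature(text, doc_freq, num_docs, k=5, method="tfidf"):
    """Return the k terms of `text` ranked by the chosen weighting scheme.

    method: "tf"    -- highest raw term frequency in the document
            "df"    -- lowest document frequency in the corpus (rarest terms)
            "tfidf" -- term frequency weighted by log(N / document frequency)
    """
    terms = re.findall(r"[a-z0-9]+", text.lower())
    tf = Counter(terms)

    def score(term):
        df = doc_freq.get(term, 1)  # unseen terms treated as very rare
        if method == "tf":
            return tf[term]
        if method == "df":
            return -df              # fewer documents containing the term ranks higher
        return tf[term] * math.log(num_docs / df)

    ranked = sorted(tf, key=score, reverse=True)
    return ranked[:k]

# Example with a toy three-document "corpus" supplying the document frequencies.
corpus = [
    "robust hyperlinks cost just five words each",
    "the evolution of the web and implications for an incremental crawler",
    "lexical signatures help relocate documents whose urls have changed",
]
doc_freq = Counter(t for doc in corpus for t in set(re.findall(r"[a-z0-9]+", doc)))
print(lexical_signature(corpus[2], doc_freq, num_docs=len(corpus), k=5, method="tfidf"))

A hybrid scheme of the kind described above would simply fill some signature slots from the DF ranking and the rest from the TF or TFIDF ranking; a dynamic scheme such as Test & Select would additionally query a search engine and adjust the signature when it fails to single out the intended document.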
