Analysis of lexical signatures for finding lost or related documents

A lexical signature of a web page is often sufficient for finding the page, even if its URL has changed. We conduct a large-scale empirical study of eight methods for generating lexical signatures, including Phelps and Wilensky's [14] original proposal (PW) and seven of our own variations. We examine their performance on the web and on a TREC data set, evaluating their ability both to uniquely identify the original document and to locate other relevant documents if the original is lost. Lexical signatures chosen to minimize document frequency (DF) are good at unique identification but poor at finding relevant documents. PW works well on the relatively small TREC data set, but behaves almost identically to DF on the web, which contains billions of documents. Term-frequency-based lexical signatures (TF) are very easy to compute and often perform well, but are highly dependent on the ranking system of the search engine used. In general, TFIDF-based methods and hybrid methods (which combine DF with TF or TFIDF) seem to be the most promising candidates for generating effective lexical signatures.
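To make the families of methods named above concrete, the following is a minimal sketch of how a k-term lexical signature might be selected under TF, DF, TFIDF, and a hybrid rule. It is not the paper's implementation: the function names (`tokenize`, `rank_terms`, `signature`, `hybrid_signature`), the tokenizer, the logarithmic IDF formula, the alphabetical tie-breaking, and the hybrid split (`n_df` terms chosen by DF, the remainder by TF or TFIDF) are all assumptions made for illustration. Document frequencies are estimated from a small in-memory corpus rather than from a search engine.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_terms(doc, corpus, method):
    """Return the document's distinct terms, best signature candidates first."""
    tf = Counter(tokenize(doc))
    doc_sets = [set(tokenize(d)) for d in corpus]
    # Document frequency of each term, floored at 1 to avoid division by zero.
    df = {t: max(sum(t in s for s in doc_sets), 1) for t in tf}
    n_docs = len(corpus)

    if method == "tf":        # most frequent terms in the document first
        key = lambda t: (-tf[t], t)
    elif method == "df":      # rarest terms in the corpus first
        key = lambda t: (df[t], t)
    elif method == "tfidf":   # highest tf * log(N / df) first (assumed weighting)
        key = lambda t: (-tf[t] * math.log(n_docs / df[t]), t)
    else:
        raise ValueError(f"unknown method: {method}")
    # Ties are broken alphabetically so the result is deterministic.
    return sorted(tf, key=key)


def signature(doc, corpus, k=5, method="tfidf"):
    """Pick the top-k terms under a single ranking method."""
    return rank_terms(doc, corpus, method)[:k]


def hybrid_signature(doc, corpus, k=5, n_df=2, rest="tf"):
    """Hypothetical hybrid: n_df terms by DF, then fill up to k terms by `rest`."""
    chosen = rank_terms(doc, corpus, "df")[:n_df]
    for term in rank_terms(doc, corpus, rest):
        if len(chosen) >= k:
            break
        if term not in chosen:
            chosen.append(term)
    return chosen


if __name__ == "__main__":
    corpus = [
        "the cat sat on the mat",
        "the dog sat on the log",
        "the cat chased the dog",
    ]
    for method in ("tf", "df", "tfidf"):
        print(method, signature(corpus[0], corpus, k=2, method=method))
    print("hybrid", hybrid_signature(corpus[0], corpus, k=3, n_df=1))
```

On this toy corpus the TF ranking favors the very common word "the", while the DF and TFIDF rankings favor the rare word "mat". That mirrors the trade-off described in the abstract: rare terms help pin down one specific document, whereas frequent terms describe its topic more broadly.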

[1] James Casey, et al. WebLinker, a Tool for Managing WWW Cross-References, 1995, Comput. Networks ISDN Syst.

[2] William Y. Arms, et al. An Architecture for Information in Digital Libraries, 1997, D Lib Mag.

[3] Hermann A. Maurer, et al. The Hyper-G Network Information System, 1996.

[4] Mark C. Little, et al. Fixing the "Broken-Link" Problem: The W3Objects Approach, 1996, Comput. Networks.

[5] Roy T. Fielding, et al. Maintaining Distributed Hypertext Infostructures: Welcome to MOMspider's Web, 1994, Comput. Networks ISDN Syst.

[6] Robert Wilensky, et al. Robust Hyperlinks: Cheap, Everywhere, Now, 2000, DDEP/PODDP.

[7] Ellen M. Voorhees, et al. Evaluation by highly relevant documents, 2001, SIGIR '01.

[8] Jaana Kekäläinen, et al. IR evaluation methods for retrieving highly relevant documents, 2000, SIGIR '00.

[9] David M. Pennock, et al. Persistence of Web References in Scientific Research, 2001, Computer.

[10] C. Lee Giles, et al. Accessibility of information on the Web, 2000, INTL.

[11] Erik Wilde, et al. Extended link visualization with DHTML: The Web as an open hypermedia system, 2002.

[12] Mark C. Little, et al. W3Objects: Bringing Object-Oriented Technology to the Web, 1995, WWW.

[13] C. Lee Giles, et al. Digital Libraries and Autonomous Citation Indexing, 1999, Computer.

[14] Ian H. Witten, et al. Managing gigabytes 2nd edition, 1999.

[15] Andrew V. Goldberg, et al. Towards an archival Intermemory, 1998, Proceedings IEEE International Forum on Research and Technology Advances in Digital Libraries -ADL'98-.

[16] James E. Pitkow. Summary of WWW characterizations, 2004, World Wide Web.

[17] Karen R. Sollins, et al. Functional Requirements for Uniform Resource Names, 1994, RFC.

[18] Giles, et al. Searching the world wide Web, 1998, Science.