Evaluating strategies for similarity search on the web

Finding pages on the Web that are similar to a query page (Related Pages) is an important component of modern search engines. A variety of strategies have been proposed for answering Related Pages queries, but comparative evaluation by user studies is expensive, especially when large strategy spaces must be searched (e.g., when tuning parameters). We present a technique for automatically evaluating strategies using Web hierarchies, such as the Open Directory, in place of user feedback. We apply this evaluation methodology to a mix of document representation strategies, including the use of text, anchor text, and links, and discuss the relative advantages and disadvantages of the approaches examined. Finally, we describe how to efficiently construct a similarity index from our chosen strategies, and provide sample results from our index.
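To make the idea of hierarchy-based evaluation concrete, the sketch below scores a similarity strategy by how well its ranking of pages agrees with their placement in a directory. It is an illustration under stated assumptions, not the procedure used in the paper: the directory paths, the toy pages, the use of shared path depth as ground truth, the Jaccard text strategy, and the Goodman-Kruskal gamma agreement measure are all choices made here for the example.

```python
from itertools import combinations


def common_prefix_depth(path_a, path_b):
    """Number of leading category components two directory paths share."""
    depth = 0
    for a, b in zip(path_a.split("/"), path_b.split("/")):
        if a != b:
            break
        depth += 1
    return depth


def text_strategy(page_a, page_b):
    """Hypothetical strategy: Jaccard overlap of the pages' word sets."""
    a, b = set(page_a["words"]), set(page_b["words"])
    return len(a & b) / len(a | b) if a | b else 0.0


def gamma_agreement(query, pages, strategy):
    """Goodman-Kruskal gamma between strategy scores and hierarchy closeness.

    For every pair of candidate pages, check whether the strategy orders them
    the same way as the directory does relative to the query page.
    """
    others = [p for p in pages if p is not query]
    concordant = discordant = 0
    for p, q in combinations(others, 2):
        truth = (common_prefix_depth(query["path"], p["path"])
                 - common_prefix_depth(query["path"], q["path"]))
        guess = strategy(query, p) - strategy(query, q)
        if truth == 0 or guess == 0:
            continue  # ties carry no ordering information
        if (truth > 0) == (guess > 0):
            concordant += 1
        else:
            discordant += 1
    total = concordant + discordant
    return (concordant - discordant) / total if total else 0.0


# Toy data (invented for illustration; paths mimic a directory layout).
pages = [
    {"path": "Top/Arts/Music/Jazz", "words": "jazz saxophone improvisation".split()},
    {"path": "Top/Arts/Music/Jazz", "words": "jazz trumpet improvisation".split()},
    {"path": "Top/Arts/Music/Rock", "words": "rock guitar improvisation".split()},
    {"path": "Top/Science/Physics", "words": "quantum particle energy".split()},
]

score = gamma_agreement(pages[0], pages, text_strategy)
print(score)
```

A score near 1 means the strategy ranks pages in the same order as the hierarchy, and a score near -1 means it ranks them in the opposite order. Because the measure needs no human judgments, it can be recomputed cheaply for each candidate parameter setting.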
