Combining Text- and Link-based Retrieval Methods for Web IR

The characteristics of the Web search environment, namely the document characteristics and searcher behavior on the Web, compound the problems of Information Retrieval (IR). The massive, heterogeneous, dynamic, and distributed Web document collection, together with the unpredictable and less-than-ideal querying behavior of the typical Web searcher, exacerbates conventional IR problems and diminishes the effectiveness of retrieval approaches proven under the laboratory conditions of traditional IR. At the same time, the Web is rich in sources of information that go beyond document content, such as document characteristics, hyperlinks, Web directories (e.g., Yahoo), and user statistics. Fusion IR studies have repeatedly shown that combining multiple sources of evidence can improve retrieval performance. Furthermore, in the Web search environment, retrieval approaches based on a single source of evidence have weaknesses that can hurt retrieval performance in certain situations. For example, content-based IR approaches have difficulty dealing with the diversity in vocabulary and quality of Web documents, while link-based approaches can suffer from incomplete or noisy link topology. The inadequacies of single-source Web IR approaches, coupled with the fusion hypothesis (i.e., “fusion is good for IR”), make a strong argument for combining multiple sources of evidence as a potentially advantageous retrieval strategy for Web IR. Among the various sources of evidence on the Web, we focused our TREC-10 efforts on leveraging document text and hyperlinks, and examined the effects of combining result sets as well as those of various evidence source parameters.
