Combining document representations for known-item search

This paper investigates the preconditions for successfully combining document representations formed from structural markup for the task of known-item search. Because this task closely resembles work in meta-search and data fusion, we adapt several hypotheses from those research areas and investigate them in this context. To test these hypotheses, we present a mixture-based language model and also examine many current meta-search algorithms. We find that compatible output from systems is important for successful combination of document representations. We also demonstrate that combining low-performing document representations can improve performance, but not consistently. The techniques best suited to this task are robust to the inclusion of poorly performing document representations. We also explore the variance of results across systems and its impact on fusion performance, with the surprising result that correct documents have higher variance across document representations than highly ranked incorrect documents.
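To make the idea of a mixture-based language model over document representations concrete, the following is a minimal sketch and not the paper's implementation. It assumes each document is split into named representations (the names "title" and "body" are hypothetical), that each representation gets a maximum-likelihood unigram model, and that these are combined linearly with fixed weights plus a small background model for smoothing. The function names, the weights, and the smoothing constant are all illustrative assumptions.

```python
import math
from collections import Counter


def unigram_model(tokens):
    """Maximum-likelihood unigram model for one representation's tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def mixture_log_likelihood(query, representations, weights, background,
                           bg_weight=0.1):
    """Log-likelihood of the query under a linear mixture of models.

    representations: dict mapping a representation name to its token list.
    weights: dict mapping representation name to mixture weight
             (assumed to sum to 1; the values are illustrative).
    background: dict mapping term to collection probability, used only
                for smoothing (an assumption of this sketch).
    """
    models = {name: unigram_model(toks)
              for name, toks in representations.items()}
    score = 0.0
    for term in query:
        # Background component keeps the probability above zero.
        p = bg_weight * background.get(term, 1e-9)
        # Weighted sum over the per-representation models.
        for name, lam in weights.items():
            p += (1.0 - bg_weight) * lam * models.get(name, {}).get(term, 0.0)
        score += math.log(p)
    return score


# Hypothetical toy documents, each with two representations.
doc_a = {"title": ["annual", "report"],
         "body": ["the", "company", "annual", "report", "for", "shareholders"]}
doc_b = {"title": ["contact", "us"],
         "body": ["annual", "events", "and", "contact", "details"]}

background = {"annual": 0.01, "report": 0.01}
weights = {"title": 0.5, "body": 0.5}
query = ["annual", "report"]

score_a = mixture_log_likelihood(query, doc_a, weights, background)
score_b = mixture_log_likelihood(query, doc_b, weights, background)
```

In this toy setting the document whose title and body both match the query scores higher than the one that matches only in its body. Because the combination happens inside a single probabilistic model, the per-representation outputs are on a common scale, which is one way to read the abstract's point about compatible output.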
