A results merging algorithm for distributed information retrieval environments that combines regression methodologies with a selective download phase

The problem of results merging in distributed information retrieval environments has gained significant attention in recent years. Two general approaches have been proposed. The first estimates the relevance of the documents returned by the remote collections through ad hoc methodologies (such as weighted score merging, regression, etc.), while the second downloads the documents locally, completely or partially, in order to calculate their relevance. Both approaches have advantages and disadvantages. Download methodologies are more effective, but they impose a significant overhead on the process in terms of time and bandwidth. Approaches that rely solely on estimation, on the other hand, usually depend on the remote collections reporting document relevance scores in order to achieve maximum performance. In addition, regression algorithms, which have proved more effective than weighted score merging algorithms, need a significant number of overlap documents in order to function effectively, which in practice requires multiple interactions with the remote collections. The new algorithm introduced here adaptively downloads a limited, selected number of documents from the remote collections and estimates the relevance of the rest through regression methodologies. It thus reconciles the two approaches, combining their strengths while minimizing their drawbacks: it achieves the limited time and bandwidth overhead of the estimation approaches and the increased effectiveness of the download approaches. The proposed algorithm is tested in a variety of settings, and its performance is found to be significantly better than that of the estimation approaches, while approximating that of the download approaches.
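The general idea of combining a selective download with regression can be illustrated with a minimal sketch. It is not the algorithm evaluated in the paper, whose details the abstract does not give. The sketch assumes that each remote collection returns only a ranked list of document identifiers, that a fixed set of ranks is sampled for download (the paper's selection is adaptive), and that a simple linear regression from rank to locally computed score is adequate. The names `fit_line`, `merge_results`, `score_locally`, and `sample_ranks` are hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares fit of y = a * x + b; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        # A single sampled point (or identical x values): fall back to a constant.
        return 0.0, mean_y
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x


def merge_results(
    ranked_lists: Dict[str, List[str]],
    score_locally: Callable[[str], float],
    sample_ranks: Sequence[int] = (0, 4, 9),  # assumption: fixed, not adaptive
) -> List[Tuple[str, float]]:
    """Merge per-collection rankings into one list sorted by score.

    Only the documents at `sample_ranks` are downloaded and scored locally
    (via `score_locally`); the scores of all other documents are estimated
    from a per-collection regression of local score on rank.
    """
    merged: List[Tuple[str, float]] = []
    for doc_ids in ranked_lists.values():
        if not doc_ids:
            continue
        # Keep only the sampled ranks that exist; always sample at least rank 0.
        ranks = [r for r in sample_ranks if 0 <= r < len(doc_ids)] or [0]
        downloaded = {r: score_locally(doc_ids[r]) for r in ranks}
        a, b = fit_line(list(downloaded.keys()), list(downloaded.values()))
        for rank, doc_id in enumerate(doc_ids):
            # Use the true local score when available, the estimate otherwise.
            score = downloaded.get(rank, a * rank + b)
            merged.append((doc_id, score))
    return sorted(merged, key=lambda pair: pair[1], reverse=True)
```

In this sketch the number of downloads per collection is bounded by the length of `sample_ranks`, which is where the time and bandwidth savings over a full download would come from; the quality of the merged list then depends on how well the fitted line approximates the local scores of the documents that were not downloaded.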
