Relevance and Overlap Aware Text Collection

In an environment of distributed text collections, the first step in the information retrieval process is to identify which of the available collections are most relevant to a given query and should thus be accessed to answer it. Collection selection is difficult due to the varying relevance of sources as well as the overlap between these sources. Previous collection selection methods have considered the relevance of the collections but have ignored overlap among them. They thus make the unrealistic assumption that the collections are all effectively disjoint. In this paper, we describe ROSCO, an approach for collection selection which handles collection relevance as well as overlap. We start by developing methods for estimating the statistics concerning size, relevance, and overlap that are necessary to support collection selection. We then explain how ROSCO selects text collections based upon these statistics. Finally, we demonstrate the effectiveness of ROSCO by comparing it to the major text collection selection algorithm ReDDE under a variety of scenarios.
