Relevance and Overlap Aware Text Collection Selection

In an environment of distributed text collections, the first step in the information retrieval process is to identify which of the available collections are most relevant to a given query and should therefore be accessed to answer it. Collection selection is difficult because the sources vary in relevance and overlap with one another. Previous collection selection methods have considered the relevance of the collections but have ignored the overlap among them. They thus make the unrealistic assumption that the collections are all effectively disjoint. In this paper, we describe two new approaches: (i) COSCO, which handles collection overlap, and (ii) ROSCO, which builds on COSCO to handle both collection relevance and collection overlap. We start by developing methods for estimating the size, relevance, and overlap statistics that are necessary to support collection selection. We then explain how COSCO and ROSCO select text collections based on these statistics. Finally, we demonstrate the effectiveness of COSCO and ROSCO by comparing them to major text collection selection algorithms (CORI and RDDE) under a variety of scenarios. Our evaluation is based on a set of eight testbeds drawn from online scientific paper collections that vary systematically in relevance, overlap, and size.
