Improving text collection selection with coverage and overlap statistics

In an environment of distributed text collections, the first step in the information retrieval process is to identify which of the available collections are most relevant to a given query and should therefore be accessed to answer it. We address the challenge of collection selection when there is full or partial overlap between the available text collections, a scenario that has not been examined previously despite its real-world applications. To that end, we present COSCO, a collection selection approach that uses collection-specific coverage and overlap statistics. Our experimental results show that the approach displays the desired behavior of retrieving more new results early in the collection order, and that it performs consistently and significantly better than CORI, previously considered one of the best collection selection systems.
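
To illustrate why overlap matters when ordering collections, the following is a minimal sketch of a greedy ordering that always picks the collection expected to contribute the most results not yet retrieved. It is not COSCO's actual algorithm, which the abstract does not detail: the function name `greedy_collection_order`, the toy collections, and the use of explicit document-ID sets in place of estimated coverage and overlap statistics are all assumptions made for illustration.

```python
def greedy_collection_order(results_by_collection, k):
    """Order up to k collections so that each pick adds the most unseen documents.

    results_by_collection maps a (hypothetical) collection name to the set of
    document IDs it is assumed to return for the query. A real system would
    estimate these quantities from statistics rather than know them exactly.
    """
    remaining = dict(results_by_collection)
    seen = set()   # documents already covered by earlier picks
    order = []     # (collection name, number of new documents it adds)
    while remaining and len(order) < k:
        # Residual coverage: what a collection adds beyond the documents seen.
        # Sorting the names first makes tie-breaking deterministic.
        best = max(sorted(remaining), key=lambda name: len(remaining[name] - seen))
        order.append((best, len(remaining[best] - seen)))
        seen |= remaining.pop(best)
    return order


# Hypothetical toy data: A and C overlap heavily, D is small but disjoint.
toy = {
    "A": {1, 2, 3, 4},
    "B": {3, 4, 5},
    "C": {1, 2, 3},
    "D": {6, 7},
}
order = greedy_collection_order(toy, 3)
print(order)
```

In this toy case, ranking by size alone would place B and C ahead of D, even though most of their documents are already supplied by A. Accounting for overlap promotes D, which is the "more new results early in the collection order" behavior the abstract describes.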
