Comparing the performance of collection selection algorithms

The proliferation of online information resources increases the importance of effective and efficient information retrieval in a multicollection environment. Multicollection searching consists of three parts: collection selection (also referred to as database selection), query processing, and results merging. In this work, we focus on evaluating the first step, collection selection. We present a detailed discussion of the methodology that we used to evaluate and compare collection selection approaches, covering both test environments and evaluation measures. We compare the CORI, CVV, and gGLOSS collection selection approaches using six test environments that draw on three document testbeds. We note similar performance trends among the collection selection approaches, but the CORI approach consistently outperforms the others, suggesting that effective collection selection can be achieved using limited information about each collection. The contributions of this work are both the assembled evaluation methodology and the application of that methodology to compare collection selection approaches in a standardized environment.
