Is CORI Effective for Collection Selection? An Exploration of Parameters, Queries, and Data

In distributed information retrieval, a wide range of techniques have been proposed for choosing collections to interrogate. Many of these collection-selection techniques are based on ranking the lexicons; of these, arguably the best known is the CORI collection ranking metric, which includes several parameters that, in principle, should be tuned for different data sets. However, parameters chosen in early work on CORI have been used without alteration in almost all subsequent work, despite drastic differences in the data collections. We have explored the behaviour of CORI for a range of data sets and parameter values. It appears that parameters cannot reliably be chosen for CORI: not only do the optimal choices vary between data sets, but they also vary between query types and, indeed, vary wildly within query sets. Coupled with the observation that even CORI with optimal parameters is usually less effective than other methods, we conclude that the use of CORI as a benchmark collection selection method is inappropriate.
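For concreteness, the CORI belief function (as introduced by Callan, Lu, and Croft) scores each query term against each collection; the constants 50 and 150 in the term-frequency component and the minimum belief b (commonly 0.4) are the parameters referred to above. A minimal sketch, with variable names of our own choosing:

```python
import math

def cori_belief(df, cf, cw, avg_cw, num_collections,
                b=0.4, k=50.0, scale=150.0):
    """Belief that one query term indicates one collection.

    df: documents in the collection containing the term
    cf: number of collections containing the term
    cw: total words in the collection
    avg_cw: mean collection size in words
    b, k, scale: the tunable parameters (0.4, 50, 150 in early work)
    """
    # Term-frequency component, dampened by collection size.
    T = df / (df + k + scale * cw / avg_cw)
    # Inverse collection frequency component.
    I = math.log((num_collections + 0.5) / cf) / math.log(num_collections + 1.0)
    return b + (1.0 - b) * T * I
```

A collection's score for a query is then the mean belief over the query terms, and collections are ranked by that score; tuning b, k, and scale per data set is what the experiments above explore.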
