The impact of database selection on distributed searching

The proliferation of online information resources increases the importance of effective and efficient distributed searching. Distributed searching is cast in three parts: database selection, query processing, and results merging. In this paper we examine the effect of database selection on retrieval performance. We look at retrieval performance in three different distributed retrieval testbeds and distill some general results. First, we find that good database selection can yield better retrieval effectiveness than can be achieved with a centralized database. Second, we find that good performance can be achieved when only a few sites are selected, and that performance generally increases as more sites are selected. Finally, we find that when database selection is employed, it is not necessary to maintain collection-wide information (CWI), e.g., global idf; local information can be used to achieve superior performance. This means that distributed systems can be engineered with more autonomy and less cooperation. This work suggests that improvements in database selection can lead to broader improvements in retrieval performance, even in centralized (i.e., single-database) systems: given a centralized database and a good selection mechanism, retrieval performance can be improved by conceptually decomposing that database and employing a selection step.
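
The three-part pipeline described above (database selection, query processing, results merging) can be illustrated with a minimal sketch. It is not the method evaluated in the paper: the selection score (summed fraction of documents containing each query term), the tf-idf weighting, the raw-score merge, and all function and variable names are assumptions chosen for brevity. The one point it is meant to show is that each selected database computes idf from its own documents only, so no collection-wide information is exchanged.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (an illustrative simplification)."""
    return text.lower().split()


def rank_databases(query, databases):
    """Database selection: rank databases by the summed fraction of their
    documents that contain each query term (a hypothetical scoring rule)."""
    terms = set(tokenize(query))
    scores = {}
    for name, docs in databases.items():
        doc_sets = [set(tokenize(d)) for d in docs]
        scores[name] = sum(
            sum(1 for s in doc_sets if t in s) / len(doc_sets) for t in terms
        ) if doc_sets else 0.0
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda n: (-scores[n], n))


def search_local(query, docs):
    """Query processing at one site: tf-idf where idf uses only this
    database's document frequencies (local information, no global idf)."""
    n = len(docs)
    tokenized = [tokenize(d) for d in docs]
    df = Counter(t for toks in tokenized for t in set(toks))
    query_terms = set(tokenize(query))
    results = []
    for doc, toks in zip(docs, tokenized):
        tf = Counter(toks)
        score = sum(tf[t] * math.log(1 + n / df[t]) for t in query_terms if t in tf)
        if score > 0:
            results.append((score, doc))
    return results


def distributed_search(query, databases, k):
    """Select the top-k databases, search each locally, merge by raw score."""
    selected = rank_databases(query, databases)[:k]
    merged = []
    for name in selected:
        merged.extend((s, name, d) for s, d in search_local(query, databases[name]))
    merged.sort(key=lambda r: (-r[0], r[1], r[2]))
    return selected, merged


# Toy data, invented for illustration only.
databases = {
    "sports": ["football match results", "tennis open final"],
    "tech": [
        "database selection for distributed search",
        "query processing in a database engine",
        "merging search results",
    ],
    "cooking": ["pasta recipe", "bread baking tips"],
}
selected, merged = distributed_search("database selection", databases, k=1)
```

With `k=1` only the best-ranked database is queried. Raising `k` widens the search at the cost of contacting more sites, which is the trade-off the abstract's second finding concerns.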
