A decision-theoretic approach to database selection in networked IR

In networked IR, a client submits a query to a broker, which is in contact with a large number of databases. To yield a maximum number of documents at minimum cost, the broker has to estimate the retrieval cost of each database and then decide, for each database, whether or not to use it for the current query and, if so, how many documents to retrieve from it. For this purpose, we develop a general decision-theoretic model and discuss different cost structures. Besides the costs of retrieving relevant versus nonrelevant documents, we consider the following parameters for each database: expected retrieval quality, expected number of relevant documents in the database, and cost factors for query processing and document delivery. For computing the overall optimum, a divide-and-conquer algorithm is given. If there are several brokers knowing different databases, a preselection of brokers can only be performed heuristically, but the computation of the optimum can be done similarly to the single-broker case. In addition, we derive a formula which estimates the number of relevant documents in a database based on dictionary information.
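The allocation problem described above can be illustrated with a small sketch. It is not the paper's algorithm, whose details the abstract does not give; it is only one way a divide-and-conquer computation of a cost-minimal allocation might look. The names `make_cost` and `min_cost_table`, and the cost structure used (a fixed query-processing cost plus a per-document delivery cost, with relevance ignored), are assumptions introduced here for illustration.

```python
from typing import Callable, List, Sequence

# A cost function maps "number of documents retrieved from this database"
# to the expected cost of doing so.
CostFn = Callable[[int], float]


def make_cost(query_cost: float, delivery_cost: float) -> CostFn:
    """Hypothetical cost structure (an assumption, not taken from the paper):
    nothing is charged if the database is not used; otherwise a fixed
    query-processing cost plus a per-document delivery cost."""
    def cost(k: int) -> float:
        return 0.0 if k == 0 else query_cost + delivery_cost * k
    return cost


def min_cost_table(cost_fns: Sequence[CostFn], n: int) -> List[float]:
    """Return a table where table[k] is the minimum total cost of retrieving
    exactly k documents (0 <= k <= n) from the given databases."""
    if len(cost_fns) == 1:
        # Base case: a single database must deliver all k documents.
        return [cost_fns[0](k) for k in range(n + 1)]
    # Divide: solve each half of the database list independently.
    mid = len(cost_fns) // 2
    left = min_cost_table(cost_fns[:mid], n)
    right = min_cost_table(cost_fns[mid:], n)
    # Conquer: for each k, try every split of k between the two halves.
    return [
        min(left[j] + right[k - j] for j in range(k + 1))
        for k in range(n + 1)
    ]


if __name__ == "__main__":
    db_a = make_cost(query_cost=5.0, delivery_cost=1.0)
    db_b = make_cost(query_cost=1.0, delivery_cost=2.0)
    table = min_cost_table([db_a, db_b], 10)
    print(table[3], table[10])
```

In this toy setting, a database with a low fixed cost but expensive delivery is preferable for small result sets, while a database with a high fixed cost but cheap delivery wins for large ones. A fuller model would also account for the expected number of relevant documents and the retrieval quality of each database.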
