Predicting the cost-quality trade-off for information retrieval queries: facilitating database design and query optimization

Efficient, flexible, and scalable integration of full-text information retrieval (IR) in a DBMS is not a trivial task. This holds in particular for query optimization in such a context. To facilitate the bulk-oriented behavior of database query processing, a priori knowledge of how to limit the data efficiently before query evaluation is very valuable at optimization time. The usually imprecise nature of IR querying provides an extra opportunity to limit the data by trading off the quality of the answer. In this paper we present a mathematically derived model to predict the quality implications of neglecting information before query execution. In particular, we investigate whether retrieval quality can be predicted for a document collection for which no training information is available, which is usually the case in practice. Instead, we construct a model that can be trained on other document collections for which the necessary quality information is available or can be obtained quite easily. We validate our model on several document collections and present the experimental results. These results show that our model performs quite well, even when it was not trained on the test collection itself.
