A probabilistic model of information retrieval: development and comparative experiments - Part 1

The paper combines a comprehensive account of the probabilistic model of retrieval with new systematic experiments on TREC Programme material. It presents the model from its foundations through its logical development to cover more aspects of retrieval data and a wider range of system functions. Each step in the argument is matched by comparative retrieval tests, to provide a single coherent account of a major line of research. The experiments demonstrate, for a large test collection, that the probabilistic model is effective and robust, and that it responds appropriately, with major improvements in performance, to key features of retrieval situations. Part 1 covers the foundations and the model development for document collection and relevance data, along with the test apparatus. Part 2 covers the further development and elaboration of the model, with extensive testing; it briefly considers other environment conditions and tasks and model training, and concludes with comparisons with other approaches and an overall assessment. Data and results tables for both parts are given in Part 1. Key results are summarised in Part 2. © 2000 Elsevier Science Ltd. All rights reserved.
