Using language models for information retrieval

Because of the World Wide Web, information retrieval systems are now used by millions of untrained users all over the world. The search engines that perform the information retrieval tasks often retrieve thousands of potentially interesting documents for a query. To be useful to the user, the documents should be ranked in decreasing order of relevance. This book describes a mathematical model of information retrieval based on the use of statistical language models. The approach uses simple document-based unigram models to compute, for each document, the probability that it generates the query. This probability is used to rank the documents.

The study makes the following research contributions:

* The development of a model that integrates term weighting, relevance feedback and structured queries.
* The development of a model that supports multiple representations of a request or information need by integrating a statistical translation model.
* The development of a model that supports multiple representations of a document, for instance by allowing proximity searches or searches for terms from a particular record field (e.g. a search for terms from the title).
* A mathematical interpretation of stop word removal and stemming.
* A mathematical interpretation of operators for mandatory terms, wildcards and synonyms.
* A practical comparison of a language model-based retrieval system with similar systems that are based on well-established models and term weighting algorithms in a controlled experiment.
* The application of the model to cross-language information retrieval and adaptive information filtering, and the evaluation of two prototype systems in a controlled experiment.

Experimental results on three standard tasks show that the language model-based algorithms work as well as, or better than, today's top-performing retrieval algorithms. The standard tasks investigated are ad-hoc retrieval (when there are no previously retrieved documents to guide the search), retrospective relevance weighting (finding the optimum model for a given set of relevant documents), and ad-hoc retrieval using manually formulated Boolean queries. The application to cross-language retrieval and adaptive filtering shows the practical use of structured queries and relevance feedback, respectively.
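
The ranking idea described above (score each document by the probability that its unigram model generates the query, then sort) can be illustrated with a minimal sketch. This is not the book's implementation: the function name `rank_by_query_likelihood`, the whitespace tokenizer, the toy documents, and in particular the smoothing step (linear interpolation with a collection-wide model, weight `lam`) are assumptions made here for illustration only.

```python
import math
from collections import Counter


def rank_by_query_likelihood(query, docs, lam=0.5):
    """Rank documents by log P(query | document) under a unigram model.

    Assumption (not from the abstract): each document model is smoothed by
    linear interpolation with a collection-wide unigram model, so a query
    term missing from a document does not force its probability to zero.
    """
    # Hypothetical tokenizer: lowercase and split on whitespace.
    doc_tokens = {name: text.lower().split() for name, text in docs.items()}

    # Collection-wide term counts, used as the background model.
    collection = Counter(t for toks in doc_tokens.values() for t in toks)
    collection_size = sum(collection.values())

    scores = {}
    for name, toks in doc_tokens.items():
        tf = Counter(toks)
        log_p = 0.0
        for term in query.lower().split():
            p_doc = tf[term] / len(toks) if toks else 0.0
            p_col = collection[term] / collection_size if collection_size else 0.0
            p = lam * p_doc + (1.0 - lam) * p_col
            if p == 0.0:
                # The term occurs nowhere in the collection.
                log_p = float("-inf")
                break
            log_p += math.log(p)
        scores[name] = log_p

    # Highest log-probability first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    toy_docs = {
        "d1": "language models for information retrieval",
        "d2": "cooking pasta with tomato sauce",
        "d3": "statistical language models rank documents",
    }
    for name, score in rank_by_query_likelihood("language models retrieval", toy_docs):
        print(name, round(score, 3))
```

In this toy collection, the document containing all three query terms ranks first and the unrelated document ranks last. The interpolation weight `lam` is a free parameter in this sketch; how the actual model sets or interprets it is not stated in the abstract.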
