Statistical machine learning for information retrieval

The purpose of this work is to introduce and experimentally validate a framework, based on statistical machine learning, for handling a broad range of problems in information retrieval (IR). Probably the single most important component of this framework is a parametric statistical model of word relatedness. A longstanding problem in IR has been to develop a mathematically principled model for document processing which acknowledges that one sequence of words may be closely related to another even if the pair have few (or no) words in common. Until now, the word-relatedness problem has typically been addressed with techniques like automatic query expansion [75], an often successful though ad hoc technique which artificially injects new, related words into a document for the purpose of ensuring that related documents have some lexical overlap. In the past few years, a number of novel probabilistic approaches to information processing have emerged, including the language modeling approach to document ranking suggested first by Ponte and Croft [67], the non-extractive summarization work of Mittal and Witbrock [87], and the Hidden Markov Model-based ranking of Miller et al. [61]. This thesis advances that body of work by proposing a principled, general probabilistic framework which naturally accounts for word-relatedness issues, using techniques from statistical machine learning such as the Expectation-Maximization (EM) algorithm [24]. Applying this new framework to the problem of ranking documents by relevance to a query, for instance, we discover a model that contains a version of the Ponte and Miller models as a special case, but surpasses these in its ability to recognize the relevance of a document to a query even when the two have minimal lexical overlap. (Abstract shortened by UMI.)

[1]  Rin Saunders.  The Thallium Diagnostic Workstation: Learning to Diagnose Heart Imagery from Examples , 1991, IAAI.

[2]  David D. Lewis,et al.  Threading Electronic Mail - A Preliminary Study , 1997, Inf. Process. Manag..

[3]  Richard M. Schwartz,et al.  A hidden Markov model information retrieval system , 1999, SIGIR '99.

[4]  Yllias Chali,et al.  Query-Biased Text Summarization as a Question-Answering Technique , 1999 .

[5]  James E. Rush,et al.  Improvement of automatic abstracts by the use of structural analysis , 1973, J. Am. Soc. Inf. Sci..

[6]  Dagobert Soergel,et al.  Multilingual Thesauri in Cross-Language Text and Speech Retrieval , 1997 .

[7]  Gerard Salton,et al.  Term-Weighting Approaches in Automatic Text Retrieval , 1988, Inf. Process. Manag..

[8]  Claude E. Shannon,et al.  Prediction and Entropy of Printed English , 1951 .

[9]  D. Rubin,et al.  Maximum likelihood from incomplete data via the EM - algorithm plus discussions on the paper , 1977 .

[10]  Kathleen R. McKeown,et al.  Summarization Evaluation Methods: Experiments and Analysis , 1998 .

[11]  B. Merialdo,et al.  Tagging text with a probabilistic model , 1991, [Proceedings] ICASSP 91: 1991 International Conference on Acoustics, Speech, and Signal Processing.

[12]  L. A. Miller.  The Process of Question Answering - A Computer Simulation of Cognition , 1980, CL.

[13]  M. E. Maron,et al.  On Relevance, Probabilistic Indexing and Information Retrieval , 1960, JACM.

[14]  Sean R. Eddy,et al.  Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids , 1998 .

[15]  Jade Goldstein-Stewart,et al.  Summarizing text documents: sentence selection and evaluation metrics , 1999, SIGIR '99.

[16]  JAIST,et al.  Rewriting Saves Extracted Summaries , 2002 .

[17]  Daniel Marcu,et al.  Statistics-Based Summarization - Step One: Sentence Compression , 2000, AAAI/IAAI.

[18]  Philip Resnik,et al.  Mining the Web for Bilingual Text , 1999, ACL.

[19]  John Cocke,et al.  A Statistical Approach to Machine Translation , 1990, CL.

[20]  New York Dover,et al.  ON THE CONVERGENCE PROPERTIES OF THE EM ALGORITHM , 1983 .

[21]  W. Bruce Croft,et al.  Query expansion using local and global document analysis , 1996, SIGIR '96.

[22]  M. F.,et al.  Bibliography , 1985, Experimental Gerontology.

[23]  Ciprian Chelba,et al.  A Structured Language Model , 1997, ACL.

[24]  Thomas G. Dietterich.  What is machine learning? , 2020, Archives of Disease in Childhood.

[25]  Milton Abramowitz,et al.  Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables , 1964 .

[26]  John D. Lafferty,et al.  Information retrieval as statistical translation , 1999, SIGIR '99.

[27]  John D. Lafferty,et al.  The Candide System for Machine Translation , 1994, HLT.

[28]  John D. Lafferty,et al.  Towards History-based Grammars: Using Richer Models for Probabilistic Parsing , 1993, ACL.

[29]  Adwait Ratnaparkhi,et al.  A Maximum Entropy Model for Part-Of-Speech Tagging , 1996, EMNLP.

[30]  Beatrice Santorini,et al.  Building a Large Annotated Corpus of English: The Penn Treebank , 1993, CL.

[31]  Frederick Jelinek,et al.  Statistical methods for speech recognition , 1997 .

[32]  R. Okafor Maximum likelihood estimation from incomplete data , 1987 .

[33]  Robert Miller,et al.  Just-in-time language modelling , 1998, Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '98 (Cat. No.98CH36181).

[34]  I. Good THE POPULATION FREQUENCIES OF SPECIES AND THE ESTIMATION OF POPULATION PARAMETERS , 1953 .

[35]  Eduard Hovy,et al.  Automated Text Summarization in SUMMARIST , 1997, ACL 1997.

[36]  Inderjeet Mani,et al.  Machine Learning of Generic and User-Focused Summarization , 1998, AAAI/IAAI.

[37]  Slava M. Katz,et al.  Estimation of probabilities from sparse data for the language model component of a speech recognizer , 1987, IEEE Trans. Acoust. Speech Signal Process..

[38]  Jian-Yun Nie,et al.  Cross-language information retrieval based on parallel texts and automatic mining of parallel texts from the Web , 1999, SIGIR '99.

[39]  L. Baum,et al.  An inequality and associated maximization technique in statistical estimation of probabilistic functions of a Markov process , 1972 .

[40]  A. Poritz,et al.  Hidden Markov models: a guided tour , 1988, ICASSP-88., International Conference on Acoustics, Speech, and Signal Processing.

[41]  Martin Franz,et al.  Machine translation and monolingual information retrieval (poster abstract) , 1999, SIGIR '99.

[42]  John D. Lafferty,et al.  The Weaver System for Document Retrieval , 1999, TREC.

[43]  Vibhu O. Mittal,et al.  OCELOT: a system for summarizing Web pages , 2000, SIGIR '00.

[44]  Thomas M. Cover,et al.  Elements of Information Theory , 2005 .

[45]  Vibhu O. Mittal,et al.  Bridging the lexical chasm: statistical approaches to answer-finding , 2000, SIGIR '00.

[46]  Michael Colclough.  The Process of Question Answering - A Computer Simulation of Cognition , 1979 .

[47]  Richard M. Schwartz,et al.  A Script-Independent Methodology For Optical Character Recognition , 1998, Pattern Recognit..

[48]  Robert L. Mercer,et al.  But Dictionaries Are Data Too , 1993, HLT.

[49]  Umesh V. Vazirani,et al.  An Introduction to Computational Learning Theory , 1994 .

[50]  Adam L. Berger,et al.  A Maximum Entropy Approach to Natural Language Processing , 1996, CL.

[51]  Gerard Salton,et al.  The SMART Retrieval System—Experiments in Automatic Document Processing , 1971 .

[52]  Christopher D. Manning,et al.  Enriching the Knowledge Sources Used in a Maximum Entropy Part-of-Speech Tagger , 2000, EMNLP.

[53]  Lance A. Miller,et al.  Review of The process of question answering: a computer simulation of cognition by Wendy G. Lehnert. Lawrence Erlbaum Associates 1978. , 1980 .

[54]  Frederick Jelinek,et al.  Interpolated estimation of Markov source parameters from sparse data , 1980 .

[55]  Daniel Marcu,et al.  From discourse structures to text summaries , 1997 .

[56]  David M. Magerman.  Natural Language Parsing as Statistical Pattern Recognition , 1994, ArXiv.

[57]  J. J. Rocchio,et al.  Relevance feedback in information retrieval , 1971 .

[58]  Ronald Rosenfeld,et al.  Statistical language modeling using the CMU-cambridge toolkit , 1997, EUROSPEECH.

[59]  Lalit R. Bahl,et al.  A Maximum Likelihood Approach to Continuous Speech Recognition , 1983, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[60]  Donna K. Harman,et al.  Overview of the Eighth Text REtrieval Conference (TREC-8) , 1999, TREC.

[61]  Frederick Jelinek,et al.  Exploiting Syntactic Structure for Language Modeling , 1998, ACL.

[62]  Vibhu O. Mittal,et al.  Ultra-Summarization: A Statistical Approach to Generating Highly Condensed Non-Extractive Summaries (poster abstract). , 1998, SIGIR 1999.

[63]  L. Goddard Information Theory , 1962, Nature.

[64]  Hans Peter Luhn,et al.  The Automatic Creation of Literature Abstracts , 1958, IBM J. Res. Dev..

[65]  Therese Firmin Hand,et al.  A Proposal for Task-based Evaluation of Text Summarization Systems , 1997, Workshop On Intelligent Scalable Text Summarization.

[66]  Martin F. Porter,et al.  An algorithm for suffix stripping , 1997, Program.

[67]  Francine Chen,et al.  A trainable document summarizer , 1995, SIGIR '95.

[68]  K. Sparck Jones,et al.  A Probabilistic Model of Information Retrieval : Development and Status , 1998 .

[69]  G. Casella,et al.  Statistical Inference , 2003, Encyclopedia of Social Network Analysis and Mining.

[70]  Hiroshi Maruyama,et al.  Real-time on-line unconstrained handwriting recognition using statistical methods , 1995, 1995 International Conference on Acoustics, Speech, and Signal Processing.

[71]  Efthimis N. Efthimiadis,et al.  UCLA-Okapi at TREC-2: Query Expansion Experiments , 1993, TREC.

[72]  Vibhu O. Mittal,et al.  Query-Relevant Summarization using FAQs , 2000, ACL.

[73]  Kristian J. Hammond,et al.  Question Answering from Frequently Asked Question Files: Experiences with the FAQ FINDER System , 1997, AI Mag..

[74]  Robert L. Mercer,et al.  The Mathematics of Statistical Machine Translation: Parameter Estimation , 1993, CL.