Paper information - Statistical machine learning for information retrieval


The purpose of this work is to introduce and experimentally validate a framework, based on statistical machine learning, for handling a broad range of problems in information retrieval (IR). Probably the single most important component of this framework is a parametric statistical model of word relatedness. A longstanding problem in IR has been to develop a mathematically principled model for document processing which acknowledges that one sequence of words may be closely related to another even if the pair have few (or no) words in common. Until now, the word-relatedness problem has typically been addressed with techniques like automatic query expansion [75], an often successful though ad hoc technique which artificially injects new, related words into a document to ensure that related documents have some lexical overlap. In the past few years, a number of novel probabilistic approaches to information processing have emerged, including the language modeling approach to document ranking first suggested by Ponte and Croft [67], the non-extractive summarization work of Mittal and Witbrock [87], and the Hidden Markov Model-based ranking of Miller et al. [61]. This thesis advances that body of work by proposing a principled, general probabilistic framework which naturally accounts for word-relatedness issues, using techniques from statistical machine learning such as the Expectation-Maximization (EM) algorithm [24]. Applying this new framework to the problem of ranking documents by relevance to a query, for instance, we discover a model that contains a version of the Ponte and Miller models as a special case, but surpasses these in its ability to recognize the relevance of a document to a query even when the two have minimal lexical overlap. (Abstract shortened by UMI.)

John Lafferty | Adam L. Berger

[1] Rin Saunders. The Thallium Diagnostic Workstation: Learning to Diagnose Heart Imagery from Examples, 1991, IAAI.

[2] David D. Lewis, et al. Threading Electronic Mail - A Preliminary Study, 1997, Inf. Process. Manag.

[3] Richard M. Schwartz, et al. A hidden Markov model information retrieval system, 1999, SIGIR '99.

[4] Yllias Chali, et al. Query-Biased Text Summarization as a Question-Answering Technique, 1999.

[5] James E. Rush, et al. Improvement of automatic abstracts by the use of structural analysis, 1973, J. Am. Soc. Inf. Sci.

[6] Dagobert Soergel, et al. Multilingual Thesauri in Cross-Language Text and Speech Retrieval, 1997.

[7] Gerard Salton, et al. Term-Weighting Approaches in Automatic Text Retrieval, 1988, Inf. Process. Manag.

[8] Claude E. Shannon, et al. Prediction and Entropy of Printed English, 1951.

[9] D. Rubin, et al. Maximum likelihood from incomplete data via the EM algorithm plus discussions on the paper, 1977.

[10] Kathleen R. McKeown, et al. Summarization Evaluation Methods: Experiments and Analysis, 1998.

[11] B. Merialdo, et al. Tagging text with a probabilistic model, 1991, [Proceedings] ICASSP 91: 1991 International Conference on Acoustics, Speech, and Signal Processing.

[12] L. A. Miller. The Process of Question Answering - A Computer Simulation of Cognition, 1980, CL.

[13] M. E. Maron, et al. On Relevance, Probabilistic Indexing and Information Retrieval, 1960, JACM.

[14] Sean R. Eddy, et al. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, 1998.

[15] Jade Goldstein-Stewart, et al. Summarizing text documents: sentence selection and evaluation metrics, 1999, SIGIR '99.

[16] JAIST, et al. Rewriting Saves Extracted Summaries, 2002.

[17] Daniel Marcu, et al. Statistics-Based Summarization - Step One: Sentence Compression, 2000, AAAI/IAAI.

[18] Philip Resnik, et al. Mining the Web for Bilingual Text, 1999, ACL.

[19] John Cocke, et al. A Statistical Approach to Machine Translation, 1990, CL.

[20] C. F. Jeff Wu. On the Convergence Properties of the EM Algorithm, 1983.

[21] W. Bruce Croft, et al. Query expansion using local and global document analysis, 1996, SIGIR '96.

[22] M. F., et al. Bibliography, 1985, Experimental Gerontology.

[23] Ciprian Chelba, et al. A Structured Language Model, 1997, ACL.

[24] Thomas G. Dietterich. What is machine learning?, 2020, Archives of Disease in Childhood.

[25] Milton Abramowitz, et al. Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables, 1964.

[26] John D. Lafferty, et al. Information retrieval as statistical translation, 1999, SIGIR '99.

[27] John D. Lafferty, et al. The Candide System for Machine Translation, 1994, HLT.

[28] John D. Lafferty, et al. Towards History-based Grammars: Using Richer Models for Probabilistic Parsing, 1993, ACL.

[29] Adwait Ratnaparkhi, et al. A Maximum Entropy Model for Part-Of-Speech Tagging, 1996, EMNLP.

[30] Beatrice Santorini, et al. Building a Large Annotated Corpus of English: The Penn Treebank, 1993, CL.

[31] Frederick Jelinek, et al. Statistical methods for speech recognition, 1997.

[32] R. Okafor. Maximum likelihood estimation from incomplete data, 1987.

[33] Robert Miller, et al. Just-in-time language modelling, 1998, Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '98.

[34] I. Good. The Population Frequencies of Species and the Estimation of Population Parameters, 1953.

[35] Eduard Hovy, et al. Automated Text Summarization in SUMMARIST, 1997, ACL 1997.

[36] Inderjeet Mani, et al. Machine Learning of Generic and User-Focused Summarization, 1998, AAAI/IAAI.

[37] Slava M. Katz, et al. Estimation of probabilities from sparse data for the language model component of a speech recognizer, 1987, IEEE Trans. Acoust. Speech Signal Process.

[38] Jian-Yun Nie, et al. Cross-language information retrieval based on parallel texts and automatic mining of parallel texts from the Web, 1999, SIGIR '99.

[39] L. Baum, et al. An inequality and associated maximization technique in statistical estimation of probabilistic functions of a Markov process, 1972.

[40] A. Poritz, et al. Hidden Markov models: a guided tour, 1988, ICASSP-88, International Conference on Acoustics, Speech, and Signal Processing.

[41] Martin Franz, et al. Machine translation and monolingual information retrieval (poster abstract), 1999, SIGIR '99.

[42] John D. Lafferty, et al. The Weaver System for Document Retrieval, 1999, TREC.

[43] Vibhu O. Mittal, et al. OCELOT: a system for summarizing Web pages, 2000, SIGIR '00.

[44] Thomas M. Cover, et al. Elements of Information Theory, 2005.

[45] Vibhu O. Mittal, et al. Bridging the lexical chasm: statistical approaches to answer-finding, 2000, SIGIR '00.