Maximum entropy models for natural language ambiguity resolution

This thesis demonstrates that several important kinds of natural language ambiguity can be resolved to state-of-the-art accuracy using a single statistical modeling technique based on the principle of maximum entropy. We discuss the problems of sentence boundary detection, part-of-speech tagging, prepositional phrase attachment, natural language parsing, and text categorization under the maximum entropy framework. In practice, we have found that maximum entropy models offer the following advantages:

State-of-the-art accuracy. The probability models for all of the tasks discussed perform at or near state-of-the-art accuracy, or outperform competing learning algorithms when trained and tested under similar conditions. Methods that outperform those presented here require much more supervision, in the form of additional human involvement or additional supporting resources.

Knowledge-poor features. The facts used to model the data, or features, are linguistically very simple, or "knowledge-poor", yet they succeed in approximating complex linguistic relationships.

Reusable software technology. The mathematics of the maximum entropy framework are essentially independent of any particular task, and a single software implementation can be used for all of the probability models in this thesis.

The experiments in this thesis suggest that experimenters can obtain state-of-the-art accuracy on a wide range of natural language tasks, with little task-specific effort, by using maximum entropy probability models.
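The claim that one implementation can serve every task can be illustrated with a minimal sketch. It assumes the standard conditional log-linear form, in which p(a | b) is proportional to exp(sum_j w_j f_j(a, b)) for binary features f_j, and it fits the weights by plain gradient ascent on the log-likelihood rather than by any estimation procedure described in the thesis. The names `train_maxent`, `predict`, and `boundary_features`, along with the toy sentence-boundary data, are hypothetical and exist only for illustration.

```python
import math
from collections import defaultdict


def predict(weights, context, outcomes, features):
    """Return p(outcome | context) for every outcome under the log-linear model."""
    scores = {a: sum(weights.get(f, 0.0) for f in features(context, a))
              for a in outcomes}
    top = max(scores.values())  # subtract the max score for numerical stability
    exps = {a: math.exp(s - top) for a, s in scores.items()}
    z = sum(exps.values())
    return {a: e / z for a, e in exps.items()}


def train_maxent(events, outcomes, features, epochs=200, lr=0.5):
    """Task-independent trainer: only `features` knows anything about the task.

    events   : list of (context, outcome) training pairs
    outcomes : list of all possible outcomes
    features : function (context, outcome) -> names of the active binary features
    """
    weights = defaultdict(float)
    for _ in range(epochs):
        grad = defaultdict(float)
        for context, gold in events:
            # Observed feature counts (from the annotated outcome) ...
            for name in features(context, gold):
                grad[name] += 1.0
            # ... minus the counts expected under the current model.
            probs = predict(weights, context, outcomes, features)
            for outcome, p in probs.items():
                for name in features(context, outcome):
                    grad[name] -= p
        for name, g in grad.items():
            weights[name] += lr * g / len(events)
    return weights


# Hypothetical task-specific part: knowledge-poor features for deciding
# whether a period after `prev` ends a sentence.
def boundary_features(context, outcome):
    prev, nxt = context
    return [
        f"prev={prev}&{outcome}",
        f"prev_capitalized={prev[0].isupper()}&{outcome}",
    ]


OUTCOMES = ["boundary", "no_boundary"]
events = [
    (("Mr", "Smith"), "no_boundary"),
    (("Dr", "Jones"), "no_boundary"),
    (("Mr", "Brown"), "no_boundary"),
    (("home", "The"), "boundary"),
    (("late", "She"), "boundary"),
    (("today", "It"), "boundary"),
]
weights = train_maxent(events, OUTCOMES, boundary_features)
abbreviation = predict(weights, ("Dr", "Lee"), OUTCOMES, boundary_features)
unseen_word = predict(weights, ("work", "He"), OUTCOMES, boundary_features)
```

Under these assumptions, moving to another task such as tagging or attachment would mean supplying a different feature function and outcome list, while `train_maxent` and `predict` stay unchanged.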
