Learning Probabilistic Models of Word Sense Disambiguation

This dissertation presents several new methods for supervised and unsupervised learning of word sense disambiguation models. The supervised methods search a space of probabilistic models, and the unsupervised methods rely on Gibbs sampling and the Expectation-Maximization (EM) algorithm. In both the supervised and unsupervised cases, the Naive Bayesian model is found to perform well. An explanation for this success is presented in terms of learning rates and bias-variance decompositions.
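To make the Naive Bayesian model mentioned above concrete, here is a minimal supervised sketch. It assumes a bag-of-context-words feature representation and add-one smoothing. The function names, the toy "bank" data, and the smoothing choice are all illustrative assumptions, not the feature set or estimation procedure used in the dissertation.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(examples):
    """Count senses and per-sense context features.

    examples: list of (sense, list_of_context_features) pairs.
    Returns (sense_counts, feature_counts, vocabulary).
    """
    sense_counts = Counter()
    feature_counts = defaultdict(Counter)
    vocabulary = set()
    for sense, features in examples:
        sense_counts[sense] += 1
        for feature in features:
            feature_counts[sense][feature] += 1
            vocabulary.add(feature)
    return sense_counts, feature_counts, vocabulary


def classify(model, features, alpha=1.0):
    """Return the sense maximizing log P(sense) + sum of log P(feature | sense).

    Features are treated as conditionally independent given the sense,
    which is the Naive Bayes assumption. alpha is an additive smoothing
    constant (an assumption of this sketch).
    """
    sense_counts, feature_counts, vocabulary = model
    total = sum(sense_counts.values())
    best_sense, best_score = None, -math.inf
    for sense, count in sense_counts.items():
        score = math.log(count / total)
        denom = sum(feature_counts[sense].values()) + alpha * len(vocabulary)
        for feature in features:
            if feature in vocabulary:  # ignore features never seen in training
                score += math.log((feature_counts[sense][feature] + alpha) / denom)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense


# Hypothetical toy training data for the ambiguous word "bank".
toy_examples = [
    ("finance", ["money", "loan", "deposit"]),
    ("finance", ["loan", "interest"]),
    ("river", ["water", "shore"]),
    ("river", ["fishing", "water"]),
]
toy_model = train_naive_bayes(toy_examples)
print(classify(toy_model, ["water", "fishing"]))
print(classify(toy_model, ["loan", "money"]))
```

The model has only one parameter table per feature, so it needs few training examples to estimate, which is one plausible reading of the learning-rate and bias-variance explanation the abstract refers to.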
