Contextual word similarity and estimation from sparse data

Abstract: In recent years there has been much interest in word co-occurrence relations, such as n-grams, verb–object combinations, or co-occurrence within a limited context. This paper discusses how to estimate the likelihood of co-occurrences that do not occur in the training data. We present a method that makes local analogies between each specific unobserved co-occurrence and other co-occurrences that contain similar words. These analogies are based on the assumption that similar word co-occurrences have similar values of mutual information. Accordingly, the word similarity metric captures similarities between vectors of mutual information values. Our evaluation suggests that this method performs better than existing frequency-based smoothing methods, and may provide an alternative to class-based models. A background survey is included, covering issues of lexical co-occurrence, data sparseness and smoothing, word similarity and clustering, and mutual information.
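The idea in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function names, the min/max ratio used as a similarity metric, the similarity-weighted average, and the neighbour count `k` are all assumptions made for illustration. The sketch computes mutual information for observed pairs, compares words by their vectors of mutual information values, and estimates the mutual information of an unobserved pair from pairs that contain similar words.

```python
import math
from collections import Counter


def pmi_table(pair_counts):
    """Mutual information log2(P(x,y) / (P(x) P(y))) for each observed pair."""
    total = sum(pair_counts.values())
    left, right = Counter(), Counter()
    for (x, y), c in pair_counts.items():
        left[x] += c
        right[y] += c
    return {
        (x, y): math.log2(c * total / (left[x] * right[y]))
        for (x, y), c in pair_counts.items()
    }


def similarity(pmi, a, b):
    """Assumed metric: sum of min over sum of max of positive MI values."""
    ctx_a = {y: v for (x, y), v in pmi.items() if x == a and v > 0}
    ctx_b = {y: v for (x, y), v in pmi.items() if x == b and v > 0}
    keys = set(ctx_a) | set(ctx_b)
    num = sum(min(ctx_a.get(k, 0.0), ctx_b.get(k, 0.0)) for k in keys)
    den = sum(max(ctx_a.get(k, 0.0), ctx_b.get(k, 0.0)) for k in keys)
    return num / den if den else 0.0


def estimate_pmi(pmi, x, y, k=2):
    """Estimate MI of an unobserved pair (x, y) from the k words most
    similar to x that were observed with y (similarity-weighted average).
    Returns None when no similar word provides evidence."""
    candidates = {xp for (xp, yp) in pmi if yp == y and xp != x}
    ranked = sorted(candidates, key=lambda xp: similarity(pmi, x, xp),
                    reverse=True)[:k]
    scored = [(similarity(pmi, x, xp), pmi[(xp, y)]) for xp in ranked]
    weight = sum(s for s, _ in scored)
    if weight == 0:
        return None
    return sum(s * v for s, v in scored) / weight


# Toy verb-object counts (invented for illustration only).
counts = {
    ("eat", "apple"): 4, ("eat", "bread"): 4, ("eat", "pizza"): 3,
    ("devour", "apple"): 2, ("devour", "bread"): 1,
    ("drive", "car"): 5, ("drive", "truck"): 3,
}
pmi = pmi_table(counts)
# ("devour", "pizza") never occurs; its MI is borrowed from "eat".
print(estimate_pmi(pmi, "devour", "pizza"))
```

Under the definition of mutual information, an estimated value I for the pair converts back to a probability estimate via P(x, y) = P(x) P(y) 2^I. Returning `None` when no similar word was observed with `y` is a design choice of this sketch; a real system would need some fallback for that case.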
