Clustering Words with the MDL Principle

We address the problem of automatically constructing a thesaurus by clustering words based on corpus data. We cast this as the problem of estimating a joint distribution over the Cartesian product of a partition of a set of nouns and a partition of a set of verbs, and propose a learning algorithm for this estimation based on the Minimum Description Length (MDL) Principle. We empirically compared our MDL-based method against the Maximum Likelihood Estimator in word clustering, and found that the former outperforms the latter. We also evaluated the method in PP-attachment disambiguation experiments using an automatically constructed thesaurus. Our experimental results indicate that such a thesaurus can be used to improve disambiguation accuracy.
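To make the estimation criterion concrete, the following is a minimal sketch of a two-part MDL score for one candidate pair of partitions. It is an illustration under stated assumptions, not the algorithm of the paper: the function name `description_length` is hypothetical, the probability of a class pair is assumed to be spread uniformly over the word pairs it contains, the model cost uses the common approximation of (k/2) log N bits for k free parameters and N observations, and the cost of encoding the partitions themselves is ignored.

```python
import math
from collections import Counter


def description_length(pairs, noun_classes, verb_classes):
    """Two-part code length in bits for (noun, verb) pairs.

    Assumptions (not taken from the source): uniform probability
    within each class pair, and a model cost of (k/2) * log2(N).
    """
    # Map every word to the index of the class that contains it.
    noun_of = {n: i for i, cls in enumerate(noun_classes) for n in cls}
    verb_of = {v: j for j, cls in enumerate(verb_classes) for v in cls}

    total = len(pairs)
    cell_counts = Counter((noun_of[n], verb_of[v]) for n, v in pairs)

    # Data description length: negative log-likelihood under the
    # maximum likelihood estimate of the class-pair probabilities.
    data_bits = 0.0
    for (i, j), count in cell_counts.items():
        cell_size = len(noun_classes[i]) * len(verb_classes[j])
        p_pair = (count / total) / cell_size
        data_bits -= count * math.log2(p_pair)

    # Model description length: one parameter per class pair,
    # minus one because the probabilities sum to one.
    free_params = len(noun_classes) * len(verb_classes) - 1
    model_bits = 0.5 * free_params * math.log2(total)

    return data_bits + model_bits


# Toy data: two nouns that occur equally often with the same verb.
pairs = [("a", "x"), ("b", "x")]
merged = description_length(pairs, [["a", "b"]], [["x"]])
split = description_length(pairs, [["a"], ["b"]], [["x"]])
print(merged, split)  # 2.0 2.5
```

On this toy input both partitions fit the data equally well (2 bits), but the finer partition pays 0.5 bits more for its extra parameter, so the merged partition has the shorter total description. A Maximum Likelihood criterion alone would not separate the two, which is the kind of trade-off between fit and model complexity that an MDL-based criterion is meant to capture.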