Method for Improving Automatic Word Categorization

This paper presents a new approach to automatic word categorization that improves both the efficiency of the algorithm and the quality of the resulting clusters. The unigram and bigram statistics of a corpus of about two million words are used with an efficient distance function to measure the similarity of words, and a greedy algorithm assigns the words to clusters. Notions from fuzzy clustering, such as cluster prototypes and degrees of membership, are used to form the clusters. The algorithm is unsupervised, and the number of clusters is determined at run-time.
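The abstract does not specify the distance function or the clustering procedure, so the following is only a minimal sketch of the general idea under stated assumptions, not the paper's method. It assumes each word is represented by its left and right bigram neighbour counts normalized by the word's unigram count, that distance is the Manhattan distance between these vectors, and that a word joins the nearest cluster prototype only when the distance falls below a threshold. The names `context_vectors`, `distance`, `greedy_cluster`, and `threshold` are hypothetical. Fuzzy degrees of membership are left out for brevity.

```python
from collections import Counter, defaultdict


def context_vectors(tokens):
    """Represent each word by its left/right neighbour frequencies.

    Assumption: bigram counts are divided by the word's unigram count so
    that frequent and rare words become comparable.
    """
    unigrams = Counter(tokens)
    contexts = defaultdict(Counter)
    for left, right in zip(tokens, tokens[1:]):
        contexts[left][("R", right)] += 1   # right neighbour of `left`
        contexts[right][("L", left)] += 1   # left neighbour of `right`
    return {
        word: {key: count / unigrams[word] for key, count in ctx.items()}
        for word, ctx in contexts.items()
    }


def distance(u, v):
    """Manhattan distance between two sparse vectors (an assumed choice)."""
    keys = set(u) | set(v)
    return sum(abs(u.get(k, 0.0) - v.get(k, 0.0)) for k in keys)


def greedy_cluster(vectors, threshold):
    """Greedy single pass: join the nearest prototype or open a new cluster.

    The number of clusters is not fixed in advance; it depends on the
    data and on the (hypothetical) `threshold` parameter.
    """
    clusters = []  # each cluster: {"members": [...], "prototype": {...}}
    for word in sorted(vectors):  # sorted for a deterministic order
        vec = vectors[word]
        best, best_d = None, threshold
        for cluster in clusters:
            d = distance(vec, cluster["prototype"])
            if d < best_d:
                best, best_d = cluster, d
        if best is None:
            clusters.append({"members": [word], "prototype": dict(vec)})
        else:
            # Update the prototype as the running mean of member vectors.
            n = len(best["members"])
            proto = best["prototype"]
            keys = set(proto) | set(vec)
            best["prototype"] = {
                k: (proto.get(k, 0.0) * n + vec.get(k, 0.0)) / (n + 1)
                for k in keys
            }
            best["members"].append(word)
    return clusters


# Toy corpus, far smaller than the two-million-word corpus in the paper.
tokens = "the cat sat . the dog sat . a cat ran . a dog ran .".split()
clusters = greedy_cluster(context_vectors(tokens), threshold=1.0)
for cluster in clusters:
    print(cluster["members"])
```

On this toy corpus, words that occur in the same contexts (for example "cat" and "dog") end up in the same cluster. A single greedy pass is sensitive to the order in which words are visited, which is one reason a real system would likely need a more careful assignment strategy.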
