Automatic Retrieval and Clustering of Similar Words

Bootstrapping semantics from text is one of the greatest challenges in natural language learning. We first define a word similarity measure based on the distributional patterns of words. This measure allows us to construct a thesaurus automatically from a parsed corpus. We then present a new methodology for evaluating the automatically constructed thesaurus. The evaluation results show that the thesaurus is significantly closer to WordNet than the Roget Thesaurus is.
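To make the idea of a distributional similarity measure concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the measure as defined in the paper: the toy dependency triples, the helper names (`build_weights`, `similarity`, `most_similar`), and the use of positive pointwise mutual information as the feature weight are all hypothetical choices made for this example. Two words score as similar when they share informative (relation, word) contexts.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy dependency triples (word, relation, co-occurring word).
# A real system would extract these from a parsed corpus.
TRIPLES = [
    ("beer", "obj-of", "drink"),
    ("beer", "obj-of", "brew"),
    ("wine", "obj-of", "drink"),
    ("wine", "obj-of", "pour"),
    ("wine", "mod", "red"),
    ("car", "obj-of", "drive"),
    ("car", "mod", "red"),
    ("truck", "obj-of", "drive"),
]

def build_weights(triples):
    """Map each word to {feature: weight}, keeping only positive PMI weights.

    A feature is a (relation, other word) pair. PMI weighting is an
    assumption of this sketch.
    """
    pair_counts = Counter()
    word_counts = Counter()
    feat_counts = Counter()
    for word, rel, other in triples:
        feat = (rel, other)
        pair_counts[(word, feat)] += 1
        word_counts[word] += 1
        feat_counts[feat] += 1
    total = sum(pair_counts.values())

    weights = defaultdict(dict)
    for (word, feat), n in pair_counts.items():
        pmi = math.log((n * total) / (word_counts[word] * feat_counts[feat]))
        if pmi > 0:
            weights[word][feat] = pmi
    return weights

def similarity(weights, w1, w2):
    """Information carried by shared features over the total information of both words."""
    f1 = weights.get(w1, {})
    f2 = weights.get(w2, {})
    shared = set(f1) & set(f2)
    numerator = sum(f1[f] + f2[f] for f in shared)
    denominator = sum(f1.values()) + sum(f2.values())
    return numerator / denominator if denominator else 0.0

def most_similar(weights, word, k=3):
    """Return the k words most similar to `word`, as (word, score) pairs."""
    scores = [(other, similarity(weights, word, other))
              for other in weights if other != word]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]

if __name__ == "__main__":
    w = build_weights(TRIPLES)
    for target in ("beer", "car"):
        print(target, most_similar(w, target))
```

Ranking every other word by its score against a target word, as `most_similar` does, yields one thesaurus-style entry per target; repeating this over the whole vocabulary is one plausible way to assemble a thesaurus from such a measure.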
