Sense-based clustering of Polish nouns in the extraction of semantic relatedness

Constructing a wordnet from scratch requires intelligent software support. An accurate measure of semantic relatedness can be used to extract groups of semantically close words from a corpus. Such groups help a lexicographer decide on synset membership and synset placement in the network. We have adapted the well-known Clustering by Committee algorithm to Polish and tested it on the largest Polish corpus available. The evaluation used a synonymy test based on Polish WordNet (plWordNet), a resource still under development. The results are consistent with a few benchmarks, but not yet encouraging enough to make a wordnet writer's support tool immediately useful.
