Semi-supervised word sense disambiguation based on weakly controlled sense induction

Word Sense Disambiguation (WSD) in text remains a difficult problem, as the best supervised methods require laborious and costly manual preparation of training data. Unsupervised methods, on the other hand, achieve significantly lower accuracy and produce results that are not satisfactory for many applications. The goal of this work is to develop a WSD model that minimises the amount of human intervention required, but still assigns senses that come from a manually created lexical semantic resource, i.e., a wordnet. The proposed method is based on clustering text snippets that contain the words in focus. Next, a core is identified for each cluster; the core is labelled with a word sense by a human and is finally used to produce a classifier. The classifiers, constructed for each word separately, are then applied to text. A comparison showed that the approach is close in precision to a fully supervised one tested on the same data for Polish, and is much better than the most-frequent-sense baseline. Possible ways of overcoming the limited coverage of the approach are also discussed in the paper.
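
The pipeline outlined in the abstract (cluster the snippets, find a core for each cluster, have a human label the core, build a per-word classifier) can be illustrated with a minimal Python sketch. Everything specific in it is an assumption made for illustration and is not taken from the paper: the bag-of-words representation, cosine similarity, single-pass leader clustering, the "closest to the centroid" definition of a core, the nearest-prototype classifier, the thresholds, the English toy snippets and the sense labels.

```python
from collections import Counter
from math import sqrt

# Tiny illustrative stopword list (an assumption, not from the paper).
STOPWORDS = {"the", "a", "of", "on", "in", "to", "at", "and", "we", "she"}

def vectorise(snippet, target):
    """Bag-of-words counts for a snippet, without stopwords and the target word."""
    words = snippet.lower().split()
    return Counter(w for w in words if w != target and w not in STOPWORDS)

def cosine(u, v):
    """Cosine similarity of two sparse count vectors (0.0 if either is empty)."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def centroid(vectors):
    """Sum of count vectors; cosine ignores the scale, so no division is needed."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return total

def cluster(vectors, threshold=0.2):
    """Leader clustering: join the most similar cluster above threshold, else open a new one."""
    clusters = []
    for i, v in enumerate(vectors):
        best, best_sim = None, threshold
        for members in clusters:
            sim = cosine(v, centroid([vectors[j] for j in members]))
            if sim >= best_sim:
                best, best_sim = members, sim
        if best is None:
            clusters.append([i])
        else:
            best.append(i)
    return clusters

def core(members, vectors, size=2):
    """Core of a cluster: the `size` members closest to the cluster centroid."""
    cen = centroid([vectors[j] for j in members])
    return sorted(members, key=lambda j: cosine(vectors[j], cen), reverse=True)[:size]

def build_classifier(cores, labels, vectors, target, min_sim=0.1):
    """Nearest-prototype classifier built only from the human-labelled cores."""
    prototypes = {}
    for cluster_id, members in cores.items():
        proto = prototypes.setdefault(labels[cluster_id], Counter())
        proto.update(centroid([vectors[j] for j in members]))

    def classify(snippet):
        v = vectorise(snippet, target)
        sense, sim = max(((s, cosine(v, p)) for s, p in prototypes.items()),
                         key=lambda pair: pair[1])
        # Abstain when nothing is similar enough: this models limited coverage.
        return sense if sim >= min_sim else None

    return classify

# Toy data for the hypothetical target word "bank".
snippets = [
    "she deposited money at the bank account",
    "the bank approved the loan and money transfer",
    "the bank raised loan interest on the account",
    "we walked along the river bank near the water",
    "fish swim close to the river bank in shallow water",
]
vectors = [vectorise(s, "bank") for s in snippets]
clusters = cluster(vectors)
cores = {k: core(members, vectors) for k, members in enumerate(clusters)}

# Stand-in for the human step: one sense label per cluster core (hypothetical labels).
labels = {0: "bank/financial-institution", 1: "bank/river-side"}
classify = build_classifier(cores, labels, vectors, "bank")

print(clusters)
print(classify("he took a loan from the bank"))
print(classify("the bank holiday was sunny"))
```

In this sketch the only manual effort is one label per cluster core, which mirrors the reduced human intervention the abstract aims for. The classifier abstains (returns `None`) on snippets that resemble no labelled core, a simple way to picture the limited coverage mentioned above.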
