Using Word Embeddings for Bilingual Unsupervised WSD

Unsupervised Word Sense Disambiguation (WSD) is one of the challenging problems in natural language processing. Recently, an unsupervised bilingual WSD approach has been proposed. This approach uses a context-aware EM formulation to estimate the sense distribution from the co-occurrence counts of cross-linked words in comparable corpora. WordNet-based similarity measures are used to approximate the co-occurrence counts. In this paper, we extend the existing approach by exploring the feasibility of using Word Embeddings to approximate these counts. We evaluated our approach on the Hindi-Marathi language pair in the Health domain. Using the combination of Word Embeddings and WordNet-based similarity measures, we observed 8.5% and 2.5% improvements in the F-score of verbs and adjectives respectively for Marathi, and a 7% improvement in the F-score of adjectives for Hindi. The experiments show that the combination of Word Embeddings and WordNet-based similarity measures is a reasonable approximation for bilingual WSD.
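
To make the idea of approximating co-occurrence counts with a combined similarity score more concrete, the following is a minimal sketch. It is an illustration only: the abstract does not state how the two signals are combined, so the linear interpolation, the weight `alpha`, the function names, and the toy vectors below are all assumptions and not the method used in the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_u * norm_v)


def combined_similarity(emb_a, emb_b, wordnet_sim, alpha=0.5):
    """Hypothetical combination of the two signals (an assumption).

    emb_a, emb_b : embedding vectors of the two words
    wordnet_sim  : a WordNet-based similarity already scaled to [0, 1]
    alpha        : interpolation weight given to the embedding similarity
    """
    # Clip negative cosine values so the score can stand in for a count.
    emb_sim = max(0.0, cosine(emb_a, emb_b))
    return alpha * emb_sim + (1.0 - alpha) * wordnet_sim


# Toy, made-up vectors and WordNet score, for illustration only.
toy_vec_a = [0.9, 0.1, 0.3]
toy_vec_b = [0.8, 0.2, 0.4]
toy_wordnet_sim = 0.6

score = combined_similarity(toy_vec_a, toy_vec_b, toy_wordnet_sim)
print(f"approximate co-occurrence weight: {score:.3f}")
```

In a setup like this, the resulting score would take the place of the corpus co-occurrence count inside the EM updates; the choice of `alpha` would need to be tuned or justified empirically.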
