CO-graph: A new graph-based technique for cross-lingual word sense disambiguation

In this paper, we present a new method based on co-occurrence graphs for Cross-Lingual Word Sense Disambiguation (CLWSD). The proposed approach comprises the automatic generation of bilingual dictionaries and a new technique for constructing a co-occurrence graph, which is used to select the most suitable translations from the dictionary. Several algorithms that combine the dictionary and the co-occurrence graph are then used to select the final translations: techniques based on sub-graphs (communities) containing clusters of words with related meanings, techniques based on distances between the nodes that represent words, and techniques based on the relative importance of each node in the whole graph. The initial output of the system is enhanced with translation probabilities provided by a statistical bilingual dictionary. The system is evaluated on datasets from two competitions: Task 3 of SemEval-2010 and Task 10 of SemEval-2013. The results obtained by the different disambiguation techniques are analysed and compared with those of the systems that participated in the competitions. Our system obtains the best results among unsupervised systems in most of the experiments, and even outperforms supervised systems in some cases.
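
To make the node-importance strategy more concrete, the following is a minimal sketch and not the system described in the paper. It assumes a co-occurrence graph in which two words are linked whenever they appear in the same document, and it uses plain PageRank as a stand-in for "relative importance"; the paper's actual graph construction, weighting, and ranking may differ. The function names, the toy Spanish documents, and the candidate list for "bank" are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_cooccurrence_graph(documents):
    """Undirected weighted graph: an edge weight counts the documents
    in which the two words appear together (assumed linking rule)."""
    graph = defaultdict(lambda: defaultdict(int))
    for doc in documents:
        for u, v in combinations(sorted(set(doc)), 2):
            graph[u][v] += 1
            graph[v][u] += 1
    return graph


def pagerank(graph, damping=0.85, iterations=50):
    """Weighted PageRank by power iteration over the undirected graph."""
    nodes = list(graph)
    n = len(nodes)
    # Total edge weight leaving each node, used to normalise transitions.
    strength = {node: sum(graph[node].values()) for node in nodes}
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {}
        for node in nodes:
            incoming = sum(
                rank[nb] * weight / strength[nb]
                for nb, weight in graph[node].items()
            )
            new_rank[node] = (1.0 - damping) / n + damping * incoming
        rank = new_rank
    return rank


def select_translation(candidates, rank):
    """Pick the dictionary candidate whose node has the highest rank.
    Candidates missing from the graph get a score of 0."""
    return max(candidates, key=lambda word: rank.get(word, 0.0))


# Hypothetical target-language documents (already tokenised).
documents = [
    ["banco", "dinero", "credito"],
    ["banco", "dinero", "cuenta"],
    ["banco", "cuenta", "credito"],
    ["orilla", "rio"],
]
graph = build_cooccurrence_graph(documents)
rank = pagerank(graph)

# Hypothetical dictionary entry: candidate translations of "bank".
best = select_translation(["banco", "orilla"], rank)
```

On this toy corpus the candidate "banco" is selected because its node is linked to more words, with heavier edges, than "orilla". A context-sensitive variant would restrict or bias the ranking towards the words that surround the ambiguous term, which this sketch deliberately leaves out.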
