Combining Different Seed Dictionaries to Extract Lexicon from Comparable Corpus

In recent years, many methods have been proposed for extracting new bilingual lexicons from non-parallel (comparable) corpora. Nearly all of them use a small existing dictionary or other resource to build an initial list called the seed dictionary. In this paper we discuss using different types of dictionaries, and combinations of them, as the initial list for producing a bilingual Persian-Italian lexicon from a comparable corpus. Our experiments apply state-of-the-art techniques to four different seed dictionaries: one existing dictionary and three dictionaries created with a pivot-based schema, using English, Arabic and French as the pivot languages. An interesting challenge in our approach is finding a way to combine the different dictionaries so that they produce a better and more accurate lexicon. To combine the seed dictionaries, we propose two novel combination models and examine their effect on comparable corpora collected from news agencies. The experimental results obtained with our implementation show the efficiency of the proposed combinations.
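
As a rough illustration of the pivot-based schema mentioned above, the following Python sketch composes a source-to-pivot dictionary with a pivot-to-target dictionary, and then merges two pivot-derived dictionaries by simple voting. The function names, the toy entries, and the voting rule are all assumptions made for illustration; in particular, the voting step is a generic merge and is not one of the two combination models proposed in the paper.

```python
from collections import defaultdict


def pivot_dictionary(src_to_pivot, pivot_to_tgt):
    """Compose source->pivot and pivot->target entries into source->target.

    Both arguments map a word to a set of translations (hypothetical format).
    """
    result = defaultdict(set)
    for src_word, pivot_words in src_to_pivot.items():
        for pivot_word in pivot_words:
            # Every target translation of the pivot word becomes a candidate.
            result[src_word].update(pivot_to_tgt.get(pivot_word, ()))
    # Keep only source words that received at least one candidate.
    return {src: tgts for src, tgts in result.items() if tgts}


def combine_by_votes(dictionaries, min_votes=1):
    """Merge dictionaries, keeping pairs proposed by at least min_votes of them.

    Illustrative only: a generic voting merge, not the paper's models.
    """
    votes = defaultdict(lambda: defaultdict(int))
    for dictionary in dictionaries:
        for src_word, tgt_words in dictionary.items():
            for tgt_word in tgt_words:
                votes[src_word][tgt_word] += 1
    combined = {}
    for src_word, counts in votes.items():
        kept = {tgt for tgt, n in counts.items() if n >= min_votes}
        if kept:
            combined[src_word] = kept
    return combined


# Toy, transliterated entries (assumed data, not taken from the paper).
via_english = pivot_dictionary(
    {"ketab": {"book"}},
    {"book": {"libro", "prenotare"}},  # "book" is ambiguous (noun vs. verb)
)
via_french = pivot_dictionary(
    {"ketab": {"livre"}},
    {"livre": {"libro", "libbra"}},  # "livre" is ambiguous (book vs. pound)
)

seed = combine_by_votes([via_english, via_french], min_votes=2)
print(seed)  # {'ketab': {'libro'}}
```

The toy example shows why combining pivots can help: each pivot language introduces its own spurious candidates through ambiguous words, and a pair supported by more than one pivot is less likely to be such noise.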
