Minimally supervised techniques for bilingual lexicon extraction

Normally, word translations are extracted from non-parallel bilingual corpora, and an initial bilingual lexicon, i.e., a list of known translations, is typically used to aid the learning process. This thesis studies a series of novel techniques that rely on scarce resources. To make the study more challenging, only minimal use of resources was allowed and major linguistic tools were not employed. The thesis thus introduces techniques for learning a translation lexicon based on a minimally supervised, context-based approach. The performance of each technique was measured by comparing the extracted lexicon to a reference lexicon using the F1 score, the harmonic mean of precision and recall, which ranges from 0 (worst) to 100% (best). The proposed techniques recorded promising F1 scores ranging from 57.1% to 80.9%, indicating moderate to best performance. Overall, the findings of this study further support the use of techniques that exploit words from small corpora, suggesting that words that are contextually relevant and occur in a similar domain are potentially useful. This thesis also presents a technique for deploying additional data harvested from the web, and a novel method for measuring the similarity of features between two words of different languages without the use of an initial bilingual lexicon.
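
As a rough illustration of the evaluation described above, the sketch below scores an extracted lexicon against a reference lexicon. It assumes each lexicon is a set of (source word, target word) pairs and that a pair counts as correct only on an exact match; the function name `f1_score` and the toy English-Spanish data are hypothetical, and the exact evaluation protocol used in the thesis may differ. F1 is computed here as the harmonic mean of precision and recall.

```python
def f1_score(extracted, reference):
    """Return (precision, recall, f1) for two collections of (source, target) pairs.

    Assumption: a translation pair is correct only if it appears
    verbatim in the reference lexicon.
    """
    extracted, reference = set(extracted), set(reference)
    correct = len(extracted & reference)

    # Guard against empty lexicons to avoid division by zero.
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(reference) if reference else 0.0

    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical toy data, not taken from the thesis.
reference = {
    ("house", "casa"),
    ("dog", "perro"),
    ("cat", "gato"),
    ("book", "libro"),
    ("water", "agua"),
}
extracted = {
    ("house", "casa"),
    ("dog", "perro"),
    ("cat", "gato"),
    ("book", "libra"),  # wrong translation, counts against precision
}

precision, recall, f1 = f1_score(extracted, reference)
print(f"precision={precision:.1%} recall={recall:.1%} F1={f1:.1%}")
```

In this toy case three of the four extracted pairs are correct (precision 75%) and three of the five reference pairs are recovered (recall 60%), so the F1 score falls between the two, closer to the lower value.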
