Sub-corpora Sampling with an Application to Bilingual Lexicon Extraction

We propose a novel associative approach to bilingual word lexicon extraction (BLE) from parallel corpora that relies on the paradigm of data reduction instead of data augmentation. The key insight of the approach is the effective use of sub-corpora sampling and of the properties of low-frequency words in the task of lexicon induction, particularly in a setting where only limited parallel data are available. Word translation pairs are extracted from many smaller sub-corpora (sampled from the original corpus) according to several frequency-based similarity criteria. We prove the validity of our data sampling approach and show that the method outperforms IBM Model 1 and associative methods based on similarity scores and hypothesis testing in terms of precision and F-measure in the task of lexicon extraction. Additionally, we show that our sampling-based method can learn correct word translations from less data.
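The sampling idea described above can be illustrated with a small sketch. The abstract does not spell out the frequency-based criteria, so the criterion used here is an assumption made only for illustration: within one sampled sub-corpus, a source word and a target word are paired when both are low-frequency, occur in exactly the same sentence pairs, and no other word on either side shares that occurrence pattern. The function names and the parameters `max_freq`, `n_samples`, `sample_size`, and `min_votes` are hypothetical and are not taken from the paper.

```python
import random
from collections import Counter, defaultdict


def occurrence_sets(pairs, side):
    """Map each word on one side (0 = source, 1 = target) to the set of
    sentence-pair positions in which it occurs."""
    occ = defaultdict(set)
    for i, pair in enumerate(pairs):
        for word in set(pair[side]):
            occ[word].add(i)
    return {word: frozenset(pos) for word, pos in occ.items()}


def group_by_pattern(occ, max_freq):
    """Group low-frequency words by their occurrence pattern."""
    groups = defaultdict(list)
    for word, pos in occ.items():
        if len(pos) <= max_freq:
            groups[pos].append(word)
    return groups


def extract_from_subcorpus(pairs, max_freq=3):
    """Assumed criterion: pair a source and a target word when they are the
    only low-frequency words sharing an identical occurrence pattern."""
    src = group_by_pattern(occurrence_sets(pairs, 0), max_freq)
    tgt = group_by_pattern(occurrence_sets(pairs, 1), max_freq)
    candidates = set()
    for pattern, src_words in src.items():
        tgt_words = tgt.get(pattern, [])
        if len(src_words) == 1 and len(tgt_words) == 1:
            candidates.add((src_words[0], tgt_words[0]))
    return candidates


def sample_and_vote(corpus, n_samples=200, sample_size=3, min_votes=5, seed=0):
    """Sample many small sub-corpora, extract candidates from each, and keep
    the pairs proposed in at least `min_votes` sub-corpora."""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n_samples):
        sub = rng.sample(corpus, min(sample_size, len(corpus)))
        votes.update(extract_from_subcorpus(sub))
    return {pair: n for pair, n in votes.items() if n >= min_votes}
```

The intuition behind the sketch is that a word that is frequent in the full corpus becomes low-frequency in a small sample, so a simple co-occurrence test that is reliable only for rare words can be applied to it; aggregating over many samples then filters out accidental matches.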
