Matching Graph, a Method for Extracting Parallel Information from Comparable Corpora

Comparable corpora are a valuable alternative to expensive parallel corpora. They contain informative parallel fragments that are useful resources for a range of natural language processing tasks. In this work, a generative model is proposed for the efficient extraction of parallel fragments from a pair of comparable documents. The core of the proposed model is a graph called the Matching Graph. Because the Matching Graph can be trained on a small initial seed, it is well suited to language pairs with scarce resources. Experiments show that the Matching Graph performs significantly better than other recently published models. Experiments on the English-Persian and Arabic-Persian language pairs show that the extracted parallel fragments can be used in place of parallel data for training statistical machine translation systems. In the best case, the extracted fragments recover about 90% of the information of a statistical machine translation system trained on a parallel corpus. Moreover, using the extracted fragments as additional training data improves the BLEU score by about 2% for English-Persian and about 1% for Arabic-Persian translation.
