The TransBank Aligner: Cross-Sentence Alignment with Deep Neural Networks

Sentence-aligned parallel bilingual corpora are the main, and sometimes the only, resource required for training Statistical and Neural Machine Translation systems. We propose an end-to-end deep neural architecture for sentence alignment. Besides one-to-one alignment, our aligner can also perform cross alignment. We generated an alignment dataset from three language pairs of the Europarl corpus and an English-Persian corpus. Using this dataset, we tested our system both in isolation and as part of an SMT system. In both settings, it obtained significantly better results than two competitive baselines.
