Paper information - A Hybrid Machine Translation Framework for an Improved Translation Workflow

A Hybrid Machine Translation Framework for an Improved Translation Workflow

by Santanu Pal
Doctor of Philosophy, Computerlinguistik, Sprachwissenschaft und Sprachtechnologie, Universität des Saarlandes

Over the past few decades, due to a continuing surge in the amount of content being translated and ever-increasing pressure to deliver high-quality, high-throughput translation, the translation industry has focused on adopting advanced technologies such as machine translation (MT) and automatic post-editing (APE) in its translation workflows. Despite the progress of the technology, the roles of humans and machines essentially remain intact as MT/APE moves from the periphery of the translation field towards collaborative human-machine MT/APE in modern translation workflows. Professional translators increasingly become post-editors who correct raw MT/APE output instead of translating from scratch, which in turn increases productivity in terms of translation speed. The last decade has seen substantial growth in research and development on improving MT, usually concentrating on selected aspects of workflows, from training data pre-processing techniques to core MT processes to post-editing methods. To date, however, complete MT workflows have been investigated less than the core MT processes. In the research presented in this thesis, we investigate avenues towards achieving improved MT workflows. We study how different MT paradigms can be utilized and integrated to best effect. We also investigate how different upstream and downstream component technologies can be hybridized to achieve overall improved MT. Finally, we include an investigation into human-machine collaborative MT by bringing humans into the loop. In many (but not all) of the experiments presented in this thesis, we focus on data scenarios provided by low-resource language settings.
German Summary (Zusammenfassung)

Owing to the steadily rising volume of translation over recent decades and the simultaneously growing pressure to deliver high quality within the shortest possible time, translation service providers depend on integrating modern technologies such as machine translation (MT) and automatic post-editing (APE) into the translation workflow. Despite considerable advances in these technologies, the roles of human and machine have hardly changed. MT/APE is, however, no longer merely a marginal phenomenon; in the modern translation workflow it is increasingly used in collaboration between human and machine. Professional translators are increasingly becoming post-editors who correct MT/APE output rather than producing translations entirely from scratch, as before. This can increase productivity in terms of translation speed. Over the last decade, much has happened in research and development aimed at improving MT: incorporating the complete translation workflow, from the preparation of training data through the actual MT process to post-editing methods. From a data perspective, however, the complete translation workflow receives far less attention than the actual MT process. This dissertation investigates paths towards an ideal, or at least improved, MT workflow. In the experiments, particular attention is paid to the specific needs of low-resource languages. The dissertation investigates how different MT paradigms can be used and optimally integrated. It further shows how different upstream and downstream technology components can be adapted to generate better MT output overall. Finally, it shows how humans can be integrated into the MT workflow.

The aim of this work is to integrate various technology components into the MT workflow so as to create an improved overall workflow. Hybridization approaches are mainly used for this purpose. This work also investigates ways of effectively involving humans as post-editors. The translation process data obtained in this way

Santanu Pal
