Otedama: Fast Rule-Based Pre-Ordering for Machine Translation

Abstract: We present Otedama, a fast, open-source tool for rule-based syntactic pre-ordering, a well-established technique in statistical machine translation. Otedama implements both a learner for pre-ordering rules and a component for applying these rules to parsed sentences. Our system is compatible with several external parsers and can accommodate many source languages and all target languages in any machine translation paradigm that uses parallel training data. We demonstrate improvements on a patent translation task over a state-of-the-art English-Japanese hierarchical phrase-based machine translation system. We also compare Otedama with an existing syntax-based pre-ordering system, showing comparable translation performance with a runtime speedup by a factor of 4.5 to 10.

Benjamin Körner | Stefan Riezler | Julian Hitschler | Mayumi Ohta | Sariya Karimova | Laura Jehl
