Symbolic-to-statistical hybridization: extending generation-heavy machine translation

The last few years have witnessed an increasing interest in hybridizing surface-based statistical approaches and rule-based symbolic approaches to machine translation (MT). Much of that work focuses on extending statistical MT systems with symbolic knowledge and components. In the brand of hybridization discussed here, we go in the opposite direction: adding statistical bilingual components to a symbolic system. Our base system is generation-heavy machine translation (GHMT), a primarily symbolic, asymmetrical approach that addresses the resource poverty of interlingual MT in source-poor/target-rich language pairs by exploiting symbolic and statistical target-language resources. GHMT’s statistical components are limited to target-language models, which arguably makes it a simple form of hybrid system. We extend the hybrid nature of GHMT by adding statistical bilingual components. We also describe the details of retargeting it to Arabic–English MT, where the morphological richness of Arabic brings several challenges to the hybridization task. We conduct an extensive evaluation of multiple system variants. The evaluation shows that this new variant of GHMT, a primarily symbolic system extended with monolingual and bilingual statistical components, has a higher degree of grammaticality than a phrase-based statistical MT system, where grammaticality is measured in terms of correct verb-argument realization and long-distance dependency translation.
