Complementary approaches to tree alignment. Combining statistical and rule-based methods

Large collections of translated texts, so-called parallel corpora, are often aligned automatically at the sentence and word level in order to train machine translation systems. Sometimes syntactic trees are also added to the sentences automatically so that more linguistic information can be extracted. When such trees are present on both sides and the tree nodes are aligned as well, the result is called a parallel treebank. The best translation systems are almost or entirely purely statistical, but in recent years there has been a growing emphasis on integrating more linguistically motivated data, including the use of parallel treebanks. These are, however, only useful at a very large scale, because such a system has much to learn about how one language is typically converted into another. We therefore investigate techniques for aligning tree nodes automatically and accurately. A further motivation is that parallel treebanks are also useful for other applications and are of scientific interest as language resources in their own right. We refer to the whole process of aligning nodes as tree alignment. We find that a combination of statistical and rule-based techniques can produce highly accurate alignments with relatively little training data and few features. Finally, we find that applying alignments that link a relatively large number of nodes, even if some of the links are wrong, to a syntax-based system leads to better machine translation than training the same system on fewer but more accurate alignments.
