South-East European Times: A parallel corpus of Balkan languages, Francis Tyers and

This paper describes a series of machine translation experiments with the English-Romanian language pair. The experiments were designed to test the hypothesis that adding syntactically motivated long translation examples to a baseline 3-gram, statistically extracted phrase table improves translation performance as measured by the BLEU score. Several scenarios were tested extensively: 1) simply concatenating the “extra” translation examples to the baseline phrase table; 2) computing and taking into account perplexities for the POS strings associated with the translation examples; 3) taking into account the number of words in each member of a translation example; 4) filtering the “extra” translation examples using a score that estimates the correctness of their lexical alignment. Different combinations of the four scenarios were also tested. The paper also presents a method for extracting syntactically motivated translation examples using the dependency linkage of both the source and the target sentence. To decompose the source and target sentences into fragments, we identified two types of dependency link structures, super-links and chains, and used these structures to set the boundaries of the translation examples.
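As a rough illustration of how scenarios 1, 3 and 4 could be combined, the following minimal Python sketch filters the “extra” translation examples and then adds the survivors to a baseline phrase table. It is not the authors' implementation: the `TranslationExample` class, the in-memory dictionary used as a phrase table, the length-ratio test standing in for scenario 3, and all threshold values are assumptions made for this example. Scenario 2 (POS-string perplexity) is left out because it would require a trained language model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TranslationExample:
    source: str             # source-language fragment
    target: str             # target-language fragment
    alignment_score: float  # hypothetical lexical-alignment score in [0, 1]


def keep_example(ex, min_alignment=0.5, max_length_ratio=2.0):
    """Assumed filter: word-count ratio (scenario 3) and alignment score (scenario 4)."""
    src_len = len(ex.source.split())
    tgt_len = len(ex.target.split())
    if src_len == 0 or tgt_len == 0:
        return False
    ratio = max(src_len, tgt_len) / min(src_len, tgt_len)
    return ex.alignment_score >= min_alignment and ratio <= max_length_ratio


def extend_phrase_table(baseline, extras, extra_score=1.0, **filter_args):
    """Scenario 1: add the surviving extra examples to a copy of the baseline.

    `baseline` maps (source, target) pairs to a score. Entries already
    present in the baseline keep their original score.
    """
    table = dict(baseline)
    for ex in extras:
        if keep_example(ex, **filter_args):
            table.setdefault((ex.source, ex.target), extra_score)
    return table


# Toy data, invented for illustration only.
baseline = {("the house", "casa"): 0.7}
extras = [
    TranslationExample("the big house", "casa mare", 0.9),  # kept
    TranslationExample("a", "b c d e", 0.9),                # length ratio too high
    TranslationExample("x y", "z w", 0.1),                  # alignment score too low
    TranslationExample("the house", "casa", 0.99),          # already in the baseline
]
table = extend_phrase_table(baseline, extras)
```

In a real system the phrase table would carry several feature scores per entry, and the filter thresholds would be tuned on held-out data rather than fixed by hand.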
