Study and Comparison of Rule-Based and Statistical Catalan-Spanish Machine Translation Systems

Machine translation systems can be classified, by core methodology, into rule-based and corpus-based approaches. Since both paradigms have been widely used in recent years, one aim of the research community is to understand how these systems differ in translation quality. To this end, this paper reports a study and comparison of several Catalan-Spanish machine translation systems: two rule-based and two corpus-based (specifically, statistical) systems, all freely available on the web. Translation quality is analysed in two domains: journalistic and medical. The systems are evaluated using standard automatic measures as well as by native human evaluators. In addition to these traditional evaluation procedures, this paper reports a novel linguistic evaluation, which provides information about the errors encountered at the orthographic, morphological, lexical, semantic and syntactic levels. Results show that while rule-based systems perform better at the orthographic and morphological levels, statistical systems tend to commit fewer semantic errors. Furthermore, the results show that all the evaluations performed are correlated to some degree, and that human evaluators tend to be especially critical of semantic and syntactic errors.
