A conjoint analysis framework for evaluating user preferences in machine translation

Despite much research on machine translation (MT) evaluation, surprisingly little work directly measures users’ intuitive or emotional preferences regarding different types of MT errors. Yet the elicitation and modeling of user preferences is an important prerequisite for research on user adaptation and customization of MT engines. In this paper we explore the use of conjoint analysis as a formal quantitative framework for assessing users’ relative preferences among different types of translation errors. We apply our approach to the analysis of MT output from translating public health documents from English into Spanish. Our results indicate that word order errors are clearly the most dispreferred error type, followed by word sense, morphological, and function word errors. The conjoint analysis-based model predicts user preferences more accurately than a baseline model that simply chooses the translation with the fewest errors overall. Additionally, we analyze the effect of using a crowd-sourced respondent population versus a sample of domain experts and observe that the main preference effects are remarkably stable across the two samples.
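As a rough illustration of the kind of choice-based conjoint model the abstract describes, the sketch below fits part-worths for four error types from simulated paired choices. With two alternatives per choice set, the conditional logit likelihood reduces to logistic regression on attribute differences. All names, error counts, and weights here are hypothetical assumptions for illustration, not values from the paper.

```python
import numpy as np

# Hypothetical paired-choice conjoint setup. Attributes are the four error
# types discussed in the abstract; the "true" disutilities below are invented
# for the simulation and only encode the reported ordering (word order worst).
rng = np.random.default_rng(0)
ERROR_TYPES = ["word_order", "word_sense", "morphology", "function_word"]
true_w = np.array([-2.0, -1.2, -0.8, -0.5])  # illustrative, not paper values

# Each choice set shows two candidate translations, A and B, described by
# per-type error counts; with two alternatives, conditional logit depends
# only on the count differences (errors in A) - (errors in B).
n = 2000
diff = rng.integers(0, 3, size=(n, 4)) - rng.integers(0, 3, size=(n, 4))
p_choose_a = 1.0 / (1.0 + np.exp(-diff @ true_w))
y = (rng.random(n) < p_choose_a).astype(float)  # 1 = respondent picked A

# Estimate part-worths by gradient ascent on the logistic log-likelihood;
# the gradient is X^T (y - p), averaged over choice sets.
w = np.zeros(4)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-diff @ w))
    w += 0.1 * diff.T @ (y - p) / n

print(dict(zip(ERROR_TYPES, np.round(w, 2))))
```

Under this simulation, the fitted weights recover the assumed preference ordering, with word order errors receiving the largest negative part-worth; a real study would instead estimate `w` from respondents' actual choices.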
