Combination of stochastic understanding and machine translation systems for language portability of dialogue systems

In this paper, several approaches to the language portability of dialogue systems are investigated, with a focus on the spoken language understanding (SLU) component. We show that the use of statistical machine translation (SMT) can greatly reduce the time and cost of porting an existing system from a source language to a target language. Using automatically translated training data, we study phrase-based machine translation as an alternative to conditional random fields for conceptual decoding, in order to compensate for the loss of a precise concept-word alignment. In addition, two ways of increasing SLU robustness to translation errors (smeared training data and translation post-editing) are shown to improve performance when test data are translated and then decoded in the source language. Overall, combining all these approaches reduces the concept error rate even further. Experiments were carried out on the French MEDIA dialogue corpus, a subset of which was manually translated into Italian.
