Simulating ASR errors for training SLU systems

This paper presents an approach to simulating automatic speech recognition (ASR) errors from manual transcriptions and describes how it can be used to improve the performance of spoken language understanding (SLU) systems. In particular, we point out that this noising process is very useful for making an SLU system more robust to ASR errors when training data is insufficient, or when ASR transcriptions are not available at all during training of the SLU model. The proposed method uses both acoustic and linguistic word embeddings to define a similarity measure between words dedicated to predicting ASR confusions. We assume that words that are acoustically and linguistically close are the ones an ASR system tends to confuse. By using this similarity measure to randomly substitute correct words with potentially confusable ones in the manual annotations used to train CRF- or neural-based SLU systems, we augment the training corpus with the resulting noisy data. Experiments were carried out on the French MEDIA corpus, which focuses on hotel reservation. They show that this approach significantly improves SLU performance, with a relative reduction of 21.2% in concept/value error rate (CVER), particularly when the SLU system is based on a neural approach (22.4% relative reduction in CVER). A comparison with a naive noising approach shows that the proposed method is particularly relevant.
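The noising process described above can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the embedding tables, the interpolation weight `alpha`, the substitution probability `p_sub`, and the `top_k` neighbour cutoff are all illustrative assumptions.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def confusion_similarity(w1, w2, acoustic, linguistic, alpha=0.5):
    """Interpolate acoustic and linguistic embedding similarities.
    `alpha` is a hypothetical interpolation weight, not taken from the paper."""
    return (alpha * cosine(acoustic[w1], acoustic[w2])
            + (1 - alpha) * cosine(linguistic[w1], linguistic[w2]))

def noise_transcription(words, acoustic, linguistic, rng, p_sub=0.1, top_k=5):
    """With probability `p_sub`, replace each in-vocabulary word by one of
    its `top_k` most confusable neighbours; other words pass through."""
    noisy = []
    for w in words:
        if w in acoustic and rng.random() < p_sub:
            neighbours = sorted(
                (v for v in acoustic if v != w),
                key=lambda v: confusion_similarity(w, v, acoustic, linguistic),
                reverse=True,
            )[:top_k]
            noisy.append(rng.choice(neighbours))
        else:
            noisy.append(w)
    return noisy
```

Applied to every manual transcription in the training set, this yields additional noisy copies of the corpus that expose the SLU model to ASR-like confusions it would otherwise only meet at test time.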
