Phoneme-Based Domain Prediction for Language Model Adaptation

Automatic Speech Recognition (ASR) and Natural Language Understanding (NLU) are the two key components of any voice assistant. ASR converts the input audio signal to text using an acoustic model (AM), a language model (LM), and a decoder; NLU further processes this text for sub-tasks such as domain, intent, and slot prediction. Since the input to NLU is text, any error in the ASR module propagates to the NLU sub-tasks. ASR generally processes speech in short windows, first generating phonemes with the AM and then word lattices with the decoder, dictionary, and LM. Training and maintaining a generic LM that fits the data distribution of multiple domains is difficult, so our proposed architecture uses multiple domain-specific LMs to rescore the word lattice, together with a mechanism for selecting which LMs to use. In this paper, we propose a novel multistage CNN architecture that classifies the domain from a partial phoneme sequence and uses the prediction to select the top-K domain LMs. Our multistage classification model achieves state-of-the-art top-three-domain accuracy on two open datasets: 97.76% on ATIS and 99.57% on Snips.
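The abstract does not give implementation details of the multistage CNN, but the core idea — classify the domain from a (possibly partial) phoneme sequence, then pick the top-K domains whose LMs rescore the lattice — can be sketched with a minimal single-stage 1-D CNN text classifier. Everything below (phoneme inventory size, filter counts, the random weights standing in for trained parameters) is an illustrative assumption, not the paper's actual model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not from the paper).
NUM_PHONEMES = 40      # roughly ARPAbet-sized phoneme inventory
EMB_DIM = 16           # phoneme embedding dimension
NUM_FILTERS = 8        # convolutional feature maps
KERNEL = 3             # filter width over the phoneme sequence
NUM_DOMAINS = 5        # number of candidate domain LMs
TOP_K = 3              # how many domain LMs to select for rescoring

# Randomly initialised parameters stand in for trained weights.
emb = rng.normal(size=(NUM_PHONEMES, EMB_DIM))
conv_w = rng.normal(size=(NUM_FILTERS, KERNEL, EMB_DIM))
fc_w = rng.normal(size=(NUM_FILTERS, NUM_DOMAINS))

def domain_probs(phoneme_ids):
    """Forward pass of a minimal 1-D CNN classifier over a
    (possibly partial) sequence of phoneme IDs."""
    x = emb[phoneme_ids]                                  # (T, EMB_DIM)
    T = len(phoneme_ids)
    # Valid 1-D convolution over time: one scalar per filter per position.
    feats = np.stack([
        [np.sum(x[t:t + KERNEL] * conv_w[f]) for t in range(T - KERNEL + 1)]
        for f in range(NUM_FILTERS)
    ])                                                    # (NUM_FILTERS, T-KERNEL+1)
    pooled = np.maximum(feats, 0.0).max(axis=1)           # ReLU + max-over-time pooling
    logits = pooled @ fc_w                                # (NUM_DOMAINS,)
    e = np.exp(logits - logits.max())
    return e / e.sum()                                    # softmax over domains

def top_k_domains(phoneme_ids, k=TOP_K):
    """Indices of the k most probable domains; in the proposed pipeline,
    only these domains' LMs would be used to rescore the word lattice."""
    p = domain_probs(phoneme_ids)
    return np.argsort(p)[::-1][:k].tolist()

partial_seq = [3, 17, 5, 22, 9, 1]  # toy partial phoneme-ID sequence
print(top_k_domains(partial_seq))
```

The "multistage" aspect of the paper's architecture (refining the prediction as more phonemes arrive) is not modeled here; the sketch only shows the single decision step of mapping a partial phoneme sequence to a top-K domain set.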
