Introduction of Semantic Model to Help Speech Recognition

Current Automatic Speech Recognition (ASR) systems mainly take into account acoustic, lexical and local syntactic information; long-term semantic relations are not used. The performance of ASR systems degrades significantly when training and testing conditions differ, for instance because of noise, which makes the acoustic information less reliable. To help a noisy ASR system, we propose to supplement it with a semantic module. This module re-evaluates the N-best list of speech recognition hypotheses and can be seen as a form of adaptation in the context of noise. For words in the processed sentence that may have been poorly recognized, the module selects words that better fit the semantic context of the sentence. To achieve this, we introduce the notions of context part and possibility zones, which measure the similarity between the semantic context of the document and the corresponding hypotheses. The proposed methodology uses two continuous word representations: word2vec and FastText. We conduct experiments on the publicly available TED-LIUM corpus of TED talks mixed with real noise. The proposed method achieves a significant reduction of the word error rate (WER) over the ASR system without semantic information.
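As an illustration of the general idea, the sketch below re-ranks an N-best list by interpolating the ASR score of each hypothesis with a semantic coherence score computed from word embeddings. This is only a minimal sketch: the abstract does not detail how the context part and possibility zones are scored, so an averaged-embedding context vector, a mean cosine similarity, and the interpolation weight `alpha` are stand-in assumptions, and `vectors` is assumed to be a word-to-vector lookup obtained from pretrained word2vec or FastText models.

```python
import numpy as np

def context_vector(words, vectors):
    """Average embedding of the in-vocabulary words forming the semantic context."""
    embs = [vectors[w] for w in words if w in vectors]
    return np.mean(embs, axis=0) if embs else None

def semantic_score(hypothesis, context, vectors):
    """Mean cosine similarity between hypothesis words and the context vector."""
    if context is None:
        return 0.0
    sims = []
    for w in hypothesis.split():
        if w in vectors:
            e = vectors[w]
            sims.append(np.dot(e, context) /
                        (np.linalg.norm(e) * np.linalg.norm(context) + 1e-9))
    return float(np.mean(sims)) if sims else 0.0

def rescore_nbest(nbest, context_words, vectors, alpha=0.7):
    """Re-rank (hypothesis, asr_score) pairs; alpha is an illustrative weight,
    not a value reported in the paper."""
    ctx = context_vector(context_words, vectors)
    rescored = [(hyp, alpha * asr + (1.0 - alpha) * semantic_score(hyp, ctx, vectors))
                for hyp, asr in nbest]
    return sorted(rescored, key=lambda x: x[1], reverse=True)
```

In practice the ASR scores would come from the decoder's N-best output and the context words from the previously recognized portion of the document, so that hypotheses whose words agree with the document-level semantic context are promoted.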
