Modelling Semantic Context of OOV Words in Large Vocabulary Continuous Speech Recognition

The diachronic nature of broadcast news data leads to the problem of out-of-vocabulary (OOV) words in large vocabulary continuous speech recognition (LVCSR) systems. Analysis of OOV words reveals that a majority of them are proper names (PNs), which are important for automatic indexing of audio–video content and for obtaining reliable automatic transcriptions. In this paper, we focus on the problem of OOV PNs in diachronic audio documents. To recover the PNs missed by the LVCSR system, relevant OOV PNs are retrieved by exploiting the semantic context of the LVCSR transcriptions. For retrieval of OOV PNs, we explore topic and semantic context derived from latent Dirichlet allocation (LDA) topic models, continuous word vector representations, and the neural bag-of-words (NBOW) model, which is capable of learning task-specific word and context representations. We propose a neural bag-of-weighted-words (NBOW2) model, which learns to assign higher weights to words that are important for retrieval of an OOV PN. With experiments on French broadcast news videos, we show that the NBOW and NBOW2 models outperform the methods based on raw embeddings from LDA and Skip-gram models. Combining the NBOW and NBOW2 models gives faster convergence during training. Second-pass speech recognition experiments, in which the LVCSR vocabulary and language model are updated with the retrieved OOV PNs, demonstrate the effectiveness of the proposed context models.
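The difference between the NBOW and NBOW2 context representations can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract only states that NBOW2 learns per-word importance weights, so the specific weighting used here (a sigmoid of the dot product between each word vector and a learned `anchor` vector), the normalisation by document length, the softmax output layer, and all names and sizes are assumptions made for illustration. Parameters are random rather than trained.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    x = x - np.max(x)
    e = np.exp(x)
    return e / e.sum()

def nbow_context(word_ids, embeddings):
    """NBOW: the context vector is the plain average of the word vectors."""
    return embeddings[word_ids].mean(axis=0)

def nbow2_context(word_ids, embeddings, anchor):
    """NBOW2-style context: a weighted average of the word vectors.

    Assumption: each word's weight is sigmoid(v_w . anchor), where
    `anchor` is a hypothetical vector learned jointly with the embeddings.
    """
    vectors = embeddings[word_ids]
    weights = 1.0 / (1.0 + np.exp(-(vectors @ anchor)))
    context = (weights[:, None] * vectors).sum(axis=0) / len(word_ids)
    return context, weights

def rank_oov_pns(context, output_weights, output_bias):
    """Score every candidate OOV PN and return indices sorted best-first."""
    scores = softmax(output_weights @ context + output_bias)
    return np.argsort(-scores), scores

# Toy setup with random (untrained) parameters; all sizes are arbitrary.
rng = np.random.default_rng(0)
vocab_size, dim, num_oov_pns = 50, 8, 5
embeddings = rng.normal(size=(vocab_size, dim))
anchor = rng.normal(size=dim)
output_weights = rng.normal(size=(num_oov_pns, dim))
output_bias = np.zeros(num_oov_pns)

# A "document" is the bag of in-vocabulary word ids from an LVCSR transcript.
document = np.array([3, 17, 17, 42, 8])
context, weights = nbow2_context(document, embeddings, anchor)
ranking, scores = rank_oov_pns(context, output_weights, output_bias)
```

In this sketch, setting `anchor` to the zero vector gives every word a weight of 0.5, so the NBOW2 context reduces to a scaled NBOW average; training the anchor is what would let some words contribute more than others.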
