Étude sur les représentations continues de mots appliquées à la détection automatique des erreurs de reconnaissance de la parole. (A study of continuous word representations applied to the automatic detection of speech recognition errors)

My thesis concerns a study of continuous word representations applied to the automatic detection of speech recognition errors. Recent advances in the field of speech processing have led to significant improvements in speech recognition performances. However, recognition errors are still unavoidable. This reflects their sensitivity to the variability, e.g. to acoustic conditions, speaker, language style, etc. Our study focuses on the use of a neural approach to improve ASR error detection, using word embeddings. These representations have proven to be a great asset in various natural language processing tasks (NLP). The exploitation of continuous word representations is motivated by the fact that ASR error detection consists on locating the possible linguistic or acoustic incongruities in automatic transcriptions. The aim is therefore to find the appropriate word representation which makes it possible to capture pertinent information in order to be able to detect these anomalies. Our contribution in this thesis concerns several initiatives. First, we start with a preliminary study in which we propose a neural architecture able to integrate different types of features, including word embeddings. Second, we propose a deep study of continuous word representations. This study focuses on the evaluation of different types of linguistic word embeddings and their combination in order to take advantage of their complementarities. On the other hand, it focuses on acoustic embeddings. The proposed approach relies on the use of a convolution neural network to build acoustic signal embeddings, and a deep neural network to build acoustic word embeddings. In addition, we propose two approaches to evaluate the performance of acoustic word embeddings. We also propose to enrich the word representation, in input of the ASR error detection system, by prosodic features in addition to linguistic and acoustic embeddings. Integrating this information into our neural architecture provides a significant improvement in terms of classification error rate reduction in comparison to a conditional random field (CRF) based state-of-the-art approach. Then, we present a study on the analysis of classification errors, with the aim of perceiving the errors that are difficult to detect. Perspectives for improving the performance of our system are also proposed, by modelling the errors at the sentence level. Finally, we exploit the linguistic and acoustic embeddings as well as the information provided by our ASR error detection system in several downstream applications.

[1]  Lukás Burget,et al.  Extensions of recurrent neural network language model , 2011, 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[2]  Bing Liu,et al.  Attention-Based Recurrent Neural Network Models for Joint Intent Detection and Slot Filling , 2016, INTERSPEECH.

[3]  Yoshua Bengio,et al.  Hierarchical Probabilistic Neural Network Language Model , 2005, AISTATS.

[4]  Elia Bruni,et al.  Multimodal Distributional Semantics , 2014, J. Artif. Intell. Res..

[5]  Enrico Tronci 1997 , 1997, Les 25 ans de l’OMC: Une rétrospective en photos.

[6]  Yee Whye Teh,et al.  A fast and simple algorithm for training neural probabilistic language models , 2012, ICML.

[7]  Manaal Faruqui,et al.  Community Evaluation and Exchange of Word Vectors at wordvectors.org , 2014, ACL.

[8]  Tara N. Sainath,et al.  Deep convolutional neural networks for LVCSR , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[9]  Yann LeCun,et al.  Convolutional neural networks applied to house numbers digit classification , 2012, Proceedings of the 21st International Conference on Pattern Recognition (ICPR2012).

[10]  Michael Picheny,et al.  Semantic confidence measurement for spoken dialog systems , 2005, IEEE Transactions on Speech and Audio Processing.

[11]  Jeffrey Pennington,et al.  GloVe: Global Vectors for Word Representation , 2014, EMNLP.

[12]  Geoffrey E. Hinton,et al.  Three new graphical models for statistical language modelling , 2007, ICML '07.

[13]  Olivier Galibert,et al.  The ETAPE corpus for the evaluation of speech-based TV content processing in the French language , 2012, LREC.

[14]  Jonathan G. Fiscus,et al.  Results of the 2006 Spoken Term Detection Evaluation , 2006 .

[15]  Yoshua Bengio,et al.  A Neural Probabilistic Language Model , 2003, J. Mach. Learn. Res..

[16]  Andrew McCallum,et al.  Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data , 2001, ICML.

[17]  Christopher D. Manning,et al.  Better Word Representations with Recursive Neural Networks for Morphology , 2013, CoNLL.

[18]  Daniel Povey,et al.  Minimum Phone Error and I-smoothing for improved discriminative training , 2002, 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[19]  D. Hubel,et al.  Receptive fields, binocular interaction and functional architecture in the cat's visual cortex , 1962, The Journal of physiology.

[20]  Omer Levy,et al.  Linguistic Regularities in Sparse and Explicit Word Representations , 2014, CoNLL.

[21]  Antoine Cornuéjols,et al.  Apprentissage artificiel - Concepts et algorithmes , 2003 .

[22]  Lawrence R. Rabiner,et al.  A tutorial on hidden Markov models and selected applications in speech recognition , 1989, Proc. IEEE.

[23]  Geoffrey E. Hinton,et al.  On the importance of initialization and momentum in deep learning , 2013, ICML.

[24]  François Yvon,et al.  Practical Very Large Scale CRFs , 2010, ACL.

[25]  Elmar Nöth,et al.  Comparison and Combination of Confidence Measures , 2002, TSD.

[26]  Ming Zhou,et al.  Sentiment Embeddings with Applications to Sentiment Analysis , 2016, IEEE Transactions on Knowledge and Data Engineering.

[27]  Geoffrey E. Hinton,et al.  Learning internal representations by error propagation , 1986 .

[28]  Paul Boersma,et al.  Praat, a system for doing phonetics by computer , 2002 .

[29]  Koichi Shinoda,et al.  An efficient error correction interface for speech recognition on mobile touchscreen devices , 2014, 2014 IEEE Spoken Language Technology Workshop (SLT).

[30]  Yun Lei,et al.  ASR error detection using recurrent neural network language model and complementary ASR , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[31]  Andrew Y. Ng,et al.  Parsing Natural Scenes and Natural Language with Recursive Neural Networks , 2011, ICML.

[32]  S. V. N. Vishwanathan,et al.  WordRank: Learning Word Embeddings via Robust Ranking , 2015, EMNLP.

[33]  Mitch Weintraub,et al.  Explicit word error minimization in n-best list rescoring , 1997, EUROSPEECH.

[34]  J. Bullinaria,et al.  Extracting semantic representations from word co-occurrence statistics: A computational study , 2007, Behavior research methods.

[35]  Sanjeev Arora,et al.  A Latent Variable Model Approach to PMI-based Word Embeddings , 2015, TACL.

[36]  Patrick Pantel,et al.  From Frequency to Meaning: Vector Space Models of Semantics , 2010, J. Artif. Intell. Res..

[37]  Hermann Ney,et al.  Confidence measures for large vocabulary continuous speech recognition , 2001, IEEE Trans. Speech Audio Process..

[38]  Julia Hirschberg,et al.  Rescoring Confusion Networks for Keyword Search , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[39]  Paul Deléglise,et al.  Improvements to the LIUM French ASR system based on CMU sphinx: what helps to significantly reduce the word error rate? , 2009, INTERSPEECH.

[40]  Thomas Niesler,et al.  Improvements in accuracy and speed in the HTK broadcast news transcription system , 1999, EUROSPEECH.

[41]  Sophie Rosset,et al.  Semantic annotation of the French media dialog corpus , 2005, INTERSPEECH.

[42]  Navdeep Jaitly,et al.  Towards End-To-End Speech Recognition with Recurrent Neural Networks , 2014, ICML.

[43]  Alfons Juan-Císcar,et al.  Estimating Confidence Measures for Speech Recognition Verification Using a Smoothed Naive Bayes Model , 2003, IbPRIA.

[44]  Jeffrey Dean,et al.  Distributed Representations of Words and Phrases and their Compositionality , 2013, NIPS.

[45]  Julia Hirschberg,et al.  Localized detection of speech recognition errors , 2012, 2012 IEEE Spoken Language Technology Workshop (SLT).

[46]  Hakan Erdogan,et al.  Incremental on-line feature space MLLR adaptation for telephony speech recognition , 2002, INTERSPEECH.

[47]  Paul Deléglise,et al.  Computer-assisted transcription of speech based on confusion network reordering , 2011, 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[48]  Julia Hirschberg,et al.  Exploring Features For Localized Detection of Speech Recognition Errors , 2013, SIGDIAL Conference.

[49]  Delphine Charlet,et al.  On combining confidence measures for improved rejection of incorrect data , 2001, INTERSPEECH.

[50]  Yoshua Bengio,et al.  Gradient-based learning applied to document recognition , 1998, Proc. IEEE.

[51]  Graham Taylor,et al.  G ENERATIVE CLASS-CONDITIONAL DENOISING AU-TOENCODERS , 2014 .

[52]  Yann Le Cun,et al.  A Theoretical Framework for Back-Propagation , 1988 .

[53]  E. Tronci,et al.  1996 , 1997, Affair of the Heart.

[54]  Andrew W. Senior,et al.  Long short-term memory recurrent neural network architectures for large scale acoustic modeling , 2014, INTERSPEECH.

[55]  Tara N. Sainath,et al.  Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups , 2012, IEEE Signal Processing Magazine.

[56]  Geoffrey Zweig,et al.  Linguistic Regularities in Continuous Space Word Representations , 2013, NAACL.

[57]  Dong-Hong Ji,et al.  A topic-enhanced word embedding for Twitter sentiment classification , 2016, Inf. Sci..

[58]  Chin-Hui Lee,et al.  Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains , 1994, IEEE Trans. Speech Audio Process..

[59]  Yoshua Bengio,et al.  Extracting and composing robust features with denoising autoencoders , 2008, ICML '08.

[60]  Honglak Lee,et al.  Unsupervised learning of hierarchical representations with convolutional deep belief networks , 2011, Commun. ACM.

[61]  Georges Linarès,et al.  Combined low level and high level features for out-of-vocabulary word detection , 2009, INTERSPEECH.

[62]  Jason Weston,et al.  A unified architecture for natural language processing: deep neural networks with multitask learning , 2008, ICML '08.

[63]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[64]  Tetsuya Takiguchi,et al.  Word-Error Correction of Continuous Speech Recognition Based on Normalized Relevance Distance , 2015, IJCAI.

[65]  Guillaume Gravier,et al.  The ESTER phase II evaluation campaign for the rich transcription of French broadcast news , 2005, INTERSPEECH.

[66]  Ren-Hua Wang,et al.  A comparative study on various confidence measures in large vocabulary speech recognition , 2004, 2004 International Symposium on Chinese Spoken Language Processing.

[67]  Lakhmi C. Jain,et al.  Recurrent Neural Networks: Design and Applications , 1999 .

[68]  Geoffrey Zweig,et al.  Spoken language understanding using long short-term memory neural networks , 2014, 2014 IEEE Spoken Language Technology Workshop (SLT).

[69]  J. M. Hammersley,et al.  Markov fields on finite graphs and lattices , 1971 .

[70]  Samy Bengio,et al.  Links between perceptrons, MLPs and SVMs , 2004, ICML.

[71]  Yoshua Bengio,et al.  Learning long-term dependencies with gradient descent is difficult , 1994, IEEE Trans. Neural Networks.

[72]  Pascal Vincent,et al.  The Difficulty of Training Deep Architectures and the Effect of Unsupervised Pre-Training , 2009, AISTATS.

[73]  Themos Stafylakis,et al.  I-vector-based speaker adaptation of deep neural networks for French broadcast audio transcription , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[74]  Hang Lei,et al.  An Empirical Study on Sentiment Classification of Chinese Review using Word Embedding , 2015, PACLIC.

[75]  Lukás Burget,et al.  Recurrent neural network based language model , 2010, INTERSPEECH.

[76]  Daniel Jurafsky,et al.  Which words are hard to recognize? Prosodic, lexical, and disfluency factors that increase speech recognition error rates , 2010, Speech Commun..

[77]  G. G. Stokes "J." , 1890, The New Yale Book of Quotations.

[78]  S. Chen,et al.  Speaker, Environment and Channel Change Detection and Clustering via the Bayesian Information Criterion , 1998 .

[79]  Yoshua Bengio,et al.  Word Representations: A Simple and General Method for Semi-Supervised Learning , 2010, ACL.

[80]  Yoshua Bengio,et al.  On the Properties of Neural Machine Translation: Encoder–Decoder Approaches , 2014, SSST@EMNLP.

[81]  Joseph Polifroni,et al.  Recognition confidence scoring and its use in speech understanding systems , 2002, Comput. Speech Lang..

[82]  Quoc V. Le,et al.  Distributed Representations of Sentences and Documents , 2014, ICML.

[83]  Thierry Artières,et al.  Champs de Markov conditionnels pour le traitement de séquences , 2006, EGC.

[84]  M. Picheny,et al.  Comparison of Parametric Representation for Monosyllabic Word Recognition in Continuously Spoken Sentences , 2017 .

[85]  Thomas Hofmann,et al.  Greedy Layer-Wise Training of Deep Networks , 2007 .

[86]  John E. Markel,et al.  Linear Prediction of Speech , 1976, Communication and Cybernetics.

[87]  Ewan Dunbar,et al.  A hybrid dynamic time warping-deep neural network architecture for unsupervised acoustic modeling , 2015, INTERSPEECH.

[88]  Sabine Buchholz,et al.  Introduction to the CoNLL-2000 Shared Task Chunking , 2000, CoNLL/LLL.

[89]  Beatrice Santorini,et al.  Building a Large Annotated Corpus of English: The Penn Treebank , 1993, CL.

[90]  Geoffrey E. Hinton,et al.  Deep Boltzmann Machines , 2009, AISTATS.

[91]  Kevin Gimpel,et al.  Tailoring Continuous Word Representations for Dependency Parsing , 2014, ACL.

[92]  Guillaume Gravier,et al.  Corpus description of the ESTER Evaluation Campaign for the Rich Transcription of French Broadcast News , 2004, LREC.

[93]  Gerald Penn,et al.  Applying Convolutional Neural Networks concepts to hybrid NN-HMM model for speech recognition , 2012, 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[94]  Erich Elsen,et al.  Deep Speech: Scaling up end-to-end speech recognition , 2014, ArXiv.

[95]  E. S. Pearson,et al.  On the Problem of the Most Efficient Tests of Statistical Hypotheses , 1933 .

[96]  Richard Dufour,et al.  Characterizing and detecting spontaneous speech: Application to speaker role recognition , 2014, Speech Commun..

[97]  Olivier Galibert,et al.  The First Official REPERE Evaluation , 2013, SLAM@INTERSPEECH.

[98]  Andreas Stolcke,et al.  Finding consensus in speech recognition: word error minimization and other applications of confusion networks , 2000, Comput. Speech Lang..

[99]  Robert L. Mercer,et al.  Class-Based n-gram Models of Natural Language , 1992, CL.

[100]  Atsunori Ogawa,et al.  ASR error detection and recognition rate estimation using deep bidirectional recurrent neural networks , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[101]  Sarah Flora Samson Juan Exploiting resources from closely-related languages for automatic speech recognition in low-resource languages from Malaysia. (Utilisation de ressources dans une langue proche pour la reconnaissance automatique de la parole pour les langues peu dotées de Malaisie) , 2015 .

[102]  Daniel Povey,et al.  The Kaldi Speech Recognition Toolkit , 2011 .

[103]  Aren Jansen,et al.  Fixed-dimensional acoustic embeddings of variable-length segments in low-resource settings , 2013, 2013 IEEE Workshop on Automatic Speech Recognition and Understanding.

[104]  Holger Schwenk,et al.  Continuous Space Language Models for Statistical Machine Translation , 2006, ACL.

[105]  Michael I. Jordan,et al.  Loopy Belief Propagation for Approximate Inference: An Empirical Study , 1999, UAI.

[106]  Leo Breiman,et al.  Classification and Regression Trees , 1984 .

[107]  Andrew L. Maas,et al.  Word-level Acoustic Modeling with Convolutional Vector Regression , 2012 .

[108]  Geoffrey E. Hinton,et al.  Visualizing Data using t-SNE , 2008 .

[109]  Geoffrey E. Hinton,et al.  Learning representations by back-propagating errors , 1986, Nature.

[110]  Jeffrey Dean,et al.  Efficient Estimation of Word Representations in Vector Space , 2013, ICLR.

[111]  Sergey Ioffe,et al.  Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift , 2015, ICML.

[112]  Karen Livescu,et al.  Deep convolutional acoustic word embeddings using word-pair side information , 2015, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[113]  Fernando Pereira,et al.  Weighted finite-state transducers in speech recognition , 2002, Comput. Speech Lang..

[114]  Frédéric Béchet,et al.  Conceptual decoding from word lattices: application to the spoken dialogue corpus MEDIA , 2006, INTERSPEECH.

[115]  Chong Wang,et al.  Deep Speech 2 : End-to-End Speech Recognition in English and Mandarin , 2015, ICML.

[116]  L. Deng,et al.  Calibration of Confidence Measures in Speech Recognition , 2011, IEEE Transactions on Audio, Speech, and Language Processing.

[117]  Paul Deléglise,et al.  LIUM and CRIM ASR System Combination for the REPERE Evaluation Campaign , 2014, TSD.

[118]  Yann LeCun,et al.  Stereo Matching by Training a Convolutional Neural Network to Compare Image Patches , 2015, J. Mach. Learn. Res..

[119]  Mark Dredze,et al.  Contextual Information Improves OOV Detection in Speech , 2010, NAACL.

[120]  C. Uhrik,et al.  Confidence metrics based on n-gram language model backoff behaviors , 1997, EUROSPEECH.

[121]  Geoffrey Zweig,et al.  Using Recurrent Neural Networks for Slot Filling in Spoken Language Understanding , 2015, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[122]  F ChenStanley,et al.  An Empirical Study of Smoothing Techniques for Language Modeling , 1996, ACL.

[123]  Thomas Schaaf,et al.  Estimating confidence using word lattices , 1997, EUROSPEECH.

[124]  Hermann Ney,et al.  Improved MLLR speaker adaptation using confidence measures for conversational speech recognition , 2000, INTERSPEECH.

[125]  Yee Whye Teh,et al.  A Fast Learning Algorithm for Deep Belief Nets , 2006, Neural Computation.

[126]  Yoshua Bengio,et al.  Random Search for Hyper-Parameter Optimization , 2012, J. Mach. Learn. Res..

[127]  Erik F. Tjong Kim Sang,et al.  Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition , 2003, CoNLL.

[128]  Martine Adda-Decker,et al.  Contributions du traitement automatique de la parole à l’étude des voyelles orales du français , 2008 .

[129]  Geoffrey Zweig,et al.  Sequence-to-sequence neural net models for grapheme-to-phoneme conversion , 2015, INTERSPEECH.

[130]  Jean-Luc Gauvain,et al.  Transcription de la parole conversationnelle , 2004 .

[131]  Paul Deléglise,et al.  Exploring the use of Attention-Based Recurrent Neural Networks For Spoken Language Understanding , 2015, NIPS 2015.

[132]  Jonathan Tompson,et al.  Joint Training of a Convolutional Network and a Graphical Model for Human Pose Estimation , 2014, NIPS.

[133]  Yoshua Bengio,et al.  Investigation of recurrent-neural-network architectures and learning methods for spoken language understanding , 2013, INTERSPEECH.

[134]  Yoshua Bengio,et al.  Multi-Way, Multilingual Neural Machine Translation with a Shared Attention Mechanism , 2016, NAACL.

[135]  Giuseppe Riccardi,et al.  Generative and discriminative algorithms for spoken language understanding , 2007, INTERSPEECH.

[136]  Mohamed Morchid,et al.  Integration of Word and Semantic Features for Theme Identification in Telephone Conversations , 2015, Natural Language Dialog Systems and Intelligent Assistants.

[137]  V. Vapnik Estimation of Dependences Based on Empirical Data , 2006 .

[138]  Thorsten Brants,et al.  Distributed Word Clustering for Large Scale Class-Based Language Modeling in Machine Translation , 2008, ACL.

[139]  Aren Jansen,et al.  Rapid Evaluation of Speech Representations for Spoken Term Discovery , 2011, INTERSPEECH.

[140]  Hermann Ney,et al.  Joint-sequence models for grapheme-to-phoneme conversion , 2008, Speech Commun..

[141]  Kenneth Ward Church,et al.  Word Association Norms, Mutual Information, and Lexicography , 1989, ACL.

[142]  Giuseppe Riccardi,et al.  Acoustic and word lattice based algorithms for confidence scores , 2002, INTERSPEECH.

[143]  Ehud Rivlin,et al.  Placing search in context: the concept revisited , 2002, TOIS.

[144]  Michael I. Jordan,et al.  Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res..

[145]  Yoshua Bengio,et al.  Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation , 2014, EMNLP.

[146]  Daniele Falavigna,et al.  Stacked auto-encoder for ASR error detection and word error rate prediction , 2015, INTERSPEECH.

[147]  Daniel Jurafsky,et al.  Do Multi-Sense Embeddings Improve Natural Language Understanding? , 2015, EMNLP.

[148]  F. Jelinek,et al.  Perplexity—a measure of the difficulty of speech recognition tasks , 1977 .

[149]  Stephen Cox,et al.  High-level approaches to confidence estimation in speech recognition , 2002, IEEE Trans. Speech Audio Process..

[150]  Stephen Clark,et al.  Specializing Word Embeddings for Similarity or Relatedness , 2015, EMNLP.

[151]  Yoshua Bengio,et al.  Neural Machine Translation by Jointly Learning to Align and Translate , 2014, ICLR.

[152]  Frédéric Béchet,et al.  ASR error segment localization for spoken recovery strategy , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[153]  Paul Deléglise,et al.  LIUM ASR System for ETAPE French Evaluation Campaign: Experiments on System Combination Using Open-Source Recognizers , 2013, TSD.

[154]  Mitch Weintraub,et al.  Neural-network based measures of confidence for word recognition , 1997, 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[155]  Yoshua Bengio,et al.  Scaling learning algorithms towards AI , 2007 .

[156]  Jason Weston,et al.  Natural Language Processing (Almost) from Scratch , 2011, J. Mach. Learn. Res..

[157]  William T. Freeman,et al.  Correctness of Belief Propagation in Gaussian Graphical Models of Arbitrary Topology , 1999, Neural Computation.

[158]  Pascal Vincent,et al.  Contractive Auto-Encoders: Explicit Invariance During Feature Extraction , 2011, ICML.

[159]  Jason Weston,et al.  WSABIE: Scaling Up to Large Vocabulary Image Annotation , 2011, IJCAI.

[160]  Geoffrey E. Hinton,et al.  Experiments on Learning by Back Propagation. , 1986 .

[161]  Julia Hirschberg,et al.  Prosodic and other cues to speech recognition failures , 2004, Speech Commun..

[162]  Evgeniy Gabrilovich,et al.  Large-scale learning of word relatedness with constraints , 2012, KDD.

[163]  Omer Levy,et al.  Improving Distributional Similarity with Lessons Learned from Word Embeddings , 2015, TACL.

[164]  Steve J. Young,et al.  State clustering in hidden Markov model-based continuous speech recognition , 1994, Comput. Speech Lang..

[165]  Peter Glöckner,et al.  Why Does Unsupervised Pre-training Help Deep Learning? , 2013 .

[166]  Philip C. Woodland,et al.  Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models , 1995, Comput. Speech Lang..

[167]  Navdeep Jaitly,et al.  Application of Pretrained Deep Neural Networks to Large Vocabulary Speech Recognition , 2012, INTERSPEECH.

[168]  Yann LeCun,et al.  Convolutional networks and applications in vision , 2010, Proceedings of 2010 IEEE International Symposium on Circuits and Systems.

[169]  Frédéric Béchet,et al.  MACAON An NLP Tool Suite for Processing Word Lattices , 2011, ACL.

[170]  Yoav Freund,et al.  Experiments with a New Boosting Algorithm , 1996, ICML.

[171]  Hermann Ney,et al.  A comparison of word graph and n-best list based confidence measures , 1999, EUROSPEECH.

[172]  Rong Zhang,et al.  Word level confidence annotation using combinations of features , 2001, INTERSPEECH.

[173]  Bhiksha Raj,et al.  A boosting approach for confidence scoring , 2001, INTERSPEECH.

[174]  Lawrence D. Jackel,et al.  Handwritten Digit Recognition with a Back-Propagation Network , 1989, NIPS.

[175]  Yang Song,et al.  Learning Fine-Grained Image Similarity with Deep Ranking , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[176]  Geoffrey E. Hinton,et al.  Autoencoders, Minimum Description Length and Helmholtz Free Energy , 1993, NIPS.

[177]  Delphine Charlet,et al.  Automatic error region detection and characterization in LVCSR transcriptions of TV news shows , 2012, 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[178]  Aapo Hyvärinen,et al.  Noise-Contrastive Estimation of Unnormalized Statistical Models, with Applications to Natural Image Statistics , 2012, J. Mach. Learn. Res..

[179]  Isabel Trancoso,et al.  Improving ASR error detection with non-decoder based features , 2010, INTERSPEECH.

[180]  Georg Heigold,et al.  Word embeddings for speech recognition , 2014, INTERSPEECH.

[181]  Guillaume Gravier,et al.  The ester 2 evaluation campaign for the rich transcription of French radio broadcasts , 2009, INTERSPEECH.

[182]  Biing-Hwang Juang,et al.  Fundamentals of speech recognition , 1993, Prentice Hall signal processing series.

[183]  Rob Fergus,et al.  Visualizing and Understanding Convolutional Networks , 2013, ECCV.

[184]  Pascal Vincent,et al.  Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion , 2010, J. Mach. Learn. Res..

[185]  Eduardo Mena,et al.  Web-Based Measure of Semantic Relatedness , 2008, WISE.

[186]  Hsiao-Wuen Hon,et al.  An overview of the SPHINX speech recognition system , 1990, IEEE Trans. Acoust. Speech Signal Process..

[187]  Geoffrey E. Hinton,et al.  Speech recognition with deep recurrent neural networks , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[188]  Paul Lamere,et al.  Sphinx-4: a flexible open source framework for speech recognition , 2004 .