Discriminative re-ranking for automatic speech recognition by leveraging invariant structures

Abstract The invariant structure, proposed in Minematsu (2004) and Minematsu et al. (2010), is a long-span feature designed to suppress non-linguistic factors. In contrast to frame-based features such as Mel-Frequency Cepstral Coefficients (MFCCs), an invariant structure is extracted as the set of contrasts between speech events in a given utterance. Because it is not a time series of short-term features, it is difficult to use directly in the general framework of Automatic Speech Recognition (ASR), even though its robustness against non-linguistic factors is desirable for ASR. To introduce the invariant structure into ASR effectively, we develop a method that leverages it in a discriminative re-ranking paradigm. In this paradigm, a baseline ASR system generates N-best lists with hypothesized phoneme-level alignments, so that one invariant structure can be extracted for each hypothesis. We also propose methods to convert an extracted invariant structure into a fixed-dimensional feature vector for use in discriminative re-ranking. Experimental results on three tasks (continuous digit recognition, digit recognition in noisy environments, and large vocabulary continuous speech recognition) showed significant error reductions and improved robustness against noisy environments.
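The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and several details are assumptions made only for illustration: each aligned phoneme segment is modeled as a diagonal-covariance Gaussian, the contrast between two segments is measured with the Bhattacharyya distance, the fixed-dimensional vector is built by averaging contrasts per phoneme-pair slot over a fixed inventory, and hypotheses are rescored with a linear model. The names `bhattacharyya`, `structure_features`, `rerank`, `segments`, and `baseline_score` are hypothetical.

```python
import math
from itertools import combinations


def bhattacharyya(g1, g2):
    """Bhattacharyya distance between two diagonal-covariance Gaussians.

    Each Gaussian is a (means, variances) pair of equal-length lists.
    """
    (m1, v1), (m2, v2) = g1, g2
    dist = 0.0
    for a, b, s, t in zip(m1, m2, v1, v2):
        avg = (s + t) / 2.0
        dist += (a - b) ** 2 / (8.0 * avg) + 0.5 * math.log(avg / math.sqrt(s * t))
    return dist


def structure_features(segments, inventory):
    """Map one hypothesis to a fixed-length vector (assumed scheme).

    segments: list of (phoneme_label, gaussian) taken from the hypothesized
    alignment; every label must belong to `inventory`.
    The vector has one slot per unordered phoneme pair in the inventory and
    holds the mean contrast observed for that pair (0.0 if never observed).
    """
    pairs = list(combinations(sorted(inventory), 2))
    sums = {p: 0.0 for p in pairs}
    counts = {p: 0 for p in pairs}
    for (la, ga), (lb, gb) in combinations(segments, 2):
        if la == lb:
            continue  # same phoneme twice: no slot for this pair
        key = tuple(sorted((la, lb)))
        sums[key] += bhattacharyya(ga, gb)
        counts[key] += 1
    return [sums[p] / counts[p] if counts[p] else 0.0 for p in pairs]


def rerank(nbest, weights, inventory):
    """Return the hypothesis with the highest combined score.

    Combined score = baseline score + dot(weights, structure features).
    The weights would normally be learned discriminatively; here they
    are simply passed in.
    """
    def score(hyp):
        phi = structure_features(hyp["segments"], inventory)
        return hyp["baseline_score"] + sum(w * x for w, x in zip(weights, phi))

    return max(nbest, key=score)
```

Because every hypothesis is mapped to a vector of the same length regardless of how many phonemes it contains, a single weight vector can score all entries of an N-best list. The Bhattacharyya distance is unchanged when the same invertible affine transform is applied to both distributions, which is why it is a plausible choice for a contrast measure in this sketch.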

[1] Nobuaki Minematsu, et al. Improved and robust prediction of pronunciation distance for individual-basis clustering of World Englishes pronunciation, 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[2] Keikichi Hirose, et al. Integration of multilayer regression analysis with structure-based pronunciation assessment, 2010, INTERSPEECH.

[3] Nobuaki Minematsu, et al. A Study on Invariance of $f$-Divergence and Its Application to Speech Recognition, 2010, IEEE Transactions on Signal Processing.

[4] Takaaki Hori, et al. Efficient Discriminative Training of Error Corrective Models Using High-WER Competitors, 2008.

[5] Brian Roark, et al. Discriminative n-gram language modeling, 2007, Comput. Speech Lang.

[6] Nobuaki Minematsu, et al. Speech Structure and Its Application to Robust Speech Processing, 2009, New Generation Computing.

[7] Hakan Erdogan, et al. Incremental on-line feature space MLLR adaptation for telephony speech recognition, 2002, INTERSPEECH.

[8] José L. Pérez-Córdoba, et al. Histogram equalization of speech representation for robust speech recognition, 2005, IEEE Transactions on Speech and Audio Processing.

[9] Nobuaki Minematsu, et al. Affine invariant features and their application to speech recognition, 2009, 2009 IEEE International Conference on Acoustics, Speech and Signal Processing.

[10] Keikichi Hirose, et al. Multi-stream parameterization for structural speech recognition, 2008, 2008 IEEE International Conference on Acoustics, Speech and Signal Processing.

[11] Stanley F. Chen, et al. An Empirical Study of Smoothing Techniques for Language Modeling, 1996, ACL.

[12] Anoop Sarkar, et al. Discriminative Reranking for Machine Translation, 2004, NAACL.

[13] Keikichi Hirose, et al. On invariant structural representation for speech recognition: theoretical validation and experimental improvement, 2009, INTERSPEECH.

[14] Nobuaki Minematsu. Yet another acoustic representation of speech sounds, 2004, 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[15] George Saon, et al. Feature space Gaussianization, 2004, 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[16] Keikichi Hirose, et al. Sub-structure-based estimation of pronunciation proficiency and classification of learners, 2009, 2009 IEEE Workshop on Automatic Speech Recognition & Understanding.

[17] Alfred Mertins, et al. Contextual invariant-integration features for improved speaker-independent speech recognition, 2011, Speech Commun.

[18] David G. Lowe, et al. Object recognition from local scale-invariant features, 1999, Proceedings of the Seventh IEEE International Conference on Computer Vision.

[19] Nobuaki Minematsu, et al. A study on Hidden Structural Model and its application to labeling sequences, 2009, 2009 IEEE Workshop on Automatic Speech Recognition & Understanding.

[20] Nobuaki Minematsu, et al. Continuous Digits Recognition Leveraging Invariant Structure, 2011, INTERSPEECH.

[21] A. Mertins, et al. Vocal tract length invariant features for automatic speech recognition, 2005, IEEE Workshop on Automatic Speech Recognition and Understanding, 2005.

[22] Izhak Shafran, et al. Discriminatively estimated joint acoustic, duration, and language model for speech recognition, 2010, 2010 IEEE International Conference on Acoustics, Speech and Signal Processing.

[23] Li Deng, et al. HMM adaptation using vector taylor series for noisy speech recognition, 2000, INTERSPEECH.

[24] Dong Yu, et al. Conversational Speech Transcription Using Context-Dependent Deep Neural Networks, 2012, ICML.

[25] Eric Moulines, et al. Continuous probabilistic transform for voice conversion, 1998, IEEE Trans. Speech Audio Process.

[26] Brian Kingsbury, et al. The IBM Attila speech recognition toolkit, 2010, 2010 IEEE Spoken Language Technology Workshop.

[27] Michael Collins, et al. Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms, 2002, EMNLP.

[28] Mark J. F. Gales, et al. Maximum likelihood linear transformations for HMM-based speech recognition, 1998, Comput. Speech Lang.

[29] Michael Collins, et al. Discriminative Reranking for Natural Language Parsing, 2000, CL.

[30] Nobuaki Minematsu, et al. Discriminative Reranking for LVCSR Leveraging Invariant Structure, 2012, INTERSPEECH.

[31] Keikichi Hirose, et al. Optimal event search using a structural cost function - improvement of structure to speech conversion, 2009, INTERSPEECH.

[32] B. Atal. Effectiveness of linear prediction characteristics of the speech wave for automatic speaker identification and verification, 1974, The Journal of the Acoustical Society of America.

[33] Mark Gales, et al. Structured Discriminative Models For Speech Recognition: An Overview, 2012, IEEE Signal Processing Magazine.

[34] Nobuaki Minematsu, et al. Random discriminant structure analysis for automatic recognition of connected vowels, 2007, 2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU).