Discriminative Reranking for LVCSR Leveraging Invariant Structure

An invariant structure is a long-span acoustic representation from which acoustic variations caused by non-linguistic factors are effectively removed. In this paper we present a new method that uses invariant structures as features for discriminative reranking in Large Vocabulary Continuous Speech Recognition (LVCSR). First, we use a traditional HMM-based LVCSR system to obtain a list of N-best candidates with phone alignments, and construct an invariant structure for each candidate from its phone alignment. Here, the invariant structure is composed of the lengths between every pair of phonemes in the candidate. We then estimate a score for each phoneme pair in the invariant structure and rerank the N-best candidates using a weighted sum of the phoneme-pair scores, where the weights are trained discriminatively with the averaged perceptron. Experimental results show a relative improvement of 6.69% in character error rate (CER) over the baseline HMM-based LVCSR system.
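The reranking step described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the authors' implementation: each N-best candidate is assumed to be already reduced to a dictionary of phoneme-pair scores, the extra `base` feature (a baseline recognizer score), the feature names such as `a-k`, and all numeric values in `TOY_DATA` are hypothetical, and the oracle index is assumed to mark the candidate with the lowest error in each list.

```python
from collections import defaultdict

def score(weights, feats):
    # Weighted sum of the candidate's feature values (phoneme-pair scores).
    return sum(weights.get(name, 0.0) * value for name, value in feats.items())

def rerank(weights, candidates):
    # Index of the highest-scoring candidate; ties go to the earliest entry.
    return max(range(len(candidates)), key=lambda i: score(weights, candidates[i]))

def train_averaged_perceptron(data, epochs=5):
    """data: list of (candidates, oracle_index) pairs, one per utterance."""
    w = defaultdict(float)      # current weights
    total = defaultdict(float)  # running sum of weights over all steps
    steps = 0
    for _ in range(epochs):
        for candidates, oracle in data:
            pred = rerank(w, candidates)
            if pred != oracle:
                # Standard perceptron update: move toward the oracle
                # candidate and away from the wrongly preferred one.
                for name, value in candidates[oracle].items():
                    w[name] += value
                for name, value in candidates[pred].items():
                    w[name] -= value
            for name, value in w.items():
                total[name] += value
            steps += 1
    # Averaging the weights over all steps reduces overfitting to late updates.
    return {name: total[name] / steps for name in total}

# Hypothetical toy N-best lists: (candidates, index of the oracle candidate).
TOY_DATA = [
    ([{"base": 1.0, "a-k": -0.5, "k-i": -0.2},
      {"base": 0.8, "a-g": 0.9, "g-i": 0.7}], 1),
    ([{"base": 0.9, "a-g": 0.8},
      {"base": 1.0, "a-k": -0.4}], 0),
]

weights = train_averaged_perceptron(TOY_DATA)
best = [rerank(weights, cands) for cands, _ in TOY_DATA]
```

With the trained `weights`, `rerank` selects one candidate per N-best list. In a real system the feature dictionaries would come from the invariant structures built on the phone alignments rather than from hand-written values.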
