Structured discriminative models using deep neural-network features

State-of-the-art speech recognisers employ neural networks in various configurations. A standard (hybrid) speech recogniser computes the likelihood for one time frame and state using only one out of thousands of possible neural-network outputs. However, the whole output vector carries information. In this paper, features from state-of-the-art speech recognisers are collected per phone in a particular context and used as input to a discriminative log-linear model. The log-linear model is trained with either conditional maximum likelihood or a large-margin criterion. A key element is the prior on the parameters of the log-linear model: the mean of the prior is set to the parameter values at which the log-linear model attains the performance of the original systems. The log-linear model then yields a further improvement over the state-of-the-art performance of the individual systems.
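The role of the prior can be illustrated with a minimal sketch. It is not the paper's implementation, and everything in it is an assumption made for illustration: each segment is represented by a single vector of per-class neural-network log-posteriors, the log-linear model has one weight row per class, and the prior is an isotropic Gaussian with precision `tau`. Under these assumptions, setting the prior mean to the identity matrix makes the model's class scores equal to the original network outputs, so training starts from the baseline's decisions. The function names (`prior_mean`, `objective`, `gradient_step`) are hypothetical.

```python
import math

def log_softmax(scores):
    """Numerically stable log-softmax over a list of scores."""
    m = max(scores)
    lse = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - lse for s in scores]

def class_scores(weights, feats):
    """Log-linear score of each class: dot product of its weight row with the features."""
    return [sum(w * f for w, f in zip(row, feats)) for row in weights]

def prior_mean(num_classes):
    """Assumed prior mean: identity weights, so class i is scored by feature i only,
    which reproduces the scores of the original (hybrid-style) system."""
    return [[1.0 if i == j else 0.0 for j in range(num_classes)]
            for i in range(num_classes)]

def objective(weights, data, mean, tau):
    """Negative conditional log-likelihood plus a Gaussian prior penalty
    centred on `mean` (tau is the assumed prior precision)."""
    nll = 0.0
    for feats, label in data:
        nll -= log_softmax(class_scores(weights, feats))[label]
    penalty = 0.5 * tau * sum((w - m) ** 2
                              for w_row, m_row in zip(weights, mean)
                              for w, m in zip(w_row, m_row))
    return nll + penalty

def gradient_step(weights, data, mean, tau, lr):
    """One gradient-descent step on `objective`; returns new weights."""
    k, d = len(weights), len(weights[0])
    # Gradient of the prior penalty pulls the weights back towards the mean.
    grad = [[tau * (weights[i][j] - mean[i][j]) for j in range(d)] for i in range(k)]
    for feats, label in data:
        logp = log_softmax(class_scores(weights, feats))
        for i in range(k):
            # Model posterior minus the 0/1 indicator of the reference label.
            coeff = math.exp(logp[i]) - (1.0 if i == label else 0.0)
            for j in range(d):
                grad[i][j] += coeff * feats[j]
    return [[weights[i][j] - lr * grad[i][j] for j in range(d)] for i in range(k)]

# Toy data (made up): per-segment log-posterior vectors with reference labels.
data = [([-0.2, -1.8, -3.0], 0),
        ([-1.5, -0.4, -2.2], 1),
        ([-0.9, -1.1, -1.4], 2)]
mean = prior_mean(3)
weights = gradient_step(mean, data, mean, tau=1.0, lr=0.01)
```

Starting from the prior mean, the first update can only move the weights in a direction that lowers the regularised objective, while the penalty term keeps them close to the baseline configuration. A large-margin criterion would replace the log-likelihood term but could keep the same prior.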
