Discriminative feature extraction for speech recognition using continuous output codes

Feature transformation techniques have been widely investigated as a way to reduce feature redundancy and introduce additional discriminative information, with the aim of improving the performance of automatic speech recognition (ASR). In this paper, we propose a novel method for obtaining a discriminative feature transformation based on the output coding technique. The output coding transformation projects the speech features from their original space into a new one in which each dimension captures information that distinguishes between phones. Using polynomial expansion, the short-time spectral features are first expanded to a high-dimensional space, where the generalized linear discriminant sequence kernel is applied to the sequences of input feature vectors. The output coding transformation, formulated via a set of linear SVMs, then projects the sequences of high-dimensional vectors into a tractable low-dimensional feature space, where the resulting features are well-separated continuous output codes for the subsequent multi-class classification problem. Our experimental results on the TIMIT corpus show that the proposed features achieve a 10.5% reduction in ASR error rate over conventional spectral features.
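The pipeline described in the abstract (polynomial expansion, a sequence-level embedding, then projection through a set of linear SVMs) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the toy 2-D "frames", the three synthetic classes, the one-vs-rest code matrix, the Pegasos-style hinge-loss trainer, and all function names. The sequence embedding is simplified to the mean of the expanded frames and omits the correlation-matrix normalization normally used with the generalized linear discriminant sequence kernel.

```python
import itertools
import random


def poly_expand(x, degree=2):
    """Monomials of the entries of x up to `degree`, with a leading constant 1."""
    feats = [1.0]
    for d in range(1, degree + 1):
        for idx in itertools.combinations_with_replacement(range(len(x)), d):
            prod = 1.0
            for i in idx:
                prod *= x[i]
            feats.append(prod)
    return feats


def sequence_embedding(frames, degree=2):
    """Simplified sequence map: average of the expanded frame vectors."""
    expanded = [poly_expand(f, degree) for f in frames]
    return [sum(col) / len(expanded) for col in zip(*expanded)]


def train_linear_svm(X, y, lam=0.01, epochs=50, seed=0):
    """Pegasos-style subgradient descent on the hinge loss; labels in {-1, +1}.

    This trainer is a stand-in chosen for brevity, not the paper's solver.
    """
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    order = list(range(len(X)))
    t = 0
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = 1.0 / (lam * t)
            margin = y[i] * sum(wj * xj for wj, xj in zip(w, X[i]))
            w = [(1.0 - eta * lam) * wj for wj in w]  # regularization shrink
            if margin < 1.0:  # hinge loss is active for this sample
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
    return w


def output_code(svms, z):
    """Continuous output code: one real-valued SVM score per code dimension."""
    return [sum(wj * zj for wj, zj in zip(w, z)) for w in svms]


# Hypothetical toy data: 3 classes, 10 segments each, 5 two-dimensional frames.
rng = random.Random(1)
centers = {0: (0.0, 0.0), 1: (4.0, 0.0), 2: (0.0, 4.0)}
segments, labels = [], []
for label, (cx, cy) in centers.items():
    for _ in range(10):
        frames = [[rng.gauss(cx, 0.3), rng.gauss(cy, 0.3)] for _ in range(5)]
        segments.append(frames)
        labels.append(label)

# Expand each segment, then train one linear SVM per column of the code matrix.
Z = [sequence_embedding(s) for s in segments]
code_matrix = [[1 if c == k else -1 for k in range(3)] for c in range(3)]
svms = [train_linear_svm(Z, [code_matrix[c][k] for c in labels]) for k in range(3)]

# The low-dimensional features are the real-valued SVM outputs.
codes = [output_code(svms, z) for z in Z]
print(len(Z[0]), "->", len(codes[0]))  # dimensions before and after projection
```

In this sketch the code matrix is one-vs-rest, so the projected dimension equals the number of classes. A different code matrix would give a different number of output dimensions, with each column defining one binary problem and therefore one linear SVM.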
