Combined waveform-cepstral representation for robust speech recognition

High-dimensional acoustic waveform representations are studied as a front-end for noise-robust automatic speech recognition using generative methods, in particular Gaussian mixture models and hidden Markov models. The proposed representations are compared with standard cepstral features on phoneme classification and recognition tasks. While cepstral features achieve lower error rates at very low noise levels, the acoustic waveform representations are considerably more robust to noise. A convex combination of the acoustic waveform and cepstral representations is then considered and is shown to achieve higher accuracy than either representation alone across all noise levels.
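
The abstract does not spell out how the convex combination is formed; with generative models such as GMMs or HMMs, one common scheme is a convex combination of the per-class log-likelihoods produced by the two front-ends. The sketch below assumes that scheme purely for illustration; the names `lam`, `logp_wave`, and `logp_cep` are placeholders, not identifiers from the paper.

```python
import numpy as np

def combined_log_likelihood(logp_wave, logp_cep, lam):
    """Convex combination of class-conditional log-likelihoods.

    logp_wave, logp_cep: arrays of shape (n_classes,) holding the
        log-likelihoods of one utterance under the waveform-domain and
        cepstral-domain generative models, respectively.
    lam: mixing weight in [0, 1]; lam = 1 recovers the waveform model,
        lam = 0 the cepstral model.
    """
    return lam * np.asarray(logp_wave) + (1.0 - lam) * np.asarray(logp_cep)

def classify(logp_wave, logp_cep, lam):
    # Pick the class with the highest combined score.
    return int(np.argmax(combined_log_likelihood(logp_wave, logp_cep, lam)))
```

In such a scheme the mixing weight would typically be tuned on held-out data, e.g. separately for each noise level or once across all levels.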
