Feature extraction with a multiscale modulation analysis for robust automatic speech recognition

We present a feature extraction method for automatic speech recognition that is robust to variations in vocal tract length. The method builds on the principle of invariant integration and uses a modulation filtering approach similar to the recently proposed scattering transform. In particular, we show how this transform can be used to obtain features that are robust to vocal tract length variations. Phoneme recognition experiments show clearly increased robustness when the average vocal tract lengths are mismatched.
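The principle of invariant integration can be illustrated with a minimal sketch. It is not the method of this paper: it assumes, for illustration only, that a change in vocal tract length acts as a cyclic translation of a frame of subband energies along the frequency axis, and it averages a monomial of the frame over all such translations. The names `invariant_integration_feature`, `cyclic_shift`, `offsets`, and `exponents` are hypothetical.

```python
def cyclic_shift(frame, k):
    """Return a copy of frame cyclically shifted by k positions."""
    k %= len(frame)
    return frame[-k:] + frame[:-k]


def invariant_integration_feature(frame, offsets, exponents):
    """Average a monomial of the frame over all cyclic translations.

    For each translation, the monomial is the product of
    frame[shift + offset] ** exponent over the given (offset, exponent)
    pairs. Averaging over every translation makes the result invariant
    to cyclic shifts of the input (assumed transformation group).
    """
    n = len(frame)
    total = 0.0
    for shift in range(n):
        term = 1.0
        for offset, exponent in zip(offsets, exponents):
            term *= frame[(shift + offset) % n] ** exponent
        total += term
    return total / n


# Hypothetical subband energies of one frame and a translated copy.
frame = [0.2, 1.0, 0.5, 0.1, 0.7, 0.3]
shifted = cyclic_shift(frame, 2)

original_feature = invariant_integration_feature(frame, [0, 2], [1, 2])
shifted_feature = invariant_integration_feature(shifted, [0, 2], [1, 2])
print(abs(original_feature - shifted_feature) < 1e-12)
```

Under the stated assumption, both calls return the same value up to floating-point rounding, which is the sense in which such a feature is invariant. Real vocal tract length changes are only approximately translations, and only on a suitably warped frequency scale, so a practical system would not use plain cyclic shifts.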
