Gammatone Features and Feature Combination for Large Vocabulary Speech Recognition

This work introduces an acoustic feature set based on a gammatone filterbank for large vocabulary speech recognition. The gammatone features lead to competitive results on the EPPS English task, and considerable improvements were obtained by subsequently combining them with a number of standard acoustic features, i.e. MFCC, PLP, MF-PLP, and VTLN plus voicedness. The best results were obtained when combining the gammatone features with all other features using weighted ROVER, resulting in a relative improvement of about 12% in word error rate compared to the best single-feature system. We also found that ROVER gives better results for feature combination than both log-linear model combination and LDA.
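For readers unfamiliar with the term, a gammatone filter is usually described by the impulse response g(t) = t^(n-1) exp(-2πbt) cos(2πf_c t). The following is a minimal sketch of one such filter, not the configuration used in this work: the abstract does not state the filter order, bandwidths, or channel spacing, so the fourth-order default, the Glasberg and Moore ERB approximation, the 1.019 bandwidth factor, and the function names `erb_hz` and `gammatone_impulse_response` are all assumptions chosen for illustration.

```python
import math


def erb_hz(center_hz):
    """Equivalent rectangular bandwidth in Hz (Glasberg and Moore approximation).

    Assumption: the abstract does not say which bandwidth rule is used.
    """
    return 24.7 * (4.37 * center_hz / 1000.0 + 1.0)


def gammatone_impulse_response(center_hz, sample_rate_hz, duration_s=0.025, order=4):
    """Sampled gammatone impulse response, normalized to a peak magnitude of 1.

    g(t) = t^(order-1) * exp(-2*pi*b*t) * cos(2*pi*center_hz*t)
    with b = 1.019 * ERB(center_hz). The order and bandwidth factor are
    common textbook choices, assumed here rather than taken from the paper.
    """
    b = 1.019 * erb_hz(center_hz)
    n_samples = int(round(duration_s * sample_rate_hz))
    response = []
    for k in range(n_samples):
        t = k / sample_rate_hz
        # Gamma-shaped envelope multiplied by a tone at the center frequency.
        envelope = t ** (order - 1) * math.exp(-2.0 * math.pi * b * t)
        response.append(envelope * math.cos(2.0 * math.pi * center_hz * t))
    peak = max(abs(x) for x in response)
    return [x / peak for x in response]


if __name__ == "__main__":
    h = gammatone_impulse_response(center_hz=1000.0, sample_rate_hz=16000.0)
    print(len(h), "samples; ERB at 1 kHz =", round(erb_hz(1000.0), 3), "Hz")
```

A filterbank would repeat this for a set of center frequencies and convolve the speech signal with each response. How the channel outputs are then turned into recognition features is not specified in the abstract and is left out of the sketch.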
