Bottleneck features based on gammatone frequency cepstral coefficients

Recent work demonstrates the impressive success of the bottleneck (BN) feature in speech recognition, particularly with deep networks plus appropriate pre-training. A widely acknowledged advantage of the BN feature is that the network can learn multiple environmental conditions given abundant training data. For tasks with limited training data, however, such multi-condition training is unavailable, so the networks tend to over-fit and become sensitive to changes in acoustic conditions. A possible solution is to base the BN features on a channel-robust primary feature. In this paper, we propose deriving the BN feature from Gammatone frequency cepstral coefficients (GFCCs). The GFCC feature has shown strong robustness against acoustic change, owing to its modeling of the human auditory system. The idea is to combine the acoustic robustness of the GFCC feature with the representational power of the BN feature, so that the BN feature remains effective under mismatched training/test channels. This is particularly useful for small-scale tasks, where training data are often limited. Experiments are conducted on the WSJCAM0 database, where the test utterances are mixed with noise at various SNR levels to simulate channel change. The results confirm that the GFCC-based BN feature is much more robust than BN features based on MFCC and PLP. Furthermore, the primary GFCC feature and the GFCC-based BN feature can be concatenated, yielding a more robust combined feature that provides considerable performance gains in all tested noise conditions.
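The pipeline described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the gammatone bank is a crude magnitude-domain approximation with ERB-spaced centre frequencies, the dimensions (64 channels, 23 cepstra, 30-dimensional BN layer) are assumed for illustration, and the "bottleneck network" is an untrained stand-in whose narrow hidden layer plays the role of the trained BN extractor.

```python
import numpy as np

SR, N_FFT, N_CHAN, N_CEPS, BN_DIM = 16000, 512, 64, 23, 30  # assumed sizes

def erb_space(low, high, n):
    """ERB-rate spaced centre frequencies (Glasberg & Moore constants)."""
    ear_q, min_bw = 9.26449, 24.7
    return -(ear_q * min_bw) + np.exp(
        np.arange(1, n + 1) / n
        * (np.log(low + ear_q * min_bw) - np.log(high + ear_q * min_bw))
    ) * (high + ear_q * min_bw)

def gammatone_weights(sr=SR, n_fft=N_FFT, n_chan=N_CHAN):
    """Crude magnitude-domain approximation of a 4th-order gammatone bank."""
    freqs = np.linspace(0, sr / 2, n_fft // 2 + 1)
    fc = erb_space(50.0, sr / 2, n_chan)
    erb = 24.7 * (4.37e-3 * fc + 1.0)       # ERB bandwidth per channel
    b = 1.019 * erb                          # gammatone bandwidth factor
    # |H(f)|^2 ~ [1 + ((f - fc)/b)^2]^(-4) for a 4th-order gammatone filter
    w = (1.0 + ((freqs[None, :] - fc[:, None]) / b[:, None]) ** 2) ** -4
    return w / w.sum(axis=1, keepdims=True)

def gfcc(frames):
    """frames: (T, N_FFT) windowed time-domain frames -> (T, N_CEPS) GFCCs."""
    spec = np.abs(np.fft.rfft(frames, n=N_FFT)) ** 2   # power spectrum
    fb = spec @ gammatone_weights().T                   # filterbank energies
    comp = np.cbrt(fb)                                  # cubic-root compression
    k = np.arange(N_CHAN)
    dct = np.cos(np.pi / N_CHAN * (k[None, :] + 0.5)    # DCT-II basis
                 * np.arange(N_CEPS)[:, None])
    return comp @ dct.T                                 # cepstral coefficients

def bottleneck(x, rng=np.random.default_rng(0)):
    """Untrained stand-in for the BN network: the BN feature is the narrow
    hidden layer's activation (a real system trains these weights, e.g.
    with pre-training plus back-propagation on senone targets)."""
    w1 = rng.standard_normal((x.shape[1], 512)) * 0.01
    w2 = rng.standard_normal((512, BN_DIM)) * 0.01
    return np.tanh(np.tanh(x @ w1) @ w2)

frames = np.random.default_rng(1).standard_normal((10, N_FFT))
g = gfcc(frames)
combined = np.hstack([g, bottleneck(g)])   # GFCC + BN concatenation
print(combined.shape)                      # (10, 53): 23 GFCCs + 30 BN dims
```

The concatenation in the last line mirrors the paper's combined feature: the primary GFCC stream supplies channel robustness while the BN stream supplies the learned discriminative representation.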
