Multi-stream spectro-temporal and cepstral features based on data-driven hierarchical phoneme clusters

We propose a method to enhance multi-stream Gabor and MFCC features using data-driven hierarchical phoneme clusters, yielding more discriminative posteriors. We consider different hierarchy structures and additionally apply mean and variance normalization. In experiments on Mandarin broadcast news, the method achieved a relative improvement of 11.5% over the conventional MFCC Tandem system. We analyze the complementarity between Gabor and MFCC features for different types of phonemes and investigate the benefits of using hierarchical phoneme clusters.
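
As an illustration of the mean and variance normalization step, the following is a minimal sketch, not the authors' implementation. The abstract does not state the scope of the normalization, so the sketch assumes statistics are computed per feature dimension over the frames of a single utterance. The function name `mean_variance_normalize`, the `eps` guard, and the toy data are hypothetical.

```python
from statistics import fmean, pstdev


def mean_variance_normalize(frames, eps=1e-8):
    """Scale each feature dimension to zero mean and unit variance.

    frames: list of equal-length feature vectors, one per time frame.
    Assumption: statistics are taken over all frames passed in
    (e.g. one utterance); the paper's actual scope is not specified.
    """
    if not frames:
        return []
    # Transpose so each tuple holds one feature dimension across all frames.
    dims = list(zip(*frames))
    means = [fmean(d) for d in dims]
    # Population standard deviation; eps guards against constant dimensions.
    stds = [max(pstdev(d), eps) for d in dims]
    return [
        [(x - m) / s for x, m, s in zip(frame, means, stds)]
        for frame in frames
    ]


# Toy example: three frames of a hypothetical 2-dimensional feature stream.
frames = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
normalized = mean_variance_normalize(frames)
```

After normalization, both dimensions of the toy stream have the same scale even though their raw ranges differ by a factor of ten, which is the usual motivation for normalizing streams before they are combined.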
