On the use of ideal binary masks for improving phonetic classification

An ideal binary mask is a binary time-frequency pattern that labels each unit of a noisy speech signal as target-dominant or noise-dominant, thereby encoding the masking characteristics of speech in noise. Recent evidence from speech perception suggests that such binary patterns provide sufficient information for human speech recognition. Motivated by these findings, we propose to use ideal binary masks to improve phonetic modeling. We show that, by combining the outputs of classifiers trained on traditional MFCC features with those of classifiers trained on this novel speech pattern, statistically significant improvements over the baseline MFCC-based classifier can be achieved on the task of phonetic classification. Using the combined classifiers, with multilayer perceptrons as the underlying classifier, we achieve an error rate of 19.5% on the TIMIT phonetic classification task.
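The standard construction of an ideal binary mask marks a time-frequency unit as 1 when its local signal-to-noise ratio exceeds a local criterion (LC, in dB) and 0 otherwise. A minimal sketch of that construction, assuming premixed target and noise power spectra are available (function and parameter names here are illustrative, not from the paper):

```python
import numpy as np

def ideal_binary_mask(target_power, noise_power, lc_db=0.0):
    """Label each time-frequency unit as target-dominant (1) or
    noise-dominant (0) by comparing the local SNR to a criterion lc_db."""
    eps = 1e-12  # avoid log of zero in silent units
    snr_db = 10.0 * np.log10((target_power + eps) / (noise_power + eps))
    return (snr_db > lc_db).astype(np.uint8)

# Toy 2x3 time-frequency grid: mask is 1 only where target power
# exceeds noise power (LC = 0 dB).
target = np.array([[4.0, 1.0, 0.5],
                   [0.1, 9.0, 2.0]])
noise = np.ones_like(target)
mask = ideal_binary_mask(target, noise)
```

Because the mask requires the clean target and noise separately, it is "ideal": it serves as a ground-truth pattern for training and evaluation rather than something computable from the mixture alone.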
