Random discriminant structure analysis for automatic recognition of connected vowels

The universal structure of speech [1, 2] is invariant to transformations in feature space and thus provides a robust representation for speech recognition. A difficulty in using the structure representation is its high dimensionality: this not only increases computational cost but also makes the representation susceptible to the curse of dimensionality [3, 4]. In this paper, we introduce random discriminant structure analysis (RDSA) to deal with this problem. Based on the observation that structural features are highly correlated and contain large redundancy, RDSA combines random feature selection with discriminant analysis to compute several low-dimensional, discriminative representations from an input structure. An individual classifier is then trained for each representation, and the outputs of all classifiers are integrated for the final classification decision. Experimental results on connected Japanese vowel utterances show that our approach achieves a recognition rate of 98.3% with training data from only 8 speakers, higher than the 97.4% of HMMs trained on the utterances of 4,130 speakers.
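The scheme described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes scikit-learn's `LinearDiscriminantAnalysis` as the discriminant-analysis step, uses the LDA models themselves as the per-representation classifiers, and integrates their outputs by averaging class posteriors. The class name and all parameters (`n_subspaces`, `subspace_dim`) are illustrative, and the paper's actual subspace sizes and combination rule may differ.

```python
# Sketch of random discriminant structure analysis (RDSA), as described:
# 1) randomly select several feature subsets from the high-dimensional
#    structure vector, 2) apply discriminant analysis to each subset to get
#    a low-dimensional discriminative representation, 3) train one classifier
#    per representation, 4) integrate the classifier outputs for the final
#    decision (here: average the class posteriors and take the argmax).
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis


class RDSA:
    def __init__(self, n_subspaces=10, subspace_dim=20, seed=0):
        self.n_subspaces = n_subspaces    # number of random representations
        self.subspace_dim = subspace_dim  # features drawn per representation
        self.rng = np.random.default_rng(seed)
        self.members = []                 # (feature-index subset, fitted LDA)

    def fit(self, X, y):
        n_features = X.shape[1]
        for _ in range(self.n_subspaces):
            # Random feature selection without replacement.
            idx = self.rng.choice(n_features, self.subspace_dim, replace=False)
            lda = LinearDiscriminantAnalysis()
            lda.fit(X[:, idx], y)
            self.members.append((idx, lda))
        return self

    def predict(self, X):
        # Integrate the individual classifiers: average class posteriors.
        probs = np.mean(
            [lda.predict_proba(X[:, idx]) for idx, lda in self.members], axis=0
        )
        classes = self.members[0][1].classes_
        return classes[np.argmax(probs, axis=1)]
```

Because each member sees only a random subset of the correlated structural features, the ensemble reduces dimensionality per classifier while the combined decision still draws on the whole structure.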

[1] H. Ney et al., "Vocal tract normalization equals linear transformation in cepstral space," IEEE Transactions on Speech and Audio Processing, 2001.

[2] L. Deng et al., "Production models as a structural basis for automatic speech recognition," Speech Communication, 1997.

[3] A. K. Jain et al., "Statistical pattern recognition: a review," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2000.

[4] P. Pudil et al., Introduction to Statistical Pattern Recognition, 2006.

[5] S. Furui et al., Digital Speech Processing, Synthesis, and Recognition, 1989.

[7] "Automatic recognition of connected vowels only using speaker-invariant representation of speech dynamics," INTERSPEECH, 2007.

[8] S. Abe, Pattern Classification, Springer London, 2001.

[9] N. Minematsu, "Mathematical evidence of the acoustic universal structure in speech," Proc. IEEE ICASSP, 2005.

[10] K. Shikano et al., "Recent progress of open-source LVCSR engine Julius and Japanese model repository," INTERSPEECH, 2004.

[11] R. P. W. Duin et al., "Bagging, boosting and the random subspace method for linear classifiers," Pattern Analysis & Applications, 2002.

[12] T. K. Ho, "The random subspace method for constructing decision forests," IEEE Transactions on Pattern Analysis and Machine Intelligence, 1998.

[13] K. Hirose et al., "Japanese vowel recognition using external structure of speech," IEEE Workshop on Automatic Speech Recognition and Understanding, 2005.

[14] S. Scott et al., "The neuroanatomical and functional organization of speech perception," Trends in Neurosciences, 2003.

[15] D. G. Stork et al., Pattern Classification, 1973.

[16] C.-H. Lee et al., "Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains," IEEE Transactions on Speech and Audio Processing, 1994.

[17] D. Lin et al., "Recognize high resolution faces: from macrocosm to microcosm," Proc. IEEE CVPR, 2006.

[18] D. Yu et al., "Structured speech modeling," IEEE Transactions on Audio, Speech, and Language Processing, 2006.

[19] N. Minematsu, "Yet another acoustic representation of speech sounds," Proc. IEEE ICASSP, 2004.

[20] P. C. Woodland et al., "Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models," Computer Speech and Language, 1995.