CNN: A speaker recognition system using a cascaded neural network

This paper presents an approach to speaker recognition that combines supervised and unsupervised neural network models. To improve recognition performance, the proposed strategy integrates the two techniques into one global model, called the cascaded model. We first present a simple conventional technique based on the distance between a test vector and a reference vector for each speaker in the population. This distance metric down-weights the components along directions in which the intraspeaker variance is large. We present this method to make clear the difference in performance between the conventional and neural network approaches. We then introduce unsupervised learning, represented by the winner-take-all model, as a means of recognition. Based on the results of several tests, and to improve this model's performance on noisy patterns, we precede it with a supervised learning model, the pattern association model, which acts as a filtering stage. This work covers the design and implementation of both the conventional and the neural network approaches for recognizing the speakers' templates, which are introduced to the system via a voice master card and preprocessed before the features used in recognition are extracted. The results indicate that the neural network system outperforms the conventional one: its performance degrades smoothly on noisy patterns and is higher on noise-free patterns.
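The conventional distance-based technique can be illustrated with a minimal sketch. The abstract does not give the exact formula, so this assumes the simplest form with the stated property: a squared distance in which each component is divided by that speaker's intraspeaker variance (a diagonal approximation; the paper may use a full covariance matrix instead). The function names, speaker labels, and numbers are all hypothetical.

```python
def variance_weighted_distance(test, reference, variances):
    """Squared distance with each component divided by its intraspeaker variance.

    Assumed diagonal form: components with large variance contribute less.
    """
    return sum((t - r) ** 2 / v for t, r, v in zip(test, reference, variances))


def identify(test, references, variances):
    """Return the speaker whose reference vector is closest to the test vector."""
    return min(
        references,
        key=lambda name: variance_weighted_distance(
            test, references[name], variances[name]
        ),
    )


# Hypothetical two-feature templates; the second feature varies a lot
# within a speaker, so it should count for less.
references = {"speaker_a": [1.0, 5.0], "speaker_b": [2.0, 1.0]}
variances = {"speaker_a": [0.1, 4.0], "speaker_b": [0.1, 4.0]}
uniform = {"speaker_a": [1.0, 1.0], "speaker_b": [1.0, 1.0]}

test_vector = [1.1, 2.0]

weighted_choice = identify(test_vector, references, variances)
unweighted_choice = identify(test_vector, references, uniform)
print(weighted_choice, unweighted_choice)
```

With these made-up numbers, the weighted metric selects `speaker_a` because the stable first feature matches closely, whereas plain Euclidean distance (uniform variances) selects `speaker_b`, misled by the high-variance second feature.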
