Fuzzy-Clustering-Based Decision Tree Approach for Large Population Speaker Identification

In this paper, we address the problem of large population speaker identification under noisy conditions. Major techniques for speaker identification are based on Mel-Frequency Cepstral Coefficients (MFCC), the Gaussian Mixture Model (GMM), and the Universal Background Model (UBM); we refer to them as MFCC+GMM and MFCC+GMM+UBM. These approaches are known to perform very well for small population identification under low-noise conditions. Under noisy conditions, however, their performance degrades as the population size increases. To mitigate this limitation, we propose a fuzzy-clustering-based decision tree approach. The key idea of our approach is to 1) use a decision tree to hierarchically partition the whole population into small groups and determine which leaf-node speaker group a speaker under test belongs to, and 2) apply MFCC+GMM to the selected speaker group for speaker identification. The advantage of our approach is that we use features that are independent of MFCC to partition speakers into groups and apply MFCC+GMM only to speaker groups at the leaf level. The key challenge in our design is how to achieve a low error probability in decision-tree-based classification. To address this, we adopt fuzzy clustering in constructing the tree for population partitioning, i.e., at each level, a speaker may belong to multiple groups. Such redundancy increases the probability of classifying a speaker under test into a correct group/node on the tree. Another novelty of this paper is that we use pitch and five vocal source features to construct a six-level decision tree. Experimental results demonstrate that our approach outperforms MFCC+GMM and MFCC+GMM+UBM, with higher accuracy and lower complexity, for large population identification under additive white Gaussian noise (AWGN) conditions.
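To illustrate why overlapping groups reduce classification error at a tree node, the following minimal Python sketch partitions speakers at a single level using one scalar feature. It is not the clustering procedure of this paper: the equal-width intervals, the fixed `margin` parameter, the function names `fuzzy_partition` and `select_group`, and the pitch values are all assumptions made for illustration only.

```python
from typing import Dict, List, Set, Tuple

Group = Tuple[float, float, Set[str]]  # (left edge, right edge, member speakers)


def fuzzy_partition(features: Dict[str, float], n_groups: int, margin: float) -> List[Group]:
    """Split speakers into n_groups equal-width intervals of one scalar feature.

    A speaker joins every group whose interval, widened by `margin` on both
    sides, contains the speaker's feature value. With margin=0 this reduces
    to a (nearly) hard partition. Equal-width intervals are an assumption.
    """
    lo = min(features.values())
    hi = max(features.values())
    width = (hi - lo) / n_groups
    groups: List[Group] = []
    for g in range(n_groups):
        left = lo + g * width
        right = left + width
        members = {s for s, v in features.items()
                   if left - margin <= v <= right + margin}
        groups.append((left, right, members))
    return groups


def select_group(groups: List[Group], value: float) -> Set[str]:
    """Return the members of the first group whose un-widened interval contains value."""
    for left, right, members in groups:
        if left <= value <= right:
            return members
    # Value falls outside the training range: clamp to the nearest end group.
    return groups[0][2] if value < groups[0][0] else groups[-1][2]


# Hypothetical mean pitch (Hz) per enrolled speaker.
enrolled = {"a": 100.0, "b": 118.0, "c": 122.0, "d": 145.0, "e": 180.0, "f": 200.0}

fuzzy = fuzzy_partition(enrolled, n_groups=2, margin=10.0)
hard = fuzzy_partition(enrolled, n_groups=2, margin=0.0)

# Noise shifts speaker "d" from 145 Hz to 153 Hz, across the 150 Hz boundary.
noisy_pitch_of_d = 153.0
print("fuzzy candidates:", sorted(select_group(fuzzy, noisy_pitch_of_d)))
print("hard candidates: ", sorted(select_group(hard, noisy_pitch_of_d)))
```

In this toy case, speaker `d` lies near the group boundary and is therefore enrolled in both groups under the fuzzy partition, so the noisy test measurement still selects a group that contains `d`; under the hard partition it does not. In the full scheme, a step of this kind would be repeated at each tree level with a different feature, and MFCC+GMM scoring would then be applied only to the speakers in the selected leaf group.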
