Rank-based frame classification for usable speech detection in speaker identification systems

The performance of a speaker identification (SID) system degrades substantially when the training and testing conditions are mismatched. Discriminating between temporal sections of a speech signal that are speech-like (SID usable) and noise-like (SID unusable), and retaining only the frames labeled SID usable, can improve SID performance substantially. In this paper, a novel labeling system for SID usable and SID unusable frames is presented for a GMM-based SID system. It is motivated by a control experiment demonstrating that very high SID accuracies are theoretically achievable by removing frames that contribute more to the scores of competing speakers than to the score of the true speaker. To identify SID usable and unusable frames blindly, a Mahalanobis distance classifier and a boosted ensemble of decision tree classifiers were trained on a dataset different from the enrollment database of the SID system. The classifier-based techniques improved on the baseline SID system (all frames used) in every case in which the speech signal was corrupted with additive white or additive pink noise.
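
The control experiment described above can be illustrated with a minimal sketch. It assumes that per-frame log-likelihoods under each enrolled speaker's GMM are already available as a matrix, and that a frame counts as SID usable when the true speaker ranks within the top `max_rank` speakers for that frame. The function names, the `max_rank` threshold, and the toy scores are hypothetical and are not taken from the paper, whose exact labeling rule is not given in this abstract.

```python
import numpy as np

def label_usable_frames(frame_loglik, true_speaker, max_rank=1):
    """Oracle labels: True where the true speaker ranks within max_rank.

    frame_loglik: array (n_frames, n_speakers) of per-frame log-likelihoods
    (assumed to come from per-speaker GMMs computed elsewhere).
    """
    true_scores = frame_loglik[:, true_speaker][:, None]
    # Rank 1 means no competing speaker scores higher on this frame.
    rank = 1 + np.sum(frame_loglik > true_scores, axis=1)
    return rank <= max_rank

def identify(frame_loglik, mask=None):
    """Pick the speaker with the highest summed log-likelihood."""
    if mask is not None:
        frame_loglik = frame_loglik[mask]
    return int(np.argmax(frame_loglik.sum(axis=0)))

# Toy scores (made up for illustration): 4 frames, 3 speakers, true speaker 0.
scores = np.array([
    [-1.0, -2.0, -3.0],   # true speaker scores best
    [-5.0, -1.0, -4.0],   # a competing speaker scores best
    [-6.0, -1.0, -2.0],   # a competing speaker scores best
    [-1.5, -2.0, -2.5],   # true speaker scores best
])
usable = label_usable_frames(scores, true_speaker=0)
baseline_decision = identify(scores)          # all frames used
oracle_decision = identify(scores, usable)    # SID usable frames only
```

In this toy case the two middle frames favor a competing speaker strongly enough to flip the all-frames decision, while the decision over the retained frames returns the true speaker. At test time the true speaker is unknown, so such labels cannot be computed directly; that is why the labels have to be predicted by a separately trained classifier.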
