In-set/out-of-set speaker identification based on discriminative speech frame selection

In this paper, we propose a novel discriminative speech frame selection (DSFS) scheme for in-set/out-of-set speaker identification. The scheme seeks to decrease the similarity between the speaker models and the background model (or antispeaker model) and thereby increase the accuracy of speaker identification. DSFS consists of two steps: speech frame analysis and discriminative frame selection. Two methods are used to perform DSFS: (i) a Teager Energy Operator (TEO) energy based method and (ii) a MELP pitch based method. An evaluation on both clean and noisy corpora, covering single and multiple recording sessions, shows that both the TEO energy based and the MELP pitch based DSFS schemes reduce the equal error rate (EER) dramatically relative to a traditional GMM-UBM baseline system. Compared with traditional GMM speaker identification, DSFS selects only discriminative speech frames and therefore considers only discriminative features. This selection decreases the overlap between the speaker models and the background model and improves the performance of in-set/out-of-set speaker identification.
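As an illustration of the two DSFS steps (frame analysis, then frame selection), the following minimal sketch computes a per-frame Teager energy and keeps the highest-energy frames. Only the discrete Teager Energy Operator itself, psi[n] = x[n]^2 - x[n-1] x[n+1], is standard; the function names, the mean-absolute-energy frame score, and the `keep_ratio` ranking rule are assumptions made for this example and are not the selection criterion described in the paper.

```python
def teager_energy(x):
    """Discrete Teager Energy Operator: psi[n] = x[n]^2 - x[n-1] * x[n+1].

    Returns len(x) - 2 values (the two boundary samples are skipped).
    """
    return [x[n] * x[n] - x[n - 1] * x[n + 1] for n in range(1, len(x) - 1)]


def frame_teo_energy(signal, frame_len, hop):
    """Frame analysis: mean absolute TEO energy of each frame.

    frame_len must be at least 3 so that every frame yields TEO values.
    """
    energies = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        psi = teager_energy(signal[start:start + frame_len])
        energies.append(sum(abs(v) for v in psi) / len(psi))
    return energies


def select_frames(energies, keep_ratio=0.5):
    """Frame selection (hypothetical rule): keep the top keep_ratio
    fraction of frames ranked by TEO energy, returned in time order."""
    n_keep = max(1, int(len(energies) * keep_ratio))
    ranked = sorted(range(len(energies)), key=lambda i: energies[i], reverse=True)
    return sorted(ranked[:n_keep])


# Toy signal: 8 samples of silence followed by a sinusoid at a quarter
# of the sampling rate (0, 1, 0, -1, ...).
signal = [0.0] * 8 + [0.0, 1.0, 0.0, -1.0] * 2
energies = frame_teo_energy(signal, frame_len=4, hop=4)
selected = select_frames(energies, keep_ratio=0.5)
```

In this toy case the two silent frames score zero and the two sinusoidal frames are selected. In a real system, only the feature vectors of the selected frames would be passed on to speaker and background model scoring.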
