Using phoneme recognition and text-dependent speaker verification to improve speaker segmentation for Chinese speech

Speaker segmentation is used in tasks such as multi-speaker detection and speaker tracking. Its performance depends to a large extent on the accuracy of speaker verification (SV) between two short utterances, so improving SV for short utterances should directly improve segmentation. This paper proposes a method based on phoneme recognition and text-dependent speaker recognition. During segmentation, a phoneme recognizer first produces a phoneme sequence; text-dependent speaker recognition based on dynamic time warping (DTW) is then performed on occurrences of the same phoneme in two adjacent windows. Experiments on the Chinese Corpus Consortium (CCC) MSS database showed better performance than the Bayesian information criterion (BIC) method and the generalized likelihood ratio (GLR) method.

Index Terms: speaker segmentation, phoneme recognition, text-dependent, short utterances
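
The comparison of same-phoneme segments across two adjacent windows can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `dtw_distance` and `window_distance`, the Euclidean frame distance, the normalization by path-length bound, and the averaging over phoneme pairs are all assumptions made for illustration. Each window is assumed to arrive as a list of (phoneme label, feature frames) pairs, as a phoneme recognizer and a feature extractor might produce.

```python
import math
from collections import defaultdict


def dtw_distance(a, b):
    """Length-normalized DTW distance between two sequences of feature vectors.

    Assumption: Euclidean distance between frames, with the three standard
    step directions (insertion, deletion, match).
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = cheapest alignment cost of a[:i] against b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # skip a frame of a
                                 cost[i][j - 1],      # skip a frame of b
                                 cost[i - 1][j - 1])  # align both frames
    # Normalize so that long and short segments are comparable (assumption).
    return cost[n][m] / (n + m)


def window_distance(left, right):
    """Average DTW distance over phonemes that occur in both windows.

    left, right: lists of (phoneme_label, frames) pairs (hypothetical format).
    Returns None when the windows share no phoneme.
    """
    by_label = defaultdict(list)
    for label, frames in left:
        by_label[label].append(frames)
    scores = []
    for label, frames in right:
        for other in by_label.get(label, []):
            scores.append(dtw_distance(other, frames))
    return sum(scores) / len(scores) if scores else None
```

In a segmenter built along these lines, a large `window_distance` between adjacent windows would suggest a speaker change at their boundary; the decision threshold would have to be tuned on development data.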
