A pitch-based rapid speech segmentation for speaker indexing

Segmenting continuous audio is an important processing step in many applications. In speaker indexing, the reliability of the speaker models depends heavily on the quality of the segmentation. Commonly used methods are based on the Bayesian information criterion (BIC), which, however, performs poorly on short utterances. In this paper, we present a pitch-based speech segmentation method that detects frequent speaker changes accurately and rapidly. The algorithm introduces pitch into speaker segmentation in three steps. First, utterance segments are detected from pitch. Then pitch distances are computed and compared with a self-adaptive threshold. Finally, speaker changes are decided among the utterance segments. We applied our method and three comparison methods to the HUB4-NE broadcast data, and ran speaker indexing experiments on the output of each algorithm. We also propose two indicators that complement the false alarm rate and the miss rate in the evaluation of segmentation. The experimental results show that our algorithm is faster and more accurate, detecting most short-duration speaker changes. The speaker indexing equal error rate of our method is 10.43%, which is lower than the 12.94%, 25.84% and 15.91% obtained with the other methods.
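The three steps above can be illustrated with a minimal sketch. It is not the implementation evaluated in the paper: the abstract does not specify the pitch distance or the threshold rule, so the sketch assumes a per-frame pitch track `f0` (Hz, with 0 marking unvoiced frames), takes the absolute difference of median pitch between adjacent segments as the distance, and sets the threshold to the mean of the distances plus `alpha` times their standard deviation. The function names and the parameters `max_gap`, `min_len` and `alpha` are hypothetical.

```python
from statistics import mean, median, pstdev


def utterance_segments(f0, max_gap=3, min_len=5):
    """Group voiced frames (f0 > 0) into (start, end) segments, end exclusive.

    Unvoiced gaps of at most `max_gap` frames are bridged; segments shorter
    than `min_len` frames are discarded. Both values are illustrative.
    """
    segments = []
    start = None        # first frame of the segment being built
    last_voiced = None  # most recent voiced frame seen
    for i, value in enumerate(f0):
        if value <= 0:
            continue
        if start is None:
            start = i
        elif i - last_voiced - 1 > max_gap:
            # The unvoiced gap is too long: close the current segment.
            segments.append((start, last_voiced + 1))
            start = i
        last_voiced = i
    if start is not None:
        segments.append((start, last_voiced + 1))
    return [(s, e) for s, e in segments if e - s >= min_len]


def segment_pitch(f0, segment):
    """Median pitch over the voiced frames of one segment."""
    s, e = segment
    return median(v for v in f0[s:e] if v > 0)


def speaker_changes(f0, alpha=0.5, max_gap=3, min_len=5):
    """Return (segments, change_frames).

    A boundary between adjacent segments is flagged as a speaker change when
    its pitch distance exceeds a threshold derived from the recording itself
    (assumed here: mean + alpha * standard deviation of all distances).
    """
    segments = utterance_segments(f0, max_gap=max_gap, min_len=min_len)
    pitches = [segment_pitch(f0, seg) for seg in segments]
    distances = [abs(b - a) for a, b in zip(pitches, pitches[1:])]
    if not distances:
        return segments, []
    threshold = mean(distances) + alpha * pstdev(distances)
    changes = [
        segments[i + 1][0]  # start frame of the segment after the boundary
        for i, d in enumerate(distances)
        if d > threshold
    ]
    return segments, changes
```

Because the threshold is computed from the distances observed in the same recording, it adapts to the pitch variability of the material rather than relying on a fixed constant. A practical system would replace the toy pitch track with the output of a pitch detector and would likely use a richer distance than the median difference assumed here.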
