Automatic segmentation of speakers in broadcast audio material

This paper applies dimension-reduced, decorrelated spectral features, developed for general sound recognition, to the segmentation of conversational speech in broadcast news audio and televised panel discussions. Without a priori information about the number of speakers, the audio stream is segmented by a hybrid metric-based and model-based segmentation algorithm. To measure performance, we compare the hybrid method against purely metric-based segmentation, using both the MPEG-7 standardized features and Mel-scale Frequency Cepstrum Coefficients (MFCC). Results show that the MFCC features outperform the MPEG-7 features, and that the hybrid approach significantly outperforms direct metric-based segmentation.
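As an illustration of what a metric-based segmentation step can look like, the following minimal sketch scores candidate speaker change points inside a window of feature frames with a delta-BIC criterion. The abstract does not state which metric is used, so the choice of delta-BIC, the diagonal-covariance Gaussian model, and the names `log_det_diag_cov`, `delta_bic`, `best_change_point`, `margin`, and `penalty_weight` are all assumptions made for this example, not the method of the paper.

```python
import math


def log_det_diag_cov(frames):
    """Log-determinant of a diagonal covariance fitted to `frames`.

    `frames` is a list of equal-length feature vectors (e.g. MFCC frames).
    A diagonal covariance is an assumption made here to keep the sketch short.
    """
    n = len(frames)
    dim = len(frames[0])
    total = 0.0
    for k in range(dim):
        mean = sum(f[k] for f in frames) / n
        var = sum((f[k] - mean) ** 2 for f in frames) / n
        total += math.log(var + 1e-8)  # small floor avoids log(0)
    return total


def delta_bic(frames, split, penalty_weight=1.0):
    """Delta-BIC for splitting `frames` at index `split`.

    Positive values favor two Gaussians (a change point) over a single one.
    """
    n = len(frames)
    dim = len(frames[0])
    left, right = frames[:split], frames[split:]
    gain = 0.5 * (
        n * log_det_diag_cov(frames)
        - len(left) * log_det_diag_cov(left)
        - len(right) * log_det_diag_cov(right)
    )
    # A second diagonal Gaussian adds 2 * dim parameters (means and variances).
    penalty = 0.5 * penalty_weight * (2 * dim) * math.log(n)
    return gain - penalty


def best_change_point(frames, margin=20, penalty_weight=1.0):
    """Return (index, score) of the highest-scoring candidate change point.

    `margin` keeps enough frames on each side to estimate variances.
    """
    candidates = range(margin, len(frames) - margin)
    scores = [delta_bic(frames, t, penalty_weight) for t in candidates]
    best = max(range(len(scores)), key=scores.__getitem__)
    return candidates[best], scores[best]
```

A change would be hypothesized in a window only when the returned score is positive. In a hybrid scheme of the kind described above, such metric-based boundaries could then seed speaker models that refine the segmentation, although the details of that second stage are not given in the abstract.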