Step-by-step and integrated approaches in broadcast news speaker diarization

This paper summarizes the collaboration of the LIA and CLIPS laboratories on speaker diarization of broadcast news during the spring NIST Rich Transcription 2003 evaluation campaign (NIST-RT03S). The speaker diarization task consists of segmenting a conversation into homogeneous segments, which are then grouped into speaker classes. Two approaches to speaker diarization are described and compared. The first relies on a classical two-step strategy: detection of speaker turns followed by a clustering process. The second uses an integrated strategy in which segment boundaries and the speaker tying of segments are extracted simultaneously and challenged throughout the whole process. These two methods are used to investigate various strategies for the fusion of diarization results. Furthermore, segmentation into acoustic macro-classes is proposed and evaluated as an a priori step to speaker diarization. The objective is to take advantage of the a priori acoustic information in the diarization process. Along with enriching the resulting segmentation with information about speaker gender,
