A new approach to speaker clustering is presented and discussed in this paper. The main technique consists of grouping all the homogeneous speech segments obtained at the end of the segmentation process by using the spatial information provided by stereophonic speech. The proposed system is suitable for debates or multi-conferences in which the speakers are located at fixed positions. The new method uses the differential energy of the two stereophonic signals, collected by two cardioid microphones, to group all the speech segments uttered by the same speaker. The total number of clusters obtained at the end should equal the real number of speakers present in the meeting room, and each cluster should contain the entire intervention of exactly one speaker. The proposed approach, which we call Energy Differential based Spatial Clustering (EDSC), is compared experimentally with a classic statistical approach called "Mono-Gaussian Sequential Clustering". Speaker clustering experiments are conducted on a stereophonic speech corpus called DB15, composed of 15 stereophonic scenarios of about 3.5 minutes each. Every scenario corresponds to a free discussion between several speakers seated at fixed positions in the meeting room. Results show the strong performance of the new approach in terms of precision and speed, especially for short speech segments.
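To make the idea of clustering by differential energy concrete, the following is a minimal sketch and not the EDSC implementation itself. It assumes that the differential is measured as the log-ratio (in dB) of the energies of the two channels over a segment, and that segments are grouped by a simple gap threshold on that value. The function names, the `gap_db` parameter, and the synthetic signals are all hypothetical and chosen only for illustration.

```python
import math

def energy_differential_db(left, right, eps=1e-12):
    """Assumed measure: log-ratio (dB) of left to right channel energy."""
    e_left = sum(x * x for x in left)
    e_right = sum(x * x for x in right)
    return 10.0 * math.log10((e_left + eps) / (e_right + eps))

def cluster_by_differential(diffs, gap_db=3.0):
    """Group segments whose sorted differentials differ by at most gap_db.

    Returns one cluster label per segment, in input order. The threshold
    rule is an illustrative assumption, not the rule used in the paper.
    """
    order = sorted(range(len(diffs)), key=lambda i: diffs[i])
    labels = [0] * len(diffs)
    current = 0
    for rank, idx in enumerate(order):
        # Start a new cluster when the gap to the previous value is large.
        if rank > 0 and diffs[idx] - diffs[order[rank - 1]] > gap_db:
            current += 1
        labels[idx] = current
    return labels

# Synthetic example: one base waveform scaled differently on each channel,
# mimicking speakers seated closer to one microphone or the other.
base = [math.sin(0.1 * k) for k in range(200)]

def scaled(gain):
    return [gain * x for x in base]

segments = [
    (scaled(1.0), scaled(0.25)),   # speaker near the left microphone
    (scaled(0.25), scaled(1.0)),   # speaker near the right microphone
    (scaled(0.9), scaled(0.25)),   # left-side speaker again, slightly quieter
]
diffs = [energy_differential_db(l, r) for l, r in segments]
labels = cluster_by_differential(diffs)
```

In this toy setting the first and third segments receive the same label because their differentials are close, while the second segment falls into a separate cluster. With real recordings the threshold would need to be tuned to the room and microphone geometry.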