In this study, we present a novel approach to the blind automatic segmentation of speakers in police interviews, combining phonetic features such as pitch with statistical pattern recognition of short-term power-spectrum features such as Mel-frequency cepstral coefficients (MFCCs). The approach requires minimal user intervention and allows the speech of separate speakers to be segmented easily from multi-speaker recordings. It can be of significant benefit in harvesting the speech of a single speaker for use in phonetic and automatic speaker recognition, as well as in gleaning quick intelligence from surveillance recordings. We propose a two-tiered approach to speaker segmentation: the first tier uses discontinuities in the pitch trajectories to identify potential speaker clusters, and the second uses an iterative speaker-assignment and training method based on Gaussian mixture models (GMMs). The approach is demonstrated using realistic and simulated police witness interviews.

Proposed approach and test databases

The pitch tracks for the voiced segments are extracted from the interview recording using the autocorrelation-based pitch tracker in Praat (Boersma, 1993). Based on discontinuities in the pitch track, we extract ‘zones of reliability’ for the identity of a speaker. A continuous ‘run’ of similar values in the pitch track provides such a zone of reliability, and any significant discontinuity in the pitch track, either in time or in frequency, is used to define a candidate transition point between speakers. These candidate transition points are then used to define clusters, as illustrated in Figure 1a. Clusters with sufficient information are used to model potential speakers. A statistical model of each cluster is then compared with all other segments in order to find the most divergent pair of segments.
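The first tier described above can be sketched as follows. This is a minimal numpy illustration, not the paper's implementation: the function names and the gap/jump thresholds are assumptions chosen for demonstration, and the pitch track itself is taken to have been extracted already (e.g. by Praat's autocorrelation tracker).

```python
import numpy as np

def candidate_transitions(times, f0, max_gap_s=0.3, max_jump_hz=40.0):
    """Find candidate speaker-transition points in a voiced-frame pitch track.

    A 'run' of temporally contiguous, similar pitch values is treated as a
    zone of reliability for one speaker; a large gap in time or a large jump
    in frequency between consecutive voiced frames marks a candidate
    transition. Thresholds are illustrative, not the paper's values.
    Returns indices i such that a transition falls between frame i and i+1.
    """
    times = np.asarray(times, dtype=float)
    f0 = np.asarray(f0, dtype=float)
    gaps = np.diff(times) > max_gap_s        # temporal discontinuity
    jumps = np.abs(np.diff(f0)) > max_jump_hz  # frequency discontinuity
    return np.flatnonzero(gaps | jumps)

def clusters_from_transitions(n_frames, transitions):
    """Split the frame indices into contiguous clusters at each transition."""
    return np.split(np.arange(n_frames), np.asarray(transitions) + 1)

# Toy example: 0.5 s of speech near 120 Hz, a pause, then 0.5 s near 210 Hz.
times = np.concatenate([np.arange(0.0, 0.5, 0.01), np.arange(1.0, 1.5, 0.01)])
f0 = np.concatenate([np.full(50, 120.0), np.full(50, 210.0)])
trans = candidate_transitions(times, f0)          # one transition at frame 49
clusters = clusters_from_transitions(len(f0), trans)  # two 50-frame clusters
```

Both the temporal gap and the pitch jump fire at the same frame here; in real recordings either cue alone is allowed to propose a transition, which is why the two boolean masks are OR-ed together.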
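The second tier, iterative GMM-based speaker assignment, might look like the following sketch. It assumes MFCC feature matrices have already been computed per segment and that the two most divergent seed segments have already been identified; scikit-learn's GaussianMixture stands in for the paper's GMM training, and the function name, component count, and iteration cap are all illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def iterative_gmm_assignment(segments, seed_a, seed_b,
                             n_components=2, max_iter=5, random_state=0):
    """Iteratively assign segments to two speakers and retrain their GMMs.

    segments : list of (n_frames_i, n_mfcc) feature arrays, one per segment
    seed_a, seed_b : indices of the two most divergent seed segments
    Returns an array of 0/1 speaker labels, one per segment.
    """
    labels = -np.ones(len(segments), dtype=int)  # -1 = not yet assigned
    labels[seed_a], labels[seed_b] = 0, 1
    for _ in range(max_iter):
        # Train one GMM per speaker on the segments currently assigned to it.
        models = []
        for spk in (0, 1):
            feats = np.vstack([s for s, l in zip(segments, labels) if l == spk])
            gmm = GaussianMixture(n_components=n_components,
                                  covariance_type="diag",
                                  random_state=random_state).fit(feats)
            models.append(gmm)
        # Reassign every segment to the model with higher average log-likelihood.
        new = np.array([int(np.argmax([m.score(s) for m in models]))
                        for s in segments])
        new[seed_a], new[seed_b] = 0, 1  # keep the seed segments pinned
        if np.array_equal(new, labels):
            break
        labels = new
    return labels
```

On the first pass only the two seed segments train the models; each subsequent pass retrains on the enlarged assignments until the labelling stops changing, which is the iterative assign-and-train loop the text describes.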
[1] Youngmoo E. Kim, et al., "Joint Iterative Multi-Speaker Identification and Source Separation using Expectation Propagation," 2007 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2007.
[2] Christian Wellekens, et al., "DISTBIC: A speaker-based segmentation for audio data indexing," Speech Communication, 2000.
[3] Francis Nolan, et al., "The DyViS database: style-controlled recordings of 100 homogeneous speakers for forensic phonetic research," 2009.
[4] Richard M. Stern, et al., "Voting for two speaker segmentation," INTERSPEECH, 2006.
[5] P. Boersma, "Accurate short-term analysis of the fundamental frequency and the harmonics-to-noise ratio of a sampled sound," 1993.