Sound Scene Clustering without Prior Knowledge

This paper discusses a method for classifying and editing lengthy speech material by sound scene, covering varied sources such as TV/radio broadcasts and meeting or discussion speech, under the assumption that no prior information is available. The proposed method comprises two technical parts: speech clustering using vector quantization (VQ) distortion as the criterion, and automatic adaptive threshold estimation. The approach is convenient and fast for building speech databases or for searching audio data. Experimental results show an F-measure of 94.13% on 1.67 hours of conversational speech, and a clustering rate of 84.1% on 30 minutes of speaker audio from telephone calls containing various noises. Even without a training model and with short utterances, we confirmed that the VQ distortion measure and the dynamic threshold estimation approach are easy to implement and well suited to clustering one-set recordings by sound scene (category).
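The core comparison the abstract describes, scoring how well one segment's feature vectors fit a VQ codebook trained on another segment, can be sketched as follows. This is a minimal illustration of the general VQ-distortion idea, not the paper's exact procedure: the codebook size, the k-means training, and the synthetic stand-in features are all assumptions for the example.

```python
import numpy as np

def train_codebook(features, k=8, iters=20, seed=0):
    """Train a small VQ codebook via k-means on one segment's feature vectors."""
    rng = np.random.default_rng(seed)
    codebook = features[rng.choice(len(features), size=k, replace=False)]
    for _ in range(iters):
        # Assign each frame to its nearest codeword, then update codewords.
        dists = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = features[labels == j]
            if len(members):
                codebook[j] = members.mean(axis=0)
    return codebook

def vq_distortion(features, codebook):
    """Average distance from each feature vector to its nearest codeword."""
    dists = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
    return dists.min(axis=1).mean()

# Synthetic frames standing in for MFCC-like features (hypothetical data).
rng = np.random.default_rng(1)
seg_a = rng.normal(0.0, 1.0, (200, 12))  # one source
seg_b = rng.normal(0.0, 1.0, (200, 12))  # same distribution as seg_a
seg_c = rng.normal(4.0, 1.0, (200, 12))  # a clearly different source

cb = train_codebook(seg_a)
same = vq_distortion(seg_b, cb)   # low: seg_b matches seg_a's codebook
diff = vq_distortion(seg_c, cb)   # high: seg_c does not
```

In a clustering pass, segments whose distortion against an existing cluster's codebook falls below a threshold would join that cluster; the paper's adaptive threshold estimation replaces a fixed cutoff with one derived from the data itself.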
