Automatic Audio Classification and Speaker Identification for Video Content Analysis

A growing body of literature applies audio content analysis techniques to content-based video parsing. This paper presents our work on audio classification and speaker identification for video content analysis. First, the soundtrack extracted from the video stream is partitioned into homogeneous segments using a classifier that combines rules with a support vector machine (SVM). Second, fixed-length speech clips, randomly selected from the speech segments, are grouped by spectral clustering. The clustered speech features are then used to initialize and train a Gaussian mixture model (GMM) for each speaker. Finally, the trained GMMs perform speaker identification. Experimental results confirm the validity of the proposed scheme.
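
The final identification step can be illustrated with a minimal sketch. It assumes diagonal-covariance GMMs and the common maximum average log-likelihood decision rule; the abstract does not state the covariance structure or the scoring rule, so both are assumptions. The function names (`gmm_avg_loglik`, `identify_speaker`), the speaker labels, and the hand-set model parameters are hypothetical and stand in for models that would be trained on the clustered speech features.

```python
import math

def gmm_avg_loglik(frames, weights, means, variances):
    """Average per-frame log-likelihood under a diagonal-covariance GMM."""
    total = 0.0
    for x in frames:
        # log(weight * Gaussian density) for each mixture component
        comp = []
        for w, mu, var in zip(weights, means, variances):
            ll = math.log(w)
            for xi, mi, vi in zip(x, mu, var):
                ll += -0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
            comp.append(ll)
        # log-sum-exp over components, stabilised by subtracting the maximum
        m = max(comp)
        total += m + math.log(sum(math.exp(c - m) for c in comp))
    return total / len(frames)

def identify_speaker(frames, models):
    """Return the best-scoring model name and the score of every model."""
    scores = {name: gmm_avg_loglik(frames, *params)
              for name, params in models.items()}
    return max(scores, key=scores.get), scores

# Hypothetical hand-set models: (weights, means, variances) per speaker.
# In practice these would come from training on each speaker's cluster.
models = {
    "speaker_a": ([0.5, 0.5], [[0.0, 0.0], [1.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]]),
    "speaker_b": ([0.5, 0.5], [[5.0, 5.0], [6.0, 4.0]], [[1.0, 1.0], [1.0, 1.0]]),
}

# Toy 2-D feature frames lying close to speaker_a's component means.
test_frames = [[0.2, -0.1], [0.9, 1.2], [0.4, 0.6]]
best, scores = identify_speaker(test_frames, models)
print(best)  # speaker_a
```

Averaging the log-likelihood over frames keeps scores comparable between clips of different lengths, and the log-sum-exp step avoids numerical underflow when component densities are very small.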
