Automatic Audio Classification and Speaker Identification for Video Content Analysis

Recent work has increasingly applied audio content analysis to content-based video parsing. This paper presents our work on audio classification and speaker identification for video content analysis. First, the soundtrack extracted from the video stream is partitioned into homogeneous segments using a classifier that combines rules with a support vector machine (SVM). Second, fixed-length speech clips randomly selected from the speech segments are grouped by spectral clustering. The clustered speech features are then used to initialize and train a Gaussian mixture model (GMM) for each speaker. Finally, the trained GMMs perform speaker identification. Experimental results confirm the validity of the proposed scheme.
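The final identification step can be illustrated with a minimal sketch, assuming each speaker is represented by a diagonal-covariance GMM and that a clip is assigned to the speaker whose model gives the highest average frame log-likelihood. The abstract does not state the decision rule, the feature type, or the model sizes, so the function names, the 2-D features, and the hand-set parameters below are hypothetical and chosen only for illustration; in practice the parameters would come from training on the clustered speech features.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def gmm_log_likelihood(x, gmm):
    """Log-likelihood of one frame under a GMM given as (weight, mean, var) triples."""
    terms = [math.log(w) + log_gaussian_diag(x, m, v) for w, m, v in gmm]
    peak = max(terms)
    # Log-sum-exp over mixture components for numerical stability.
    return peak + math.log(sum(math.exp(t - peak) for t in terms))

def identify_speaker(frames, speaker_gmms):
    """Return the best-scoring speaker and the per-speaker average log-likelihoods."""
    scores = {
        name: sum(gmm_log_likelihood(x, gmm) for x in frames) / len(frames)
        for name, gmm in speaker_gmms.items()
    }
    return max(scores, key=scores.get), scores

# Hypothetical 2-D features and hand-set model parameters (assumptions, not from the paper).
speaker_gmms = {
    "speaker_a": [(0.5, (0.0, 0.0), (1.0, 1.0)), (0.5, (1.0, 1.0), (1.0, 1.0))],
    "speaker_b": [(0.5, (5.0, 5.0), (1.0, 1.0)), (0.5, (6.0, 6.0), (1.0, 1.0))],
}
clip = [(0.2, -0.1), (0.9, 1.1), (0.4, 0.6)]
best, scores = identify_speaker(clip, speaker_gmms)
```

Averaging over frames keeps scores comparable across clips of different lengths; for fixed-length clips, as used here, summing the frame log-likelihoods would select the same speaker.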
