Paper information - Pitch-density-based features and an SVM binary tree approach for multi-class audio classification in broadcast news


Audio classification is an essential task in multimedia content analysis, which is a prerequisite to a variety of tasks such as segmentation, indexing and retrieval. This paper describes our study of multi-class audio classification on broadcast news, a popular multimedia repository with rich audio types. Motivated by the tonal regulations of music, we propose two pitch-density-based features, namely average pitch-density (APD) and relative tonal power density (RTPD). We use an SVM binary tree (SVM-BT) to hierarchically classify an audio clip into five classes: pure speech, music, environment sound, speech with music, and speech with environment sound. Since an SVM is a binary classifier, we use the SVM-BT architecture to realize coarse-to-fine multi-class classification with high accuracy and efficiency. Experiments show that the proposed one-dimensional APD and RTPD features achieve accuracy comparable to that of popular high-dimensional features in speech/music discrimination, and that the SVM-BT approach demonstrates superior performance in multi-class audio classification. With the help of the pitch-density-based features, we achieve a high average accuracy of 94.2% in the five-class audio classification task.
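The coarse-to-fine routing that a binary tree of two-class classifiers performs can be illustrated with a minimal sketch. The abstract does not state how the tree is laid out, so the node order below (speech-containing versus non-speech at the root) is an assumption, as are the names `Node`, `threshold`, `classify` and the score keys such as `has_speech`. The threshold functions are hypothetical stand-ins for trained binary SVMs, not the authors' classifiers.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Union

Features = Dict[str, float]


@dataclass
class Node:
    """One internal tree node: a binary decision and its two children."""
    decide: Callable[[Features], bool]  # stand-in for a trained binary SVM
    if_true: Union["Node", str]         # subtree or class label
    if_false: Union["Node", str]


def threshold(name: str, cut: float = 0.5) -> Callable[[Features], bool]:
    """Hypothetical binary classifier: compares one score to a cut-off."""
    return lambda x: x[name] > cut


# Assumed layout (not taken from the paper): five leaves, four binary nodes.
TREE = Node(
    threshold("has_speech"),
    Node(
        threshold("is_pure"),
        "pure speech",
        Node(
            threshold("background_is_music"),
            "speech with music",
            "speech with environment sound",
        ),
    ),
    Node(threshold("is_music"), "music", "environment sound"),
)


def classify(node: Union[Node, str], x: Features) -> str:
    """Walk from the root to a leaf, taking one binary decision per level."""
    while isinstance(node, Node):
        node = node.if_true if node.decide(x) else node.if_false
    return node


def count_classifiers(node: Union[Node, str]) -> int:
    """Number of binary classifiers (internal nodes) in the tree."""
    if isinstance(node, str):
        return 0
    return 1 + count_classifiers(node.if_true) + count_classifiers(node.if_false)


clip = {"has_speech": 0.9, "is_pure": 0.2, "background_is_music": 0.8}
print(classify(TREE, clip))     # speech with music
print(count_classifiers(TREE))  # 4
```

A binary tree with five leaves always has four internal nodes, so only four binary classifiers are needed, and in this assumed layout a clip passes through at most three of them. That is one way to read the efficiency claim, although the actual tree used in the paper may differ.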
