Pitch-density-based features and an SVM binary tree approach for multi-class audio classification in broadcast news

Audio classification is an essential task in multimedia content analysis and a prerequisite for tasks such as segmentation, indexing and retrieval. This paper describes our study of multi-class audio classification on broadcast news, a popular multimedia repository with rich audio types. Motivated by the tonal regulations of music, we propose two pitch-density-based features: average pitch-density (APD) and relative tonal power density (RTPD). Because an SVM is a binary classifier, we adopt an SVM binary tree (SVM-BT) architecture that hierarchically classifies an audio clip into five classes (pure speech, music, environment sound, speech with music, and speech with environment sound), realizing coarse-to-fine multi-class classification with high accuracy and efficiency. Experiments show that the proposed one-dimensional APD and RTPD features achieve accuracy comparable to that of popular high-dimensional features in speech/music discrimination, and that the SVM-BT approach demonstrates superior performance in multi-class audio classification. With the help of the pitch-density-based features, we achieve an average accuracy of 94.2% in the five-class audio classification task.
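The coarse-to-fine routing of a binary tree of binary classifiers can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the tree layout (speech vs. non-speech at the root), the three-value toy feature vector, and the hand-set linear decision functions that stand in for trained SVMs. The point is only that five classes need four binary decisions in total, and that any single clip passes through at most three of them.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple, Union

Features = Sequence[float]


@dataclass
class Node:
    """Internal tree node: one binary classifier plus its two children."""
    decide: Callable[[Features], float]   # stand-in for an SVM decision function
    positive: Union["Node", str]          # taken when decide(x) > 0
    negative: Union["Node", str]          # taken otherwise


def linear(weights: Sequence[float], bias: float) -> Callable[[Features], float]:
    """Hypothetical linear decision function w.x + b (not a trained model)."""
    def score(x: Features) -> float:
        return sum(w * v for w, v in zip(weights, x)) + bias
    return score


# Toy features (assumed): x = (speech_score, music_score, environment_score).
# Assumed tree layout: coarse split first, finer splits further down.
mixed = Node(linear((0, 1, -1), 0.0),
             "speech with music", "speech with environment sound")
speech = Node(linear((0, 1, 1), -0.5), mixed, "pure speech")
non_speech = Node(linear((0, 1, -1), 0.0), "music", "environment sound")
root = Node(linear((1, 0, 0), -0.5), speech, non_speech)


def classify(node: Union[Node, str], x: Features) -> Tuple[str, int]:
    """Walk from the root to a leaf; return the label and classifiers evaluated."""
    evaluations = 0
    while isinstance(node, Node):
        evaluations += 1
        node = node.positive if node.decide(x) > 0 else node.negative
    return node, evaluations


if __name__ == "__main__":
    for clip in [(0.9, 0.1, 0.1), (0.1, 0.8, 0.1), (0.9, 0.8, 0.1)]:
        print(clip, classify(root, clip))
```

In a real system each `decide` would be the decision function of an SVM trained on the clips that reach that node; the routing logic would stay the same.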
