An Audio-Based Deep Learning Framework For BBC Television Programme Classification

This paper proposes a deep learning framework for classifying BBC television programmes from their audio. The audio is first transformed into spectrograms, which are fed into a pre-trained Convolutional Neural Network (CNN) that outputs predicted probabilities of sound events occurring in the recording. Statistics of these predicted probabilities and of the detected sound events are then computed to extract discriminative features representing each television programme. Finally, the extracted features are fed into a classifier that assigns each programme to a genre. Our experiments are conducted on a dataset of 6,160 programmes belonging to nine genres labelled by the BBC. We achieve an average classification accuracy of 93.7% over 14-fold cross-validation, demonstrating the efficacy of the proposed framework for audio-based classification of television programmes.
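
To make the pipeline concrete, the sketch below illustrates one way the aggregation and classification stages could be wired together in Python. It is a minimal sketch under stated assumptions, not the authors' implementation: the pre-trained audio-tagging CNN is abstracted as a caller-supplied `tagger` function, and the sample rate, segment length, detection threshold, and logistic-regression classifier are illustrative choices rather than the configuration used in the paper.

```python
# Minimal sketch of the described pipeline (not the authors' exact implementation).
# Assumes `tagger` is any pre-trained audio-tagging CNN (e.g. a PANNs-style model)
# that maps a waveform segment to a vector of sound-event probabilities.
import numpy as np
import librosa
from sklearn.linear_model import LogisticRegression

SAMPLE_RATE = 32000      # assumed sampling rate expected by the tagger
SEGMENT_SECONDS = 10     # assumed analysis segment length
EVENT_THRESHOLD = 0.5    # assumed threshold for counting an event as detected

def programme_features(audio_path, tagger):
    """Aggregate per-segment sound-event probabilities into one programme-level feature vector."""
    y, _ = librosa.load(audio_path, sr=SAMPLE_RATE, mono=True)
    seg_len = SEGMENT_SECONDS * SAMPLE_RATE
    if len(y) < seg_len:                       # pad very short recordings
        y = np.pad(y, (0, seg_len - len(y)))
    segments = [y[i:i + seg_len] for i in range(0, len(y) - seg_len + 1, seg_len)]
    # `tagger(segment)` returns sound-event probabilities for one segment.
    probs = np.stack([tagger(seg) for seg in segments])    # (n_segments, n_events)
    detected = (probs >= EVENT_THRESHOLD).astype(float)
    # Simple statistics over segments serve as the programme embedding.
    return np.concatenate([probs.mean(0), probs.std(0),
                           probs.max(0), detected.mean(0)])

# Hypothetical usage: train and apply a genre classifier on the aggregated features.
# X = np.stack([programme_features(p, tagger) for p in train_paths])
# clf = LogisticRegression(max_iter=1000).fit(X, train_genres)
# predictions = clf.predict(np.stack([programme_features(p, tagger) for p in test_paths]))
```

The key design point the sketch reflects is that the pre-trained CNN is used only as a sound-event extractor; genre discrimination comes from statistics computed over its per-segment outputs, which any off-the-shelf classifier can then consume.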
