Speech-music discrimination using deep visual feature extractors

Abstract Speech-music discrimination is a traditional task in audio analytics: segmenting an audio stream and classifying each segment as either speech or music. It is useful in a wide range of applications, such as automatic speech recognition and radio broadcast monitoring. In this paper we investigate the capabilities of Convolutional Neural Networks (CNNs) for the speech-music discrimination task. Instead of representing the audio content with handcrafted audio features, as traditional methods do, we use deep architectures to learn visual feature dependencies as they appear in the spectrogram domain, i.e. we train a CNN using audio spectrograms as input images. The main contribution of our work concerns the potential of pre-trained deep architectures, combined with transfer learning, for training robust audio classifiers for the particular task of speech-music discrimination. We demonstrate the superiority of the proposed methods over both typical audio-based methods and deep-learning methods that adopt handcrafted features, and we evaluate our system in terms of classification performance and run-time execution. To our knowledge, this is the first work that investigates CNNs for speech-music discrimination and, more generally, the first that exploits transfer learning across very different domains for audio modeling with deep learning. In particular, we fine-tune a deep architecture originally trained for the ImageNet classification task, using a relatively small amount of data (almost 80 minutes of training audio samples) along with data augmentation. We evaluate our system through extensive experimentation on three different datasets: first, a real-world dataset of more than 10 hours of uninterrupted radio broadcasts, and then, for comparison purposes, two publicly available datasets designed specifically for the task of speech-music discrimination.
Our results indicate that, on all three test datasets, CNNs can significantly outperform the current state of the art, especially when transfer learning is applied. All the discussed methods, along with the whole experimental setup and the respective datasets, are openly provided for reproduction and further experimentation.
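The core representation described above, treating an audio segment as an image by computing its spectrogram, can be sketched as follows. This is a minimal NumPy illustration, not the authors' pipeline: the sampling rate, window length, and hop size below are illustrative assumptions, and a real system would resize the resulting matrix to the input resolution expected by the pre-trained CNN (e.g. 224x224 for typical ImageNet architectures).

```python
import numpy as np

def log_spectrogram(signal, win=400, hop=160):
    """Log-magnitude short-time spectrogram of a 1-D signal.

    Each column is the FFT magnitude of one Hann-windowed frame on a
    dB scale; the resulting 2-D array can be saved as a grayscale
    image and fed to a CNN like any other input image.
    Window/hop values here are illustrative, not the paper's settings.
    """
    window = np.hanning(win)
    n_frames = 1 + (len(signal) - win) // hop
    frames = np.stack([signal[i * hop: i * hop + win] * window
                       for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, axis=1))   # (frames, win//2 + 1)
    return 20.0 * np.log10(mag + 1e-10).T       # (freq bins, frames)

# One second of a 440 Hz tone at an assumed 16 kHz sampling rate:
# a pure tone yields a single bright horizontal line, whereas speech
# shows formant structure -- the visual cues the CNN learns to separate.
t = np.arange(16000) / 16000.0
spec = log_spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)  # (freq bins, time frames)
```

With these parameters the frequency resolution is 16000/400 = 40 Hz per bin, so the 440 Hz tone concentrates its energy around bin 11, which is easy to verify by inspecting the row-wise mean of `spec`.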
