Audio classification using attention-augmented convolutional neural network

Abstract

Audio classification, a set of important and challenging tasks, groups speech signals according to speakers' identities, accents, and emotional states. Because audio data is high-dimensional, task-specific hand-crafted feature extraction is generally required, and it is cumbersome to repeat for each audio classification task. More importantly, the inherent relationships among features have not been fully exploited. In this paper, the original speech signal is first represented as a spectrogram, which is then split along the frequency axis to form a frequency-distributed spectrogram. This paper proposes a task-independent model, called FreqCNN, that automatically extracts distinctive features from each frequency band using convolutional kernels. Furthermore, an attention mechanism is introduced to systematically enhance the features from certain frequency bands. The proposed FreqCNN is evaluated on three publicly available speech databases through three independent classification tasks. The results show that FreqCNN outperforms state-of-the-art methods on these tasks.
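
The frequency split and the band-level attention described above can be illustrated with a minimal NumPy sketch. It is not the FreqCNN implementation: the function names, the frame length, the hop size, and the number of bands are all assumptions made for illustration, and the per-band score (mean log energy) is a hand-picked stand-in for the scores that the model would learn from convolutional features.

```python
import numpy as np

def spectrogram(signal, frame_len=256, hop=128):
    """Magnitude spectrogram from a Hann-windowed short-time Fourier transform."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    # rfft yields frame_len // 2 + 1 frequency bins per frame.
    return np.abs(np.fft.rfft(frames, axis=1)).T  # (freq_bins, n_frames)

def split_bands(spec, n_bands):
    """Split the spectrogram along the frequency axis into contiguous bands."""
    return np.array_split(spec, n_bands, axis=0)

def band_attention(bands):
    """Weight each band by a softmax over per-band scores.

    Assumption: the score is the band's mean log energy, used here only
    as a placeholder for a learned attention score.
    """
    # One feature vector per band: energy averaged over the band's bins.
    features = np.stack([band.mean(axis=0) for band in bands])  # (n_bands, n_frames)
    scores = np.log(features.mean(axis=1) + 1e-8)               # (n_bands,)
    exp = np.exp(scores - scores.max())                         # stable softmax
    weights = exp / exp.sum()
    return weights[:, None] * features, weights

if __name__ == "__main__":
    sample_rate = 8000
    t = np.arange(sample_rate) / sample_rate
    tone = np.sin(2 * np.pi * 440.0 * t)   # one second of a 440 Hz tone
    spec = spectrogram(tone)
    bands = split_bands(spec, n_bands=4)
    weighted, weights = band_attention(bands)
    print("spectrogram shape:", spec.shape)
    print("band weights:", np.round(weights, 3))
```

In this sketch the softmax weights sum to one, so bands compete for emphasis; a trained model would derive the scores from the convolutional features of each band rather than from raw energy.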
