Shouted and Normal Speech Classification Using 1D CNN

Automatic shouted speech detection systems usually model the spectral characteristics of shouted speech to differentiate it from normal speech. Most prior work on shouted speech detection has relied on hand-crafted features. However, many works on audio processing suggest that approaches based on automatic feature learning are more robust than hand-crafted feature engineering. This work reaffirms this notion by proposing a 1D-CNN architecture for the shouted and normal speech classification task. The CNN learns features from the magnitude spectrum of speech frames, and classification is performed by fully connected layers in the later stages of the network. The performance of the proposed architecture is evaluated on three datasets and compared against three existing approaches. As an additional contribution, the features learned by the CNN kernels are discussed with relevant visualizations.
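The pipeline described above (magnitude spectrum of a speech frame, 1D convolution, fully connected classification) can be sketched as a forward pass in NumPy. This is only an illustration under assumed settings, not the architecture from the paper: the frame length (400 samples), Hamming window, FFT size, kernel width, number of kernels, pooling size, hidden-layer size, and random weights are all hypothetical, since the abstract does not specify them.

```python
import numpy as np

# All sizes below are illustrative assumptions, not values from the paper.
N_FFT = 512         # FFT size (frames are zero-padded to this length)
KERNEL_WIDTH = 9    # width of each 1D kernel along the frequency axis
N_KERNELS = 8       # number of convolution kernels
POOL = 2            # max-pooling size
HIDDEN = 32         # units in the fully connected hidden layer


def magnitude_spectrum(frame, n_fft=N_FFT):
    """Magnitude of the one-sided FFT of a Hamming-windowed frame."""
    windowed = frame * np.hamming(len(frame))
    return np.abs(np.fft.rfft(windowed, n=n_fft))  # shape: (n_fft // 2 + 1,)


def conv1d(x, kernels, bias):
    """Valid 1D cross-correlation.
    x: (in_channels, length); kernels: (out_channels, in_channels, width)."""
    width = kernels.shape[2]
    # windows has shape (in_channels, out_length, width)
    windows = np.lib.stride_tricks.sliding_window_view(x, width, axis=1)
    return np.einsum("ilw,oiw->ol", windows, kernels) + bias[:, None]


def max_pool1d(x, size):
    """Non-overlapping max pooling; trailing samples that do not fit are dropped."""
    usable = (x.shape[1] // size) * size
    return x[:, :usable].reshape(x.shape[0], -1, size).max(axis=2)


def relu(x):
    return np.maximum(x, 0.0)


def softmax(z):
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()


def init_params(rng):
    """Random, untrained weights with shapes derived from the constants above."""
    n_bins = N_FFT // 2 + 1
    flat = N_KERNELS * ((n_bins - KERNEL_WIDTH + 1) // POOL)
    return {
        "conv_w": rng.normal(0.0, 0.1, (N_KERNELS, 1, KERNEL_WIDTH)),
        "conv_b": np.zeros(N_KERNELS),
        "fc_w": rng.normal(0.0, 0.1, (HIDDEN, flat)),
        "fc_b": np.zeros(HIDDEN),
        "out_w": rng.normal(0.0, 0.1, (2, HIDDEN)),
        "out_b": np.zeros(2),
    }


def forward(frame, p):
    """Return [P(normal), P(shouted)] for one speech frame (label order assumed)."""
    x = magnitude_spectrum(frame)[None, :]            # one input channel
    h = max_pool1d(relu(conv1d(x, p["conv_w"], p["conv_b"])), POOL)
    hidden = relu(p["fc_w"] @ h.reshape(-1) + p["fc_b"])
    return softmax(p["out_w"] @ hidden + p["out_b"])


rng = np.random.default_rng(0)
params = init_params(rng)
frame = rng.normal(size=400)   # stand-in for 25 ms of speech at 16 kHz
probs = forward(frame, params)
print(probs)
```

Because the weights are random, the printed probabilities carry no meaning; in practice the kernels and fully connected layers would be trained jointly on labeled frames, which is what lets the convolution kernels act as learned spectral feature detectors.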
