Spectrogram-Based Classification Of Spoken Foul Language Using Deep CNN

Excessive profanity in audio and video files has been shown to shape a person's character and behavior. Detection and censorship are currently performed manually, which is time-consuming and prone to missing foul language. This paper proposes an intelligent model for foul language censorship based on automated, robust detection with deep convolutional neural networks (CNNs). A dataset of foul language was collected and processed to compute audio spectrogram images, which serve as the input for evaluating foul language classification. The proposed model was first tested on a 2-class (Foul vs. Normal) classification problem. The foul class was then decomposed further into a 10-class classification problem to identify the exact profanity. Experimental results show the viability of the proposed system, demonstrating high performance in curse word classification with 1.24-2.71 Error Rate (ER) for 2-class and 5.49-8.30 F1-score. The proposed ResNet50 architecture outperforms the other models in terms of accuracy, sensitivity, specificity, and F1-score.
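The evaluation measures named above (accuracy, sensitivity, specificity, F1-score, and error rate) can be illustrated for the 2-class case with a minimal sketch. The abstract does not give the paper's exact definitions, so the standard confusion-matrix formulas are assumed here, with "Foul" as the positive class and error rate taken as the percentage of misclassified samples. The names `BinaryCounts` and `binary_metrics` and the example counts are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class BinaryCounts:
    """Confusion-matrix counts with 'Foul' assumed to be the positive class."""
    tp: int  # foul clips predicted as foul
    fp: int  # normal clips predicted as foul
    tn: int  # normal clips predicted as normal
    fn: int  # foul clips predicted as normal


def _ratio(numerator: float, denominator: float) -> float:
    # Return 0.0 instead of raising when a class is absent from the data.
    return numerator / denominator if denominator else 0.0


def binary_metrics(c: BinaryCounts) -> dict:
    """Standard metrics; error rate is assumed to be 100 * (1 - accuracy)."""
    total = c.tp + c.fp + c.tn + c.fn
    accuracy = _ratio(c.tp + c.tn, total)
    sensitivity = _ratio(c.tp, c.tp + c.fn)  # recall on the foul class
    specificity = _ratio(c.tn, c.tn + c.fp)  # recall on the normal class
    precision = _ratio(c.tp, c.tp + c.fp)
    f1 = _ratio(2 * precision * sensitivity, precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
        "error_rate_pct": 100.0 * (1.0 - accuracy),
    }


if __name__ == "__main__":
    # Made-up counts for illustration only; not results from the paper.
    example = BinaryCounts(tp=90, fp=5, tn=95, fn=10)
    for name, value in binary_metrics(example).items():
        print(f"{name}: {value:.4f}")
```

For the 10-class problem, the same formulas could be applied per class in a one-vs-rest fashion and then averaged. Whether the paper averages this way is not stated in the abstract.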
