On the use of voice activity detection in speech emotion recognition

Emotion recognition from speech has many potential applications, but achieving a high recognition rate is difficult when resources are limited or when interference such as noise is present. In this paper, we explore whether voice activity detection (VAD) can improve speech emotion recognition. Emotional speech from the Berlin Emotion Database (EMO-DB) and from a custom-made database, the LQ Audio Dataset, is first preprocessed by VAD before feature extraction. The extracted features are then passed to a deep neural network for classification. Mel-frequency cepstral coefficients (MFCC) are used as the sole feature. Comparing the results obtained with and without VAD, we found that VAD improved the recognition rate for five emotions (happy, angry, sad, fear, and neutral) by 3.7% on clean signals. When the network was trained on both clean and noisy signals, using VAD improved on our previous results by 50%.
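The VAD preprocessing step described above can be illustrated with a minimal sketch. The abstract does not state which VAD algorithm was used, so the short-time energy threshold below is an assumption chosen for simplicity and is not the authors' method. The function names (`frame_signal`, `energy_vad`, `remove_silence`) and the parameter values (25 ms frames, 10 ms hop, a threshold of -35 dB relative to the loudest frame) are likewise hypothetical.

```python
import numpy as np


def frame_signal(signal, frame_len, hop_len):
    """Split a 1-D signal into overlapping frames, one frame per row."""
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    starts = hop_len * np.arange(n_frames)[:, None]
    offsets = np.arange(frame_len)[None, :]
    return signal[starts + offsets]


def energy_vad(signal, sample_rate, frame_ms=25, hop_ms=10, threshold_db=-35.0):
    """Mark frames as speech when their energy exceeds a relative threshold.

    Assumption: a plain energy detector, not the VAD used in the paper.
    Returns a boolean mask per frame plus the frame and hop lengths in samples.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    frames = frame_signal(signal, frame_len, hop_len)
    energy = np.mean(frames ** 2, axis=1)
    # Energy in dB relative to the loudest frame; epsilons avoid log(0).
    energy_db = 10.0 * np.log10(energy / (np.max(energy) + 1e-12) + 1e-12)
    return energy_db > threshold_db, frame_len, hop_len


def remove_silence(signal, sample_rate, **vad_kwargs):
    """Keep only the samples covered by at least one speech frame."""
    mask, frame_len, hop_len = energy_vad(signal, sample_rate, **vad_kwargs)
    keep = np.zeros(len(signal), dtype=bool)
    for i in np.flatnonzero(mask):
        keep[i * hop_len : i * hop_len + frame_len] = True
    return signal[keep]
```

In a pipeline like the one in the abstract, the output of `remove_silence` would be passed to MFCC extraction and then to the classifier. A fixed relative threshold is only a starting point, and it is likely to need tuning or replacement for noisy recordings.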
