Vocal-based emotion recognition using random forests and decision tree

This paper proposes a new vocal-based emotion recognition method using random forests, in which features computed over the whole speech signal, namely pitch, intensity, the first four formants and their bandwidths, mean autocorrelation, mean noise-to-harmonics ratio, and standard deviation, are used to recognise the emotional state of a speaker. The proposed technique adopts random forests to represent the speech signals, along with the decision-tree approach, in order to classify them into different categories. The emotions are broadly categorised into six groups: happiness, fear, sadness, neutral, surprise, and disgust. The Surrey Audio-Visual Expressed Emotion database is used. According to the experimental results using leave-one-out cross-validation, by combining the most significant prosodic features, the proposed method achieves an average recognition rate of 66.28%, and at the highest level a recognition rate of 78% is obtained, which belongs to the happiness voice signals. The proposed method has a 13.78% higher average recognition rate and a 28.1% higher best recognition rate than linear discriminant analysis, as well as a 6.58% higher average recognition rate than deep neural networks, both of which have been implemented on the same database.
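The pipeline described above, per-utterance prosodic features fed to a random forest and evaluated with leave-one-out cross-validation, can be sketched as follows. This is a minimal illustration using scikit-learn with synthetic stand-in feature values; in the paper the twelve features (pitch, intensity, formants F1–F4 and their bandwidths, mean autocorrelation, mean noise-to-harmonics ratio, and standard deviation) are measured over the whole speech signal of each SAVEE utterance, which is not reproduced here.

```python
# Sketch of the described method: random-forest classification of
# whole-signal prosodic features, scored with leave-one-out CV.
# Feature values below are random placeholders, not real SAVEE data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(0)
n_utterances, n_features = 60, 12      # 12 prosodic features per utterance
X = rng.normal(size=(n_utterances, n_features))
# Six emotion classes: happiness, fear, sadness, neutral, surprise, disgust
y = rng.integers(0, 6, size=n_utterances)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(clf, X, y, cv=LeaveOneOut())  # one score per held-out utterance
print(f"LOO recognition rate: {scores.mean():.2%}")
```

With real acoustic features in place of the random matrix, the mean of the leave-one-out scores corresponds to the average recognition rate reported in the paper.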
