Cross corpus multi-lingual speech emotion recognition using ensemble learning

Eliciting an accurate emotional response from robots has challenged researchers for years. As technology advances, service robots increasingly interact with users from diverse cultural and linguistic backgrounds. The conventional approach to speech emotion recognition trains and tests classifiers on the same corpus, but this does not generalize to multi-lingual environments, a requirement for robots deployed across the globe. In this paper, a series of experiments highlights the effect of ensemble learning with a majority voting technique for a cross-corpus, multi-lingual speech emotion recognition system. The ensemble approach is compared against traditional machine learning algorithms, and classifiers trained on one corpus are tested on data from another to evaluate their suitability for multi-lingual emotion detection. Experimental analysis shows that different classifiers achieve the highest accuracy on different corpora; ensemble learning therefore combines the strengths of all classifiers rather than choosing a single one and compromising accuracy on certain language corpora. Within-corpus experiments show accuracy gains of 13% for the Urdu corpus, 8% for the German corpus, 11% for the Italian corpus, and 5% for the English corpus. In cross-corpus experiments, improvements of 2% are achieved when training on Urdu data and testing on German data, and 15% when training on Urdu data and testing on Italian data. Further gains of 7% are obtained when training on German data and testing on Urdu data, 3% when training on Italian data and testing on Urdu data, and 5% when training on English data and testing on Urdu data. These experiments demonstrate that the ensemble learning approach yields promising results against other state-of-the-art techniques.
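The cross-corpus majority-voting scheme described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the corpora here are synthetic stand-ins (real features would come from a tool such as openSMILE or librosa), and the choice of base classifiers (SVM, random forest, logistic regression) is an assumption for demonstration only.

```python
# Sketch of a cross-corpus majority-voting ensemble for emotion classification.
# Synthetic data stands in for acoustic feature vectors of two speech corpora.
import numpy as np
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

def make_corpus(n=200, d=40, shift=0.0):
    """Synthetic stand-in for one emotional speech corpus: features + labels."""
    X = rng.normal(shift, 1.0, size=(n, d))
    y = rng.integers(0, 4, size=n)        # 4 classes, e.g. angry/happy/neutral/sad
    X[np.arange(n), y] += 2.0             # inject class-dependent signal
    return X, y

# Train on one "corpus", test on a shifted one (cross-corpus setting).
X_train, y_train = make_corpus(shift=0.0)
X_test, y_test = make_corpus(shift=0.3)

# Hard (majority) voting over three heterogeneous base classifiers.
ensemble = VotingClassifier(
    estimators=[
        ("svm", make_pipeline(StandardScaler(), SVC())),
        ("rf", RandomForestClassifier(random_state=0)),
        ("lr", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
    ],
    voting="hard",
)
ensemble.fit(X_train, y_train)
acc = ensemble.score(X_test, y_test)
print(f"cross-corpus accuracy: {acc:.2f}")
```

With hard voting, each base classifier casts one vote per utterance and the majority label wins, which is why the ensemble can recover accuracy on corpora where any single classifier underperforms.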