Apprentissage profond appliqué à la reconnaissance des émotions dans la voix. (Deep learning applied to speech emotion recognition)

My thesis investigates the use of new artificial intelligence techniques for automatically classifying audio sequences according to the emotional state of a customer during a conversation with a call-center agent. In 2016, the goal was to depart from the data preprocessing pipelines and machine learning models already in use in the laboratory, and to propose a model that performs as well as possible on the IEMOCAP audio database. We build on existing work on deep neural networks for speech recognition and study its extension to speech emotion recognition. We therefore focus on end-to-end neural architectures, which learn on their own to extract the acoustic characteristics of the audio signal that are relevant to the classification task. For a long time, the audio signal was preprocessed into paralinguistic features as part of an expert approach. We instead adopt a naive preprocessing approach that requires no specialized paralinguistic knowledge, so that the two approaches can be compared: the raw audio signal is transformed into a time-frequency spectrogram using a short-time Fourier transform.

Exploiting a neural network for a given prediction task raises several issues. On the one hand, the best possible hyperparameters must be selected. On the other hand, the biases present in the database must be minimized (non-discrimination), for example by adding data, and the characteristics of the chosen database must be taken into account; the goal is to optimize the classification algorithm as well as possible. We study these aspects for an end-to-end neural architecture that combines convolutional layers, specialized in processing visual information, with recurrent layers, specialized in processing temporal information. We propose a deep supervised learning model that is competitive with the state of the art on IEMOCAP, which justifies using it for the remaining experiments. The classifier consists of four convolutional layers followed by a bidirectional long short-term memory recurrent network (BLSTM). It is evaluated on two English-language audio databases provided by the scientific community, IEMOCAP and MSP-IMPROV. A first contribution is to show that this deep neural network achieves high performance on IEMOCAP and promising results on MSP-IMPROV.
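To make the pipeline concrete, below is a minimal sketch in PyTorch of a log-magnitude STFT front end feeding four convolutional layers and a BLSTM. Only the overall shape (spectrogram input, four convolutional layers, one bidirectional LSTM) follows the description above; every window length, channel count, kernel size, and the four-class output layer are illustrative assumptions, not the configuration actually used in the thesis.

    # Sketch only: layer counts follow the abstract; all sizes are assumptions.
    import torch
    import torch.nn as nn

    def log_spectrogram(waveform: torch.Tensor, n_fft: int = 512, hop: int = 160) -> torch.Tensor:
        """Short-time Fourier transform magnitude, log-compressed.
        waveform: (batch, samples) -> (batch, n_fft // 2 + 1, frames)."""
        spec = torch.stft(waveform, n_fft=n_fft, hop_length=hop,
                          window=torch.hann_window(n_fft), return_complex=True)
        return torch.log1p(spec.abs())

    class CnnBlstm(nn.Module):
        def __init__(self, n_freq_bins: int = 257, hidden: int = 128, n_classes: int = 4):
            super().__init__()
            # Four convolutional layers extract local time-frequency patterns.
            self.conv = nn.Sequential(
                nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d((2, 2)),
                nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d((2, 2)),
                nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d((2, 1)),
                nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d((2, 1)),
            )
            freq_out = n_freq_bins // 16  # frequency axis halved by each of the 4 poolings
            # A bidirectional LSTM models the temporal structure of the conv features.
            self.blstm = nn.LSTM(64 * freq_out, hidden, batch_first=True, bidirectional=True)
            self.head = nn.Linear(2 * hidden, n_classes)

        def forward(self, spec: torch.Tensor) -> torch.Tensor:
            x = self.conv(spec.unsqueeze(1))      # (B, 1, F, T) -> (B, 64, F', T')
            x = x.permute(0, 3, 1, 2).flatten(2)  # -> (B, T', 64 * F')
            out, _ = self.blstm(x)                # -> (B, T', 2 * hidden)
            return self.head(out.mean(dim=1))     # utterance-level emotion logits

    wav = torch.randn(2, 16000)                # two dummy 1-second clips at 16 kHz
    logits = CnnBlstm()(log_spectrogram(wav))  # shape (2, 4)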
Another contribution of this thesis is a comparative study of the layer outputs of the convolutional module and the recurrent module depending on the voice preprocessing applied upstream: spectrograms (naive approach) or paralinguistic features (expert approach). Using the Euclidean distance, a deterministic proximity measure, we analyze the data according to their associated emotion, and we attempt to understand the characteristics of the emotional information the network extracts autonomously. The aim is to contribute to research centered on understanding the deep neural networks used in speech emotion recognition, and to bring more transparency and explainability to systems whose decision-making mechanisms are still largely opaque.
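As a hint of what such a comparative analysis can look like in practice, here is a minimal sketch assuming activations have been collected from one layer of the network (for example with a forward hook) together with their emotion labels: it computes the mean pairwise Euclidean distance within and between emotion classes. The function name and the NumPy formulation are illustrative, not the thesis's actual tooling.

    # Sketch only: mean Euclidean distances between layer outputs, grouped by emotion.
    import itertools
    import numpy as np

    def class_distance_matrix(activations: np.ndarray, labels: np.ndarray) -> np.ndarray:
        """activations: (n_samples, n_features) layer outputs; labels: (n_samples,) emotion ids.
        Returns an (n_classes, n_classes) matrix of mean pairwise Euclidean distances."""
        classes = np.unique(labels)
        dist = np.zeros((len(classes), len(classes)))
        for (i, a), (j, b) in itertools.product(list(enumerate(classes)), repeat=2):
            xa = activations[labels == a]
            xb = activations[labels == b]
            # Broadcast to all cross pairs; diagonal entries include zero self-distances.
            pair = np.linalg.norm(xa[:, None, :] - xb[None, :, :], axis=-1)
            dist[i, j] = pair.mean()
        return dist

Comparing such matrices for a spectrogram-fed network and a paralinguistic-feature-fed network is one way to quantify how each preprocessing shapes the emotional structure of the learned representations.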
