Speech emotion detection using time-dependent self-organizing maps

We propose a framework for speech emotion detection that maps acoustic features into high-level descriptors that integrate temporal context. The framework uses three different algorithms to integrate this context. The first is based on temporal averaging of the original features. The second derives the descriptors by clustering the data with self-organizing maps (SOMs) and computing the temporal average of the activity distribution of the original features on the map. The third uses multiresolution window analysis and SOMs to compute a 2-D map of emotions and derives high-level trajectories representing the behavior of the original features on the map. Using a standard emotional database and a K-nearest-neighbors classifier, we show that the proposed framework is efficient for the analysis, visualization, and classification of emotions.
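
The three descriptor types described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the function names, the 4x4 map size, the learning-rate and neighborhood schedules, the fixed window length, and the random synthetic "utterance" that stands in for real acoustic features. In particular, the multiresolution analysis is reduced here to a single window length; repeating `map_trajectory` with several window lengths would be one plausible way to approximate it.

```python
import numpy as np

def temporal_average(frames):
    """Descriptor 1: mean of the frame-level features over time."""
    return frames.mean(axis=0)

def train_som(data, rows=4, cols=4, epochs=20, lr0=0.5, sigma0=2.0, seed=0):
    """Train a small online SOM (hypothetical settings, not the paper's)."""
    rng = np.random.default_rng(seed)
    # Grid coordinates of every map unit, shape (rows * cols, 2).
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    # Initialise unit weights from randomly chosen data frames.
    weights = data[rng.integers(0, len(data), rows * cols)].astype(float)
    n_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for x in data[rng.permutation(len(data))]:
            frac = step / n_steps
            lr = lr0 * (1.0 - frac)                 # linearly decaying learning rate
            sigma = sigma0 * (1.0 - frac) + 0.5     # shrinking neighborhood width
            bmu = np.argmin(((weights - x) ** 2).sum(axis=1))
            d2 = ((grid - grid[bmu]) ** 2).sum(axis=1)
            h = np.exp(-d2 / (2.0 * sigma ** 2))    # Gaussian neighborhood on the grid
            weights += lr * h[:, None] * (x - weights)
            step += 1
    return weights, grid

def best_matching_units(frames, weights):
    """Index of the closest map unit for every frame."""
    d2 = ((frames[:, None, :] - weights[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def activity_histogram(frames, weights):
    """Descriptor 2: fraction of frames that activate each map unit."""
    bmus = best_matching_units(frames, weights)
    hist = np.bincount(bmus, minlength=len(weights)).astype(float)
    return hist / hist.sum()

def map_trajectory(frames, weights, grid, window=5):
    """Descriptor 3: mean map position per non-overlapping window of frames."""
    coords = grid[best_matching_units(frames, weights)]  # shape (T, 2)
    n_win = len(frames) // window
    return coords[: n_win * window].reshape(n_win, window, 2).mean(axis=1)

# Synthetic stand-in for one utterance: 40 frames of 3 acoustic features.
rng = np.random.default_rng(1)
utterance = rng.normal(size=(40, 3))

weights, grid = train_som(utterance)
desc_mean = temporal_average(utterance)                    # shape (3,)
desc_hist = activity_histogram(utterance, weights)         # shape (16,), sums to 1
desc_traj = map_trajectory(utterance, weights, grid, 5)    # shape (8, 2)
```

Each descriptor is a fixed-format array that could then be passed to a nearest-neighbor classifier. In practice the map would be trained on frames pooled from many utterances rather than on a single one, as is done here only to keep the example short.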
