Automated Recognition of Paralinguistic Signals in Spoken Dialogue Systems: Ways of Improvement

The ability of artificial systems to recognize paralinguistic signals, such as emotions, depression, or openness, is useful in various applications. However, the performance of such recognizers is not yet perfect. In this study we consider several directions which can significantly improve it. Firstly, we propose building speaker- or gender-specific emotion models: the emotion recognition (ER) procedure is combined with a gender or speaker identifier, whose output is either included in the feature vector directly or used to train a separate emotion recognition model for each gender or speaker. Secondly, since feature selection is an important part of any classification problem, we propose a feature selection technique based on either a genetic algorithm or an information gain measure; both result in higher performance than baseline methods without any feature selection. Finally, we suggest analysing not only audio signals, but also combined audio-visual cues: the early fusion method (feature-level fusion) is used to merge the modalities into a multimodal approach, which outperforms the single modalities on the considered corpora. The suggested methods have been evaluated on a number of emotional databases in three languages (English, German and Japanese), covering both acted and non-acted settings; the results of the numerical experiments are reported in the study.
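As a concrete illustration of the first and third directions, the following minimal Python sketch (our reading of the abstract, not the authors' implementation; scikit-learn is assumed and all class and function names are hypothetical) shows a gender-dependent ER pipeline: per-utterance audio and visual feature vectors are concatenated (early fusion), a gender identifier routes each utterance to a gender-specific emotion model, and a mutual-information selector stands in for the information gain criterion.

```python
# Illustrative sketch only (not the authors' code): gender-dependent emotion
# recognition with early audio-visual fusion and information-gain-style
# feature selection. Assumes scikit-learn; all names are hypothetical.
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

def early_fusion(audio_feats, video_feats):
    """Feature-level (early) fusion: concatenate the per-utterance vectors."""
    return np.hstack([audio_feats, video_feats])

class GenderDependentER:
    """Identify the speaker's gender first, then route the utterance to a
    gender-specific emotion model with its own selected feature subset."""

    def __init__(self, k_best=200):
        self.k_best = k_best
        self.gender_clf = LogisticRegression(max_iter=1000)
        self.selectors = {}     # one feature selector per gender
        self.emotion_clfs = {}  # one emotion model per gender

    def fit(self, X, gender, emotion):
        self.gender_clf.fit(X, gender)
        for g in np.unique(gender):
            mask = gender == g
            # Mutual information as an information-gain-style criterion
            sel = SelectKBest(mutual_info_classif,
                              k=min(self.k_best, X.shape[1]))
            Xg = sel.fit_transform(X[mask], emotion[mask])
            self.selectors[g] = sel
            self.emotion_clfs[g] = SVC(kernel="rbf").fit(Xg, emotion[mask])
        return self

    def predict(self, X):
        g_hat = self.gender_clf.predict(X)  # gender identifier runs first
        y_hat = np.empty(len(X), dtype=object)
        for g in np.unique(g_hat):
            mask = g_hat == g
            Xg = self.selectors[g].transform(X[mask])
            y_hat[mask] = self.emotion_clfs[g].predict(Xg)
        return y_hat
```

Here `X` would be the early-fused matrix, e.g. `X = early_fusion(A, V)` with one row per utterance. The abstract's alternative, including the identifier's output in the feature vector instead of training separate models, would replace the per-gender dictionary with a single classifier over the augmented features.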
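The genetic-algorithm alternative for feature selection could look like the minimal sketch below. The abstract does not specify the operators, so the population size, truncation selection, one-point crossover, and bit-flip mutation here are all illustrative assumptions.

```python
# Minimal sketch of genetic-algorithm feature selection (assumed operators;
# the paper's exact configuration is not specified in the abstract).
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def ga_select(X, y, pop_size=20, generations=15, mutation_rate=0.05, seed=None):
    rng = np.random.default_rng(seed)
    n_feats = X.shape[1]
    # Each individual is a binary mask over the feature set
    pop = rng.random((pop_size, n_feats)) < 0.5

    def fitness(mask):
        if not mask.any():
            return 0.0
        # Cross-validated accuracy of a classifier on the selected subset
        return cross_val_score(SVC(), X[:, mask], y, cv=3).mean()

    for _ in range(generations):
        scores = np.array([fitness(ind) for ind in pop])
        order = np.argsort(scores)[::-1]
        parents = pop[order[: pop_size // 2]]           # truncation selection
        children = []
        while len(children) < pop_size - len(parents):
            a, b = parents[rng.integers(len(parents), size=2)]
            cut = rng.integers(1, n_feats)              # one-point crossover
            child = np.concatenate([a[:cut], b[cut:]])
            flip = rng.random(n_feats) < mutation_rate  # bit-flip mutation
            children.append(child ^ flip)
        pop = np.vstack([parents, children])

    return max(pop, key=fitness)  # boolean mask of the selected features
```

`ga_select` returns a boolean mask; the reduced matrix `X[:, mask]` would then be fed to the emotion classifier in place of the full feature vector.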
