Speaker state classification based on fusion of asymmetric simple partial least squares (SIMPLS) and support vector machines

This paper studies how acoustic features, speaker normalization methods, and statistical modeling techniques affect speaker state classification. We focus on the behavior of simple partial least squares (SIMPLS) in unbalanced binary classification. Besides offering dimension reduction and low computational complexity, the SIMPLS classifier (SIMPLSC) yields notably higher prediction accuracy on the class with fewer samples. We therefore propose an asymmetric SIMPLS classifier (ASIMPLSC) that improves the performance of SIMPLSC on the class with more samples. Furthermore, we combine the outputs of multiple systems (the ASIMPLS classifier and support vector machines) by score-level fusion to exploit the complementary information in diverse systems. The proposed speaker state classification system is evaluated in several experiments on unbalanced data sets. In the Interspeech 2011 Speaker State Challenge, we achieved the best result for the 2-class task of the Sleepiness Sub-Challenge, with an unweighted average recall of 71.7%. Further experiments on the SEMAINE data sets show that the ASIMPLSC achieves absolute improvements in weighted average recall of 6.1%, 6.1%, 24.5%, and 1.3% over the AVEC 2011 baseline system on the binary emotional speech classification tasks for the four dimensions activation, expectation, power, and valence, respectively.
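
The abstract reports both unweighted and weighted average recall and mentions score-level fusion of two systems. The following is a minimal Python sketch of those ideas, not the authors' implementation. The recall functions follow the standard definitions (unweighted: plain mean of per-class recalls; weighted: mean weighted by class size). The `fuse_scores` helper, its z-normalization step, and the `weight_a` parameter are assumptions made for illustration only; the abstract does not specify how the scores are actually fused.

```python
from collections import Counter
from statistics import mean, pstdev


def per_class_recall(y_true, y_pred):
    """Return {label: recall} for every label that occurs in y_true."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls; every class counts equally."""
    recalls = per_class_recall(y_true, y_pred)
    return sum(recalls.values()) / len(recalls)


def weighted_average_recall(y_true, y_pred):
    """Per-class recalls weighted by class size (equals overall accuracy)."""
    totals = Counter(y_true)
    recalls = per_class_recall(y_true, y_pred)
    return sum(recalls[c] * totals[c] for c in totals) / len(y_true)


def z_normalize(scores):
    """Shift and scale scores to zero mean and unit (population) variance."""
    mu, sigma = mean(scores), pstdev(scores)
    if sigma == 0:
        return [0.0 for _ in scores]
    return [(s - mu) / sigma for s in scores]


def fuse_scores(scores_a, scores_b, weight_a=0.5):
    """Hypothetical score-level fusion: weighted sum of z-normalized scores.

    Assumption for illustration; the paper's actual fusion rule may differ.
    """
    norm_a, norm_b = z_normalize(scores_a), z_normalize(scores_b)
    return [weight_a * a + (1.0 - weight_a) * b for a, b in zip(norm_a, norm_b)]


# Unbalanced toy data: 2 positive samples, 8 negative samples.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
uar = unweighted_average_recall(y_true, y_pred)
war = weighted_average_recall(y_true, y_pred)
```

On this toy data the minority class has recall 0.5 and the majority class has recall 1.0, so the unweighted average is 0.75 while the weighted average is 0.9. The gap illustrates why the unweighted measure is the stricter one on unbalanced data: a classifier cannot score well on it by favoring the larger class.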
