External Attention LSTM Models for Cognitive Load Classification from Speech

Cognitive Load (CL) refers to the amount of mental demand that a given task imposes on an individual’s cognitive system and it can affect his/her productivity in very high load situations. In this paper, we propose an automatic system capable of classifying the CL level of a speaker by analyzing his/her voice. We focus on the use of Long Short-Term Memory (LSTM) networks with different weighted pooling strategies, such as mean-pooling, max-pooling, last-pooling and a logistic regression attention model. In addition, as an alternative to the previous methods, we propose a novel attention mechanism, called external attention model, that uses external cues, such as log-energy and fundamental frequency, for weighting the contribution of each LSTM temporal frame, overcoming the need of a large amount of data for training the attentional model. Experiments show that the LSTM-based system with external attention model outperforms significantly the baseline system based on Support Vector Machines (SVM) and the LSTM-based systems with the conventional weighed pooling schemes and with the logistic regression attention model.

[1]  Erik Marchi,et al.  Real-time robust recognition of speakers' emotions and characteristics on mobile platforms , 2015, 2015 International Conference on Affective Computing and Intelligent Interaction (ACII).

[2]  Fabien Ringeval,et al.  The INTERSPEECH 2014 computational paralinguistics challenge: cognitive & physical load , 2014, INTERSPEECH.

[3]  J. Stroop Studies of interference in serial verbal reactions. , 1992 .

[4]  Ian H. Witten,et al.  The WEKA data mining software: an update , 2009, SKDD.

[5]  Jimmy Ludeña-Choez,et al.  Feature extraction based on the high-pass filtering of audio signals for Acoustic Event Classification , 2015, Comput. Speech Lang..

[6]  François Chollet,et al.  Keras: The Python Deep Learning library , 2018 .

[7]  Seyedmahdad Mirsamadi,et al.  Automatic speech emotion recognition using recurrent neural networks with local attention , 2017, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[8]  Fuchun Peng,et al.  Grapheme-to-phoneme conversion using Long Short-Term Memory recurrent neural networks , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[9]  Jürgen Schmidhuber,et al.  Learning Precise Timing with LSTM Recurrent Networks , 2003, J. Mach. Learn. Res..

[10]  Colin Raffel,et al.  librosa: Audio and Music Signal Analysis in Python , 2015, SciPy.

[11]  Christian A. Müller,et al.  Recognizing Time Pressure and Cognitive Load on the Basis of Speech: An Experimental Study , 2001, User Modeling.

[12]  John H. L. Hansen,et al.  Analysis and detection of cognitive load and frustration in drivers' speech , 2010, INTERSPEECH.

[13]  Christopher D. Manning,et al.  Effective Approaches to Attention-based Neural Machine Translation , 2015, EMNLP.

[14]  Vidhyasaharan Sethu,et al.  The UNSW submission to INTERSPEECH 2014 compare cognitive load challenge , 2014, INTERSPEECH.

[15]  J. Gonzalez-Dominguez,et al.  Language Identification in Short Utterances Using Long Short-Term Memory (LSTM) Recurrent Neural Networks , 2016, PloS one.

[16]  D B Pisoni,et al.  Effects of cognitive workload on speech production: acoustic analyses and perceptual consequences. , 1993, The Journal of the Acoustical Society of America.

[17]  Jimmy Ludeña-Choez,et al.  Acoustic Event Classification using spectral band selection and Non-Negative Matrix Factorization-based features , 2016, Expert Syst. Appl..

[18]  Björn W. Schuller,et al.  Recent developments in openSMILE, the munich open-source multimedia feature extractor , 2013, ACM Multimedia.

[19]  Yoshua Bengio,et al.  Attention-Based Models for Speech Recognition , 2015, NIPS.

[20]  Eero Väyrynen,et al.  Effect of cognitive load on speech prosody in aviation: Evidence from military simulator flights. , 2011, Applied ergonomics.

[21]  Che-Wei Huang,et al.  Deep convolutional recurrent neural network with attention mechanism for robust speech emotion recognition , 2017, 2017 IEEE International Conference on Multimedia and Expo (ICME).

[22]  Yanmin Qian,et al.  Very Deep Convolutional Neural Networks for Noise Robust Speech Recognition , 2016, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[23]  Che-Wei Huang,et al.  Attention Assisted Discovery of Sub-Utterance Structure in Speech Emotion Recognition , 2016, INTERSPEECH.

[24]  Shrikanth S. Narayanan,et al.  Classification of cognitive load from speech using an i-vector framework , 2014, INTERSPEECH.