A Novel Fusion of Attention and Sequence to Sequence Autoencoders to Predict Sleepiness From Speech

Motivated by the attention mechanism of the human visual system and by recent developments in machine translation, we introduce attention-based and recurrent sequence to sequence autoencoders for fully unsupervised representation learning from audio files. We test the efficacy of this approach on the task of speech-based sleepiness recognition. We evaluate the representations learnt by each autoencoder and then apply early fusion to determine whether they are complementary. Our framework has four steps. First, we extract Mel-spectrograms from the raw audio files. Second, we train recurrent autoencoders on these spectrograms, which are treated as time-dependent frequency vectors. Third, we extract the activations of specific fully connected layers of the autoencoders; these activations represent the learnt features of the spectrogram for the corresponding audio instance. Finally, we train support vector regressors on these representations to obtain the predictions. On the development partition of the data, the attention autoencoder, the non-attention autoencoder, and the fusion of both autoencoders' representations achieve Spearman's correlation coefficients of .324, .283, and .320, respectively, with the targets on the Karolinska Sleepiness Scale. In the same order, we achieve Spearman's correlation coefficients of .311, .359, and .367 on the test data, indicating the suitability of our proposed fusion strategy.
