Hybrid Handcrafted and Learnable Audio Representation for Analysis of Speech Under Cognitive and Physical Load

As a neurophysiological response to threat or adverse conditions, stress can affect cognition, emotion and behaviour, with potentially detrimental effects on health in the case of sustained exposure. Since the affective content of speech is inherently modulated by an individual’s physical and mental state, a substantial body of research has been devoted to the study of paralinguistic correlates of stress-inducing task load. Historically, voice stress analysis has been conducted using conventional digital signal processing (DSP) techniques. Despite the development of modern methods based on deep neural networks (DNNs), accurately detecting stress in speech remains difficult due to the wide variety of stressors and considerable variability in individual stress perception. To that end, we introduce a set of five datasets for task load detection in speech. The voice recordings were collected while either cognitive or physical stress was induced in a cohort of volunteers, totaling more than a hundred speakers across the datasets. We used the datasets to design and evaluate a novel self-supervised audio representation that leverages the effectiveness of handcrafted (DSP-based) features and the complexity of data-driven DNN representations. Notably, the proposed approach outperformed both extensive handcrafted feature sets and recent DNN-based audio representation learning approaches.
