An Auditory Saliency Pooling-Based LSTM Model for Speech Intelligibility Classification

Speech intelligibility is a crucial element of oral communication that can be influenced by multiple factors, such as noise, channel characteristics, or speech disorders. In this paper, we address the task of speech intelligibility classification (SIC) in the last of these circumstances, speech disorders. Taking as a starting point our previous work, an SIC system based on an attentional long short-term memory (LSTM) network, we deal with the problem of inadequate learning of the attention weights due to training data scarcity. To overcome this issue, the main contribution of this paper is a novel type of weighted pooling (WP) mechanism, called saliency pooling, in which the WP weights are not learned automatically during network training but are obtained from an external source of information, Kalinli’s auditory saliency model. In this way, we intend to take advantage of the apparent symmetry between the human auditory attention mechanism and the attentional models integrated into deep learning networks. The developed systems are assessed on the UA-Speech dataset, which comprises speech uttered by subjects with several dysarthria levels. Results show that all the systems with saliency pooling significantly outperform a reference support vector machine (SVM)-based system and LSTM-based systems with mean pooling and attention pooling, suggesting that Kalinli’s saliency can be successfully incorporated into the LSTM architecture as an external cue for estimating the speech intelligibility level.
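
The difference between mean pooling and a weighted pooling driven by external weights can be illustrated with a minimal NumPy sketch. It is not the implementation used in this work: the function names `weighted_pool` and `mean_pool`, the toy values, and the sum-to-one normalization of the weights are assumptions made for illustration, and the saliency scores are hand-written placeholders rather than the output of Kalinli’s model.

```python
import numpy as np


def weighted_pool(hidden, weights):
    """Pool a (T, D) sequence of frame-level vectors into a single (D,) vector.

    `weights` holds one non-negative score per frame. Normalizing the scores
    to sum to one is an assumption of this sketch.
    """
    hidden = np.asarray(hidden, dtype=float)
    weights = np.asarray(weights, dtype=float)
    if hidden.ndim != 2 or weights.shape != (hidden.shape[0],):
        raise ValueError("expected hidden of shape (T, D) and weights of shape (T,)")
    if np.any(weights < 0):
        raise ValueError("weights must be non-negative")
    total = weights.sum()
    if total <= 0:
        # All-zero scores: fall back to uniform weights (mean pooling).
        weights = np.full(weights.shape, 1.0 / weights.size)
    else:
        weights = weights / total
    # Weighted sum over the time axis.
    return weights @ hidden


def mean_pool(hidden):
    """Mean pooling is the special case with uniform weights."""
    hidden = np.asarray(hidden, dtype=float)
    return weighted_pool(hidden, np.ones(hidden.shape[0]))


# Toy stand-in for LSTM outputs: T = 3 frames, D = 2 dimensions.
hidden = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [1.0, 1.0]])

# Placeholder per-frame saliency scores (hypothetical values).
saliency = np.array([0.0, 1.0, 3.0])

pooled_saliency = weighted_pool(hidden, saliency)
pooled_mean = mean_pool(hidden)
```

In this sketch the pooling weights are computed outside the network and passed in as data, so the pooling step itself adds no trainable parameters, whereas attention pooling would learn the weights from the training set.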
