An attention Long Short-Term Memory based system for automatic classification of speech intelligibility

Abstract: Speech intelligibility can be degraded by multiple factors, such as noisy environments, technical difficulties, or biological conditions. This work focuses on the development of an automatic, non-intrusive system for predicting the speech intelligibility level in the last of these cases. The main contribution of our research on this topic is the use of Long Short-Term Memory (LSTM) networks with log-mel spectrograms as input features for this purpose. In addition, this LSTM-based system is further enhanced by incorporating a simple attention mechanism that determines which frames are most relevant to the task. The proposed models are evaluated on the UA-Speech database, which contains dysarthric speech with different degrees of severity. Results show that the attention LSTM architecture outperforms both a reference Support Vector Machine (SVM)-based system with hand-crafted features and an LSTM-based system with mean pooling.
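The difference between mean pooling and attention pooling over per-frame LSTM outputs can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the form of the attention mechanism, so a dot-product score against a single vector `u` followed by a softmax is assumed here. The names `hidden`, `u`, `mean_pooling`, `attention_weights` and `attention_pooling` are hypothetical, and the toy vectors stand in for LSTM outputs that would in practice be computed from log-mel spectrogram frames.

```python
import math

def mean_pooling(frames):
    """Average the per-frame vectors, giving every frame the same weight."""
    n = len(frames)
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / n for d in range(dim)]

def attention_weights(frames, u):
    """Softmax over per-frame scores; each score is the dot product with u.

    Assumption: u plays the role of a learned attention vector. It is fixed
    by hand here purely for illustration.
    """
    scores = [sum(x * w for x, w in zip(f, u)) for f in frames]
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pooling(frames, u):
    """Weighted average of the frames using the attention weights."""
    weights = attention_weights(frames, u)
    dim = len(frames[0])
    return [sum(a * f[d] for a, f in zip(weights, frames)) for d in range(dim)]

# Toy stand-in for three frames of 2-dimensional LSTM outputs.
hidden = [[0.1, 0.0], [2.0, 1.0], [0.2, 0.1]]
u = [1.0, 1.0]  # hypothetical attention vector (would be learned in training)

weights = attention_weights(hidden, u)
pooled_mean = mean_pooling(hidden)
pooled_att = attention_pooling(hidden, u)
```

In this toy case the second frame receives the largest weight, so the attention-pooled vector lies closer to that frame than the plain average does. Either pooled vector would then be passed to a classifier that predicts the intelligibility level.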
