On combining acoustic and modulation spectrograms in an attention LSTM-based system for speech intelligibility level classification

Abstract

Speech intelligibility can be affected by multiple factors, such as noisy environments, channel distortions or physiological issues. In this work, we deal with the problem of automatically predicting the speech intelligibility level in the latter case. Starting from our previous work, a non-intrusive system based on LSTM networks with an attention mechanism designed for this task, we present two main contributions. First, we propose the use of per-frame modulation spectrograms as input features, instead of compact representations derived from them that discard important temporal information. Second, we explore two strategies for combining per-frame acoustic log-mel and modulation spectrograms within the LSTM framework: combination at the decision level (late fusion) and at the utterance level (Weighted-Pooling, or WP, fusion). The proposed models are evaluated on the UA-Speech database, which contains dysarthric speech with different degrees of severity. On the one hand, results show that attentional LSTM networks are able to adequately model the modulation spectrogram sequences, producing classification rates similar to those obtained with log-mel spectrograms. On the other hand, both combination strategies, late and WP fusion, outperform the single-feature systems, suggesting that per-frame log-mel and modulation spectrograms carry complementary information for the task of speech intelligibility prediction that can be effectively exploited by LSTM-based architectures; the system combining the WP fusion strategy with Attention-Pooling achieves the best results.
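The exact feature-extraction settings are not reproduced here, but the general recipe for modulation spectrograms (a second spectral analysis applied to the temporal trajectories of the acoustic band energies) can be sketched as follows. This is a minimal illustration under assumed parameters: the window and hop sizes, the number of mel bands, and the helper name `modulation_spectrogram` are illustrative choices, not the authors' configuration.

```python
# Minimal sketch (not the authors' exact pipeline): per-frame modulation
# spectrograms computed as a second spectral analysis over the temporal
# trajectories of log-mel band energies. All parameter values are illustrative.
import numpy as np
import librosa


def modulation_spectrogram(wav_path, sr=16000, n_mels=40,
                           mod_win=32, mod_hop=8):
    """Return per-frame features of shape (n_mod_frames, n_mels * n_mod_bins)."""
    y, sr = librosa.load(wav_path, sr=sr)

    # Stage 1: acoustic analysis -> log-mel spectrogram of shape (n_mels, T).
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=512,
                                         hop_length=160, n_mels=n_mels)
    logmel = np.log(mel + 1e-10)

    # Stage 2: modulation analysis -> windowed FFT along time in each mel
    # band, over overlapping segments of `mod_win` acoustic frames.
    frames = []
    for start in range(0, logmel.shape[1] - mod_win + 1, mod_hop):
        seg = logmel[:, start:start + mod_win] * np.hanning(mod_win)
        mod = np.abs(np.fft.rfft(seg, axis=1))   # (n_mels, n_mod_bins)
        frames.append(mod.flatten())
    return np.stack(frames)                      # one feature vector per frame
```

Keeping one such vector per modulation frame, rather than averaging them over the utterance, is what preserves the temporal information that compact representations discard.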
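Likewise, a minimal sketch of the two combination strategies may help fix ideas. The PyTorch code below (an assumed framework; the abstract does not specify one) shows an attention-pooling LSTM stream per feature type, utterance-level (WP) fusion by concatenating the pooled embeddings, and decision-level (late) fusion by averaging the per-stream posteriors. All class names, hidden sizes and the equal-weight average are illustrative assumptions.

```python
# Minimal PyTorch sketch (illustrative, not the authors' exact model) of an
# attention LSTM with the two fusion strategies described in the abstract.
import torch
import torch.nn as nn


class AttentionPooling(nn.Module):
    """Weighted pooling over time: softmax attention weight per frame."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, h):                        # h: (batch, time, dim)
        w = torch.softmax(self.score(h), dim=1)  # (batch, time, 1)
        return (w * h).sum(dim=1)                # (batch, dim)


class AttnLSTMStream(nn.Module):
    """One feature stream (log-mel or modulation spectrogram)."""
    def __init__(self, in_dim, hid=128):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hid, batch_first=True)
        self.pool = AttentionPooling(hid)

    def forward(self, x):                        # x: (batch, time, in_dim)
        h, _ = self.lstm(x)
        return self.pool(h)                      # utterance-level embedding


class WPFusionClassifier(nn.Module):
    """Utterance-level (WP) fusion: concatenate the pooled embeddings."""
    def __init__(self, mel_dim, mod_dim, n_classes, hid=128):
        super().__init__()
        self.mel = AttnLSTMStream(mel_dim, hid)
        self.mod = AttnLSTMStream(mod_dim, hid)
        self.out = nn.Linear(2 * hid, n_classes)

    def forward(self, x_mel, x_mod):
        z = torch.cat([self.mel(x_mel), self.mod(x_mod)], dim=-1)
        return self.out(z)                       # class logits


def late_fusion(logits_mel, logits_mod):
    """Decision-level (late) fusion: average the per-stream posteriors."""
    p = torch.softmax(logits_mel, dim=-1) + torch.softmax(logits_mod, dim=-1)
    return 0.5 * p
```

The design difference is where the streams meet: WP fusion lets the attention weights shape a joint utterance representation before the classifier, whereas late fusion keeps the two single-feature classifiers independent and only merges their decisions.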
