Identifying surgical-mask speech using deep neural networks on low-level aggregation

Mask-Speech Identification (MSI) is the task of judging whether a chunk of speech was produced while the speaker was wearing a facial mask. Most related research investigates how wearing a mask influences speech, and its findings apply to speech analysis only in certain cases. To generalise research on MSI, we propose an approach that applies deep networks to Low-Level Aggregation (LLA) of speech chunks. The approach benefits from data augmentation on Low-Level Descriptors (LLDs): it supplies many more training samples, which suits deep models better and removes the need for pre-trained knowledge. Experiments are performed on the Mask Augsburg Speech Corpus (MASC) used in the INTERSPEECH 2020 ComParE challenge and examine the influence of different strategies. The experimental results show that the proposed approach is effective compared with the ComParE challenge baselines.
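
The abstract does not spell out how augmentation on LLDs yields more training samples, so the following is only an illustrative sketch of one possible reading: each chunk's frame-level LLD matrix is cut into overlapping fixed-length segments that inherit the chunk label, and segment-level scores are averaged back into a chunk-level score. The function names, the window and hop sizes, the padding rule, and the mean aggregation are all assumptions made for illustration, not the authors' configuration.

```python
import numpy as np


def segment_llds(llds, win, hop):
    """Cut a (frames, features) LLD matrix into overlapping segments.

    Hypothetical helper: `win` and `hop` are illustrative parameters,
    not values taken from the paper.
    """
    n_frames = llds.shape[0]
    if n_frames < win:
        # Assumption: pad short chunks by repeating the last frame.
        pad = np.repeat(llds[-1:], win - n_frames, axis=0)
        llds = np.concatenate([llds, pad], axis=0)
        n_frames = win
    starts = range(0, n_frames - win + 1, hop)
    # Result shape: (segments, win, features)
    return np.stack([llds[s:s + win] for s in starts])


def aggregate_chunk_score(segment_probs):
    """Average segment-level mask probabilities into one chunk-level score.

    Assumption: mean pooling; other aggregation rules are equally plausible.
    """
    return float(np.mean(segment_probs))


# Toy data: one chunk of 100 frames with 3 descriptors per frame.
chunk = np.arange(300, dtype=float).reshape(100, 3)
segments = segment_llds(chunk, win=20, hop=10)
print(segments.shape)
```

Under this reading, one labelled chunk contributes several training samples, which is what would let a deep model be trained from scratch on a comparatively small corpus.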
