Paper information - Are you wearing a mask? Improving mask detection from speech using augmentation by cycle-consistent GANs

The task of detecting whether a person wears a face mask from speech is useful in modelling speech in forensic investigations, communication between surgeons or people protecting themselves against infectious diseases such as COVID-19. In this paper, we propose a novel data augmentation approach for mask detection from speech. Our approach is based on (i) training Generative Adversarial Networks (GANs) with cycle-consistency loss to translate unpaired utterances between two classes (with mask and without mask), and on (ii) generating new training utterances using the cycle-consistent GANs, assigning opposite labels to each translated utterance. Original and translated utterances are converted into spectrograms which are provided as input to a set of ResNet neural networks with various depths. The networks are combined into an ensemble through a Support Vector Machines (SVM) classifier. With this system, we participated in the Mask Sub-Challenge (MSC) of the INTERSPEECH 2020 Computational Paralinguistics Challenge, surpassing the baseline proposed by the organizers by 2.8%. Our data augmentation technique provided a performance boost of 0.9% on the private test set. Furthermore, we show that our data augmentation approach yields better results than other baseline and state-of-the-art augmentation methods.
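
Step (ii) of the abstract, generating new training utterances and assigning each translated utterance the opposite label, can be illustrated with a minimal sketch. It is not the authors' implementation: the label encoding (1 = with mask, 0 = without mask), the function `augment_with_translation`, and the `scale` helper that stands in for the two trained cycle-consistent generators are all hypothetical names introduced here for illustration, and spectrograms are represented as plain nested lists.

```python
from typing import Callable, List, Sequence, Tuple

Spectrogram = List[List[float]]
Translator = Callable[[Spectrogram], Spectrogram]

# Assumed label encoding; the source does not specify one.
WITH_MASK, WITHOUT_MASK = 1, 0


def augment_with_translation(
    spectrograms: Sequence[Spectrogram],
    labels: Sequence[int],
    to_mask: Translator,
    to_no_mask: Translator,
) -> Tuple[List[Spectrogram], List[int]]:
    """Return the originals followed by translated copies with flipped labels."""
    aug_x = list(spectrograms)
    aug_y = list(labels)
    for spec, label in zip(spectrograms, labels):
        if label == WITH_MASK:
            # A masked utterance is translated to the unmasked class.
            aug_x.append(to_no_mask(spec))
            aug_y.append(WITHOUT_MASK)
        else:
            # An unmasked utterance is translated to the masked class.
            aug_x.append(to_mask(spec))
            aug_y.append(WITH_MASK)
    return aug_x, aug_y


def scale(factor: float) -> Translator:
    """Toy placeholder for a trained generator: it only rescales values."""
    def translate(spec: Spectrogram) -> Spectrogram:
        return [[value * factor for value in row] for row in spec]
    return translate


specs = [[[1.0, 2.0]], [[3.0, 4.0]]]
labels = [WITH_MASK, WITHOUT_MASK]
aug_x, aug_y = augment_with_translation(specs, labels, scale(0.5), scale(2.0))
print(len(aug_x), aug_y)  # 4 [1, 0, 0, 1]
```

In a real pipeline the two placeholder translators would be replaced by the generators trained with the cycle-consistency loss, and the augmented set would then be passed on as classifier training input.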

Radu Tudor Ionescu | Nicolae-Cătălin Ristea
