The EIHW-GLAM Deep Attentive Multi-model Fusion System for Cough-based COVID-19 Recognition in the DiCOVA 2021 Challenge

To detect COVID-19 automatically from cough sounds, we propose a deep attentive multi-model fusion system and evaluate it on the Track-1 dataset of the DiCOVA 2021 challenge. Three kinds of representations are extracted: handcrafted features, image-from-audio-based deep representations, and audio-based deep representations. The best-performing model for each of the three feature types is then fused with the others at both the feature level and the decision level. The experimental results show that the proposed attention-based fusion at the feature level achieves the best performance on the test set (AUC: 77.96%), an 8.05% improvement over the official baseline.
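The two fusion levels and the AUC metric named above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the fixed attention scores (which a real system would learn), the assumption that all three embeddings share one dimension, and the use of a plain mean for decision-level fusion are all hypothetical choices made for illustration.

```python
import math

def softmax(scores):
    """Turn raw attention scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_fuse(embeddings, scores):
    """Feature-level fusion (illustrative): attention-weighted sum of
    equally sized embeddings, one embedding per model."""
    weights = softmax(scores)
    dim = len(embeddings[0])
    fused = [sum(w * emb[i] for w, emb in zip(weights, embeddings))
             for i in range(dim)]
    return fused, weights

def decision_fuse(probabilities):
    """Decision-level fusion (illustrative): mean of the per-model
    positive-class probabilities."""
    return sum(probabilities) / len(probabilities)

def auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked
    correctly, counting ties as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

if __name__ == "__main__":
    # Hypothetical embeddings from the three feature types for one recording.
    handcrafted = [0.2, 0.7]
    image_based = [0.9, 0.1]
    audio_based = [0.4, 0.4]
    fused, weights = attention_fuse(
        [handcrafted, image_based, audio_based], scores=[0.1, 0.5, 0.3])
    print("attention weights:", weights)
    print("fused embedding:", fused)
    print("decision-level score:", decision_fuse([0.3, 0.8, 0.6]))
```

In feature-level fusion a single classifier would be trained on the fused embedding, whereas decision-level fusion combines the outputs of classifiers that were trained separately.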
