The COUGHVID crowdsourcing dataset, a corpus for the study of large-scale cough analysis algorithms

Cough audio signal classification has been successfully used to diagnose a variety of respiratory conditions, and there has been significant interest in leveraging Machine Learning (ML) to provide widespread COVID-19 screening. The COUGHVID dataset provides over 25,000 crowdsourced cough recordings representing a wide range of participant ages, genders, geographic locations, and COVID-19 statuses. First, we contribute our open-sourced cough detection algorithm to the research community to assist in data robustness assessment. Second, four experienced physicians labeled more than 2,800 recordings to diagnose medical abnormalities present in the coughs, thereby contributing one of the largest expert-labeled cough datasets in existence that can be used for a plethora of cough audio classification tasks. Finally, we ensured that coughs labeled as symptomatic and COVID-19 originate from countries with high infection rates. As a result, the COUGHVID dataset contributes a wealth of cough recordings for training ML models to address the world’s most urgent health crises.

Measurement(s): Cough
Technology Type(s): Microphone Device
Factor Type(s): COVID-19 status • location • age • gender • respiratory condition
Sample Characteristic - Organism: Homo sapiens

Machine-accessible metadata file describing the reported data: https://doi.org/10.6084/m9.figshare.14377019
