The COUGHVID crowdsourcing dataset, a corpus for the study of large-scale cough analysis algorithms

Cough audio signal classification has been used successfully to diagnose a variety of respiratory conditions, and there is significant interest in leveraging machine learning (ML) to provide widespread COVID-19 screening. However, there is currently no validated database of cough sounds with which to train such ML models. The COUGHVID dataset provides over 20,000 crowdsourced cough recordings representing a wide range of subject ages, genders, geographic locations, and COVID-19 statuses. First, we filtered the dataset using our open-sourced cough detection algorithm. Second, experienced pulmonologists labeled more than 2,000 recordings to diagnose medical abnormalities present in the coughs, yielding one of the largest expert-labeled cough datasets in existence, suitable for a wide range of cough audio classification tasks. Finally, we ensured that coughs labeled as symptomatic and COVID-19 originate from countries with high infection rates, and that their expert labels are consistent. As a result, the COUGHVID dataset contributes a wealth of cough recordings for training ML models to address the world's most urgent health crises.
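
The filtering step can be illustrated with a minimal sketch. It assumes each recording carries a cough-detection score between 0 and 1; the `Recording` fields, the `filter_coughs` helper, and the 0.8 threshold are hypothetical and chosen for illustration only, not taken from the dataset or its detection algorithm.

```python
from dataclasses import dataclass


@dataclass
class Recording:
    # All field names here are hypothetical, for illustration only.
    uuid: str           # identifier of the recording
    cough_score: float  # assumed detector output in [0, 1]
    status: str         # self-reported status, e.g. "healthy", "symptomatic", "COVID-19"


def filter_coughs(recordings, threshold=0.8):
    """Keep recordings whose detector score is at least `threshold`.

    The default threshold is an assumption, not a value from the paper.
    """
    return [r for r in recordings if r.cough_score >= threshold]


sample = [
    Recording("a", 0.97, "healthy"),
    Recording("b", 0.12, "COVID-19"),
    Recording("c", 0.85, "symptomatic"),
]

kept = filter_coughs(sample)
print([r.uuid for r in kept])
```

In practice the threshold trades recall against precision: a higher value discards more non-cough audio but also drops some genuine coughs.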
