How do we share data in COVID-19 research? A systematic review of COVID-19 datasets in PubMed Central Articles

Abstract

Objective: This study reviews novel coronavirus disease (COVID-19) datasets extracted from PubMed Central articles and provides a quantitative analysis of dataset contents, accessibility and citations.

Methods: We downloaded COVID-19-related full-text articles published until 31 May 2020 from PubMed Central. Dataset URL links mentioned in the full-text articles were extracted, and each dataset was manually reviewed to provide information on 10 variables: (1) type of the dataset, (2) geographic region where the data were collected, (3) whether the dataset was immediately downloadable, (4) format of the dataset files, (5) where the dataset was hosted, (6) whether the dataset was updated regularly, (7) the type of license used, (8) whether the metadata were explicitly provided, (9) whether there was a PubMed Central paper describing the dataset and (10) the number of times the dataset was cited by PubMed Central articles. Descriptive statistics for these 10 variables were reported for all extracted datasets.

Results: We found that 28.5% of 12 324 COVID-19 full-text articles in PubMed Central provided at least one dataset link. In total, these articles mentioned 128 unique dataset links. Epidemiological datasets accounted for the largest portion (53.9%) of the dataset collection, and most datasets (84.4%) were available for immediate download. GitHub was the most popular repository for hosting COVID-19 datasets. CSV, XLSX and JSON were the most popular data formats. Citation patterns varied from one dataset to another.

Conclusion: PubMed Central articles are an important source of COVID-19 datasets, but there is significant heterogeneity in the way these datasets are mentioned, shared, updated and cited.
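The link-extraction step described in the Methods can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the URL pattern, the function names and the input shape (a mapping from article ID to full text) are all assumptions, and the sketch only gathers candidate links. Deciding which links actually point to datasets was done by manual review in the study.

```python
import re
from collections import Counter

# Assumed pattern: the abstract does not say how links were matched.
URL_PATTERN = re.compile(r"https?://[^\s<>)\]]+")


def extract_links(text):
    """Return the set of distinct URLs found in one article's full text."""
    # Strip trailing sentence punctuation that the pattern may capture.
    return {match.rstrip(".,;:") for match in URL_PATTERN.findall(text)}


def summarize(articles):
    """Summarize link usage across articles.

    articles: dict mapping an article ID to its full text (hypothetical shape).
    Returns (share of articles with at least one link,
             Counter mapping each link to the number of articles mentioning it).
    """
    mentions = Counter()
    articles_with_link = 0
    for text in articles.values():
        links = extract_links(text)
        if links:
            articles_with_link += 1
        mentions.update(links)  # each article counts once per link
    share = articles_with_link / len(articles) if articles else 0.0
    return share, mentions
```

Counting each link at most once per article means the counter gives the number of citing articles per link, which corresponds to variable (10), while `len(mentions)` gives the number of unique links before any manual filtering.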
