AI- and HPC-enabled Lead Generation for SARS-CoV-2: Models and Processes to Extract Druglike Molecules Contained in Natural Language Text

Researchers worldwide are seeking to repurpose existing drugs or discover new ones to counter the disease caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). A promising source of candidates for such studies is the set of molecules that have been reported in the scientific literature to be drug-like in the context of coronavirus research. We report here on a project that leverages both human and artificial intelligence to detect references to drug-like molecules in free text. We engage non-expert humans to create a corpus of labeled text, use this labeled corpus to train a named entity recognition model, and employ the trained model to extract 10 912 drug-like molecules from the COVID-19 Open Research Dataset Challenge (CORD-19) corpus of 198 875 papers. Performance analyses show that our automated extraction model can perform on par with non-expert humans.
