An Open Natural Language Processing Development Framework for EHR-based Clinical Research: A case demonstration using the National COVID Cohort Collaborative (N3C)

Despite recent advances in clinical natural language processing (NLP), the clinical and translational research community remains reluctant to adopt NLP models because of their limited transparency, interpretability, and usability. Building on our previous work, in this study we proposed an open NLP development framework and evaluated it by implementing NLP algorithms for the National COVID Cohort Collaborative (N3C). Motivated by the interest in information extraction from COVID-19-related clinical notes, our work includes 1) an open data annotation process using COVID-19 signs and symptoms as the use case, 2) a community-driven ruleset composing platform, and 3) a synthetic text data generation workflow that produces texts for information extraction tasks without involving human subjects. The corpora derived from the texts of multiple institutions, together with gold standard annotations, were tested against a single institution's ruleset, yielding F1 scores of 0.876, 0.706, and 0.694, respectively. As a consortium effort of the N3C NLP subgroup, the study demonstrates the feasibility of creating a federated NLP algorithm development and benchmarking platform to enhance multi-institution clinical NLP studies.
