Criteria2Query: a natural language interface to clinical databases for cohort definition

Abstract

Objective: Cohort definition is a bottleneck for conducting clinical research and depends on subjective decisions by domain experts. Data-driven cohort definition is appealing but requires substantial knowledge of terminologies and clinical data models. Criteria2Query is a natural language interface that facilitates human-computer collaboration for cohort definition and execution using clinical databases.

Materials and Methods: Criteria2Query uses a hybrid information extraction pipeline that combines machine learning and rule-based methods to systematically parse eligibility criteria text. It transforms the text first into a structured criteria representation and then into shareable, executable clinical data queries, expressed as SQL queries conforming to the OMOP Common Data Model. Users can interactively review, refine, and execute queries in the ATLAS web application. To test effectiveness, we evaluated 125 criteria across different disease domains from ClinicalTrials.gov and 52 user-entered criteria. We measured F1 score and accuracy against 2 domain experts and calculated the average computation time for fully automated query formulation. We also conducted an anonymous survey to evaluate usability.

Results: Criteria2Query achieved F1 scores of 0.795 for entity recognition and 0.805 for relation extraction. Accuracies for negation detection, logic detection, entity normalization, and attribute normalization were 0.984, 0.864, 0.514, and 0.793, respectively. Fully automatic query formulation took 1.22 seconds per criterion. More than 80% of surveyed users (11+ of 13) would use Criteria2Query in their future cohort definition tasks.

Conclusions: We contribute a novel natural language interface to clinical databases. It is open source and supports both fully automated and interactive modes, allowing researchers to perform data-driven cohort definition autonomously with minimal human effort. We demonstrate its promising user friendliness and usability.
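To make the second transformation step concrete, the following is a minimal sketch of how one structured criterion might be rendered as a SQL query over OMOP Common Data Model tables. It is not the Criteria2Query implementation: the `Criterion` class, the `DOMAIN_TABLES` mapping, and the `criterion_to_sql` function are hypothetical names introduced for illustration, and the concept ID in the usage note is a placeholder. Only the table and column names (`person`, `condition_occurrence`, `drug_exposure`, `measurement`, and their `person_id` and `*_concept_id` columns) are taken from the OMOP CDM.

```python
from dataclasses import dataclass

# Hypothetical mapping from a criterion's domain to the OMOP CDM table
# and concept column that record events of that domain.
DOMAIN_TABLES = {
    "Condition": ("condition_occurrence", "condition_concept_id"),
    "Drug": ("drug_exposure", "drug_concept_id"),
    "Measurement": ("measurement", "measurement_concept_id"),
}


@dataclass
class Criterion:
    """One structured criterion: a normalized concept plus a negation flag."""
    domain: str        # key into DOMAIN_TABLES
    concept_id: int    # normalized concept identifier (placeholder values here)
    negated: bool = False


def criterion_to_sql(criterion: Criterion) -> str:
    """Render a single criterion as a SQL query returning matching person_ids."""
    table, column = DOMAIN_TABLES[criterion.domain]
    # A negated criterion selects persons with no matching event.
    operator = "NOT IN" if criterion.negated else "IN"
    return (
        f"SELECT person_id FROM person WHERE person_id {operator} "
        f"(SELECT person_id FROM {table} "
        f"WHERE {column} = {int(criterion.concept_id)})"
    )
```

For example, `criterion_to_sql(Criterion("Condition", 111, negated=True))` would produce a query for persons with no `condition_occurrence` row carrying the placeholder concept 111. A real system would also need to handle temporal constraints, value ranges, concept hierarchies, and the logical combination of several criteria, none of which this sketch attempts.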
