Hybrid Approaches for our Participation to the n2c2 Challenge on Cohort Selection for Clinical Trials

Objective: Natural language processing can help minimize human intervention in identifying patients who meet eligibility criteria for clinical trials, but a general and systematic approach that is useful for researchers remains a distant goal. We describe two methods that take a step in this direction and present the results they obtained in the n2c2 challenge on cohort selection for clinical trials.

Materials and Methods: The first method is weakly supervised. It uses an unlabeled corpus (MIMIC) to build a silver standard: a small, very precise set of rules, produced semi-automatically, detects some examples of positive and negative patients. This silver standard is then used to train a traditional supervised model. The second method is a terminology-based approach in which a medical expert selects the appropriate concepts, and a procedure is defined to search for the terms and check the structural or temporal constraints.

Results: On the n2c2 dataset, which contains annotated data for 13 selection criteria on 288 patients, we obtained an overall F1-measure of 0.8969. This is the third-best result out of 45 participating teams, with no statistically significant difference from the best-ranked team.

Discussion: Both approaches obtained very encouraging results, and they apply to different types of criteria. The weakly supervised method requires explicit descriptions of positive and negative examples in some reports. The terminology-based method is very efficient when medical concepts carry most of the relevant information.

Conclusion: It is unlikely that much more annotated data will soon be available for the task of identifying a wide range of patient phenotypes. Future work must focus on weakly supervised or unsupervised learning methods that use both structured and unstructured data and rely on a comprehensive representation of the patients.
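The weakly supervised idea in Materials and Methods can be illustrated with a minimal sketch. It is not the authors' implementation: the regular-expression rules, the example notes, the criterion (alcohol abuse), and the small naive Bayes classifier are all assumptions chosen for illustration, and the abstract does not specify which rules or which supervised model were used. The sketch only shows the general pattern: high-precision rules label a subset of unlabeled notes, abstain on the rest, and the resulting silver standard trains a supervised classifier.

```python
import math
import re
from collections import Counter

# Hypothetical high-precision rules for one illustrative criterion.
POSITIVE_RULES = [re.compile(r"\bhistory of alcohol abuse\b", re.I)]
NEGATIVE_RULES = [re.compile(r"\b(denies|no) alcohol\b", re.I)]


def silver_label(note):
    """Return 1 (positive), 0 (negative), or None when the rules abstain."""
    pos = any(rule.search(note) for rule in POSITIVE_RULES)
    neg = any(rule.search(note) for rule in NEGATIVE_RULES)
    if pos == neg:  # no rule fired, or the rules conflict
        return None
    return 1 if pos else 0


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def train_naive_bayes(labeled):
    """Train a bag-of-words naive Bayes model on (note, label) pairs."""
    word_counts = {0: Counter(), 1: Counter()}
    doc_counts = Counter()
    for note, label in labeled:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(note))
    vocab = set(word_counts[0]) | set(word_counts[1])
    n_docs = sum(doc_counts.values())

    def predict(note):
        scores = {}
        for label in (0, 1):
            total = sum(word_counts[label].values())
            # Add-one smoothing on both the class prior and word likelihoods.
            score = math.log((doc_counts[label] + 1) / (n_docs + 2))
            for tok in tokenize(note):
                if tok in vocab:  # ignore words never seen in training
                    count = word_counts[label][tok]
                    score += math.log((count + 1) / (total + len(vocab)))
            scores[label] = score
        return max(scores, key=scores.get)

    return predict


# Made-up stand-ins for an unlabeled corpus such as MIMIC.
unlabeled_notes = [
    "Patient has a history of alcohol abuse with heavy drinking and cirrhosis.",
    "History of alcohol abuse, drinking daily, admitted for withdrawal.",
    "Denies alcohol use. Non-smoker, exercises regularly.",
    "No alcohol or tobacco. Exercises regularly and follows diet.",
    "Follow-up visit for hypertension.",  # no rule fires: left out
]

# Build the silver standard, keeping only notes where the rules decided.
silver = [(note, silver_label(note)) for note in unlabeled_notes]
silver = [(note, label) for note, label in silver if label is not None]

predict = train_naive_bayes(silver)
```

The trained model can then score notes that no rule matches, for example `predict("Heavy drinking, admitted for withdrawal.")`, which is the point of the silver standard: the classifier generalizes from the words that co-occur with the rule matches.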
