Hybrid Approaches for our Participation to the n2c2 Challenge on Cohort Selection for Clinical Trials

Objective: Natural language processing can help minimize human intervention in identifying patients who meet eligibility criteria for clinical trials, but a general and systematic approach that is useful to researchers remains a long way off. We describe two methods that take a step in this direction and present the results they obtained in the n2c2 challenge on cohort selection for clinical trials.

Materials and Methods: The first method is weakly supervised: it uses an unlabeled corpus (MIMIC) to build a silver standard by semi-automatically producing a small set of very precise rules that detect some samples of positive and negative patients. This silver standard is then used to train a traditional supervised model. The second method is terminology-based: a medical expert selects the appropriate concepts, and a procedure is defined to search for the terms and check structural or temporal constraints.

Results: On the n2c2 dataset, which contains annotations for 13 selection criteria on 288 patients, we obtained an overall F1-measure of 0.8969, the third-best result out of 45 participating teams, with no statistically significant difference from the best-ranked team.

Discussion: Both approaches obtained very encouraging results and apply to different types of criteria. The weakly supervised method requires explicit descriptions of positive and negative examples in some reports. The terminology-based method is very efficient when medical concepts carry most of the relevant information.

Conclusion: It is unlikely that much more annotated data will soon be available for the task of identifying a wide range of patient phenotypes. One must focus on weakly supervised or unsupervised learning methods that use both structured and unstructured data and rely on a comprehensive representation of the patients.
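The weakly supervised method described in the abstract (precise rules label part of an unlabeled corpus, and the resulting silver standard trains a supervised model) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the regular-expression rules, the five toy sentences standing in for MIMIC reports, and the choice of a small naive Bayes classifier as the "traditional supervised model". None of these come from the paper itself.

```python
import math
import re
from collections import Counter

# Hypothetical high-precision rules for a single criterion (illustrative only).
POSITIVE_RULES = [re.compile(r"\bdiagnosed with (type 2 )?diabetes\b", re.I)]
NEGATIVE_RULES = [re.compile(r"\bno (history|evidence) of diabetes\b", re.I)]


def silver_label(text):
    """Return True or False when exactly one rule set fires, else None (abstain)."""
    pos = any(rule.search(text) for rule in POSITIVE_RULES)
    neg = any(rule.search(text) for rule in NEGATIVE_RULES)
    if pos == neg:  # neither fired, or both fired: too uncertain to label
        return None
    return pos


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def train_naive_bayes(labeled):
    """Collect per-class word and document counts from (text, label) pairs."""
    word_counts = {True: Counter(), False: Counter()}
    doc_counts = Counter()
    for text, label in labeled:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = set(word_counts[True]) | set(word_counts[False])
    return word_counts, doc_counts, vocab


def predict(model, text):
    """Classify text with multinomial naive Bayes and add-one smoothing."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in (True, False):
        total_words = sum(word_counts[label].values())
        score = math.log((doc_counts[label] + 1) / (total_docs + 2))
        for tok in tokenize(text):
            if tok in vocab:  # words never seen in training are ignored
                count = word_counts[label][tok]
                score += math.log((count + 1) / (total_words + len(vocab)))
        scores[label] = score
    return scores[True] > scores[False]


# Toy stand-in for an unlabeled corpus (invented sentences, not MIMIC data).
unlabeled = [
    "Patient diagnosed with type 2 diabetes, started on metformin and insulin.",
    "Diagnosed with diabetes in 2010; on insulin and metformin.",
    "No history of diabetes. Admitted for ankle fracture after a fall.",
    "No evidence of diabetes; presents with fracture of the wrist.",
    "Follow-up visit, vitals stable.",
]

# Step 1: build the silver standard from the documents where a rule fires.
silver = [(text, silver_label(text)) for text in unlabeled]
silver = [(text, label) for text, label in silver if label is not None]

# Step 2: train a supervised model on the silver standard.
model = train_naive_bayes(silver)
```

The point of the second step is that the trained model can classify reports that no rule matches: in this toy setting, `predict(model, "Continues metformin and insulin at home.")` relies on words that co-occurred with the rule matches rather than on the rules themselves.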

Christel Daniel-Le Bozec | Hugo Cisneros | Xavier Tannier | Nicolas Paris | Matthieu Doutreligne | Catherine Duclos | Nicolas Griffon | Claire Hassen-Khodja | Ivan Lerner | Adrien Parrot | Éric Sadou | Cyril Saussol | Pascal Vaillant
