Learning relevance models for patient cohort retrieval

Objective: We explored how judgments provided by physicians can be used to learn relevance models that enhance the quality of patient cohorts retrieved from Electronic Health Record (EHR) collections.

Methods: A very large number of features were extracted from patient cohort descriptions as well as from EHR collections. The features were used to investigate retrieving (1) neurology-specific patient cohorts from the de-identified Temple University Hospital electroencephalography (EEG) Corpus and (2) the more general cohorts evaluated in the TREC Medical Records Track (TRECMed) from the de-identified hospital records provided by the University of Pittsburgh Medical Center. The features informed a learning relevance model (LRM) that took advantage of relevance judgments provided by physicians. The LRM implements a pairwise learning-to-rank framework, which enables our learning patient cohort retrieval (L-PCR) system to learn from physicians' feedback.

Results and Discussion: We evaluated the L-PCR system against state-of-the-art traditional patient cohort retrieval systems and observed a 27% improvement when operating on EEGs and a 53% improvement when operating on TRECMed EHRs, showing the promise of the L-PCR system. We also performed extensive feature analyses to reveal the most effective strategies for representing cohort descriptions as queries, encoding EHRs, and measuring cohort relevance.

Conclusion: The L-PCR system has significant promise for reliably retrieving patient cohorts from EHRs in multiple settings when trained with relevance judgments. When provided with additional cohort descriptions, the L-PCR system will continue to learn, thus offering a potential solution to the performance barriers of current cohort retrieval systems.
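To make the pairwise learning-to-rank idea concrete, here is a minimal sketch in Python. It is not the L-PCR implementation: the abstract does not say which pairwise learner or which features were used, so the logistic pairwise loss, the function names (`pairwise_examples`, `train_pairwise_ranker`, `rank`), and the two-dimensional toy feature vectors are all assumptions made for illustration. The sketch turns graded relevance judgments for each cohort description into "more relevant minus less relevant" difference vectors and fits a linear scoring function to them.

```python
import math

def pairwise_examples(features, labels):
    """Build difference vectors for record pairs with different judgments.

    Each returned vector is (more relevant record) - (less relevant record).
    """
    pairs = []
    for i in range(len(features)):
        for j in range(i + 1, len(features)):
            if labels[i] == labels[j]:
                continue  # equal judgments carry no ordering information
            diff = [a - b for a, b in zip(features[i], features[j])]
            pairs.append(diff if labels[i] > labels[j] else [-d for d in diff])
    return pairs

def train_pairwise_ranker(queries, epochs=200, lr=0.1):
    """Fit linear weights with a logistic pairwise loss (an assumed choice).

    `queries` is a list of (features, labels) tuples, one per cohort
    description; pairs are only formed within the same description.
    """
    dim = len(queries[0][0][0])
    w = [0.0] * dim
    pairs = [p for feats, labels in queries
             for p in pairwise_examples(feats, labels)]
    for _ in range(epochs):
        for diff in pairs:
            margin = sum(wi * di for wi, di in zip(w, diff))
            # Gradient step on log(1 + exp(-margin)).
            g = 1.0 / (1.0 + math.exp(margin))
            w = [wi + lr * g * di for wi, di in zip(w, diff)]
    return w

def rank(w, features):
    """Return record indices ordered from highest to lowest score."""
    scores = [sum(wi * xi for wi, xi in zip(w, x)) for x in features]
    return sorted(range(len(features)), key=lambda i: -scores[i])

# Hypothetical toy data: two cohort descriptions, two features per record,
# with graded judgments (higher means more relevant).
toy_queries = [
    ([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]], [2, 0, 1]),
    ([[0.8, 0.3], [0.1, 0.9]], [1, 0]),
]
weights = train_pairwise_ranker(toy_queries)
ordering = rank(weights, [[0.3, 0.7], [0.7, 0.2]])
```

Because pairs are formed only among records judged for the same cohort description, adding newly judged descriptions simply contributes more training pairs, which is one way to read the abstract's remark that the system continues to learn from additional cohort descriptions.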
