Using large clinical corpora for query expansion in text-based cohort identification

In light of the heightened problems of polysemy, synonymy, and hyponymy in clinical text, we hypothesize that patient cohort identification can be improved by using a large, in-domain clinical corpus for query expansion. We evaluate the utility of four auxiliary collections for the Text REtrieval Conference (TREC) task of information retrieval (IR)-based cohort retrieval, considering the effects of collection size, the inherent difficulty of a query, and the interaction between the collections. Each collection was applied to aid in cohort retrieval from the Pittsburgh NLP Repository by using a mixture of relevance models. Measured by mean average precision (MAP), performance with any auxiliary resource (MAP=0.386 and above) improved over the baseline query likelihood model (MAP=0.373). Considering subsets of the Mayo Clinic collection, we found that once 2.5 billion term instances were included, adding more instances did not improve retrieval further. However, adding the Mayo Clinic collection did improve performance significantly over any existing setup, with a system using all four auxiliary collections obtaining the best results (MAP=0.4223). Because optimal results in the mixture of relevance models would require selective sampling of the collections, the common-sense approach of "use all available data" is inappropriate. Even so, we found that it was still beneficial to add the Mayo corpus to any mixture of relevance models. On the task of IR-based cohort identification, query expansion with the Mayo Clinic corpus resulted in consistent and significant improvements. As such, any IR query expansion system with access to a large clinical corpus could benefit from the additional resource. Additionally, we have shown that more data is not necessarily better, implying that there is value in collection curation.
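The idea of a mixture of relevance models can be illustrated with a minimal sketch. It assumes that each collection has already produced a term distribution P(w | Q, c) for the query and that the mixture is a simple weighted linear interpolation; the function names, collection names, weights, and toy probabilities below are hypothetical and are not taken from the study.

```python
from collections import defaultdict


def mix_relevance_models(models, weights):
    """Linearly interpolate per-collection term distributions P(w | Q, c).

    models:  {collection_name: {term: probability}}  (assumed precomputed)
    weights: {collection_name: non-negative weight}  (hypothetical values)
    """
    total = sum(weights.get(name, 0.0) for name in models)
    if total <= 0:
        raise ValueError("collection weights must sum to a positive value")
    mixed = defaultdict(float)
    for name, model in models.items():
        lam = weights.get(name, 0.0) / total  # normalized mixture weight
        for term, prob in model.items():
            mixed[term] += lam * prob
    return dict(mixed)


def top_expansion_terms(mixed, k):
    """Return the k highest-probability terms, ties broken alphabetically."""
    return sorted(mixed.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


# Toy distributions for one query; real models would be estimated from
# top-ranked documents retrieved from each collection.
models = {
    "target": {"mi": 0.5, "infarction": 0.5},
    "clinical": {"mi": 0.25, "stemi": 0.75},
}
weights = {"target": 1.0, "clinical": 1.0}

mixed = mix_relevance_models(models, weights)
expansion = top_expansion_terms(mixed, 2)
```

In this sketch, changing the weights changes how much each auxiliary collection contributes to the expanded query, which is one way to picture why the choice of collections in the mixture matters.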
