Learning to combine representations for medical records search

The complexity of medical terminology raises challenges when searching medical records. For example, synonymous terms such as 'cancer', 'tumour', and 'neoplasms' may prevent a traditional search system from retrieving relevant records that contain only synonyms of the query terms. Prior work deals with this by using bag-of-concepts approaches, which represent medical terms sharing the same meaning with concepts drawn from medical resources (e.g. MeSH). The relevance scores obtained from this representation are then combined with those of a traditional bag-of-words representation when inferring the relevance of medical records. Although these existing approaches are effective, they do not take into account the predicted retrieval effectiveness of either the bag-of-words or the bag-of-concepts representation, which could be used to model the score combination more effectively and hence improve retrieval performance. In this paper, we propose a novel learning framework that models the importance of the bag-of-words and bag-of-concepts representations and combines their scores on a per-query basis. Our proposed framework leverages retrieval performance predictors, such as the clarity score and AvIDF, calculated on both representations as learning features. We evaluate the proposed framework using the test collections of the TREC Medical Records track. As our framework can significantly outperform an existing approach that linearly merges the relevance scores, we conclude that retrieval performance predictors can be effectively leveraged when combining the relevance scores of different representations.
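The idea can be pictured as learning a per-query interpolation weight between the scores of the two representations, with query performance predictors computed on each representation serving as features. The following is a minimal sketch of that idea, assuming a gradient-boosted regressor, a simplified AvIDF predictor, and hypothetical function and variable names; it is an illustration under these assumptions, not the paper's actual implementation.

```python
# Illustrative sketch (not the authors' implementation): combine bag-of-words (BoW)
# and bag-of-concepts (BoC) relevance scores with a per-query interpolation weight
# predicted from pre-retrieval performance predictors such as AvIDF.

import math
from sklearn.ensemble import GradientBoostingRegressor


def av_idf(query_terms, doc_freqs, num_docs):
    """Average inverse document frequency of the query terms in one index."""
    idfs = [math.log(num_docs / (1 + doc_freqs.get(t, 0))) for t in query_terms]
    return sum(idfs) / len(idfs) if idfs else 0.0


def predictor_features(bow_query, boc_query, bow_stats, boc_stats):
    """Feature vector built from predictors on both representations."""
    return [
        av_idf(bow_query, bow_stats["df"], bow_stats["N"]),   # AvIDF on the BoW index
        av_idf(boc_query, boc_stats["df"], boc_stats["N"]),   # AvIDF on the BoC index
        len(bow_query),                                       # query length (terms)
        len(boc_query),                                       # query length (concepts)
    ]


# Trained offline on queries for which the best per-query interpolation weight
# (e.g. the one maximising retrieval effectiveness) is known.
weight_model = GradientBoostingRegressor()


def combine_scores(bow_scores, boc_scores, features, model):
    """Interpolate the two score lists with a predicted, clipped per-query weight."""
    lam = min(1.0, max(0.0, model.predict([features])[0]))
    docs = set(bow_scores) | set(boc_scores)
    return {
        doc: lam * bow_scores.get(doc, 0.0) + (1.0 - lam) * boc_scores.get(doc, 0.0)
        for doc in docs
    }
```

In this sketch, a fixed linear merge corresponds to using the same weight for every query; the learned model instead adapts the weight to how well each representation is predicted to perform on the query at hand.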
