Paper information - A Part-Of-Speech term weighting scheme for biomedical information retrieval

A Part-Of-Speech term weighting scheme for biomedical information retrieval

In the era of digitalization, information retrieval (IR), which retrieves and ranks documents from large collections according to users' search queries, has been widely applied in the biomedical domain. Use cases include building patient cohorts from electronic health records (EHRs) and searching the literature for topics of interest. Meanwhile, natural language processing (NLP) techniques, such as tokenization and Part-Of-Speech (POS) tagging, have been developed for processing clinical documents and biomedical literature. We hypothesize that NLP can be incorporated into IR to strengthen conventional IR models. In this study, we propose two NLP-empowered IR models, POS-BoW and POS-MRF, which incorporate automatic POS-based term weighting schemes into bag-of-words (BoW) and Markov Random Field (MRF) IR models, respectively. In the proposed models, the POS-based term weights are calculated iteratively with a cyclic coordinate method, in which a golden section line search is applied along each coordinate to optimize an objective function defined by mean average precision (MAP). In the empirical experiments, we used the data sets from the Medical Records track of the Text REtrieval Conference (TREC) 2011 and 2012 and the Genomics track of TREC 2004. The evaluation on the TREC 2011 and 2012 Medical Records tracks shows that the POS-BoW models improve on the BoW models by a mean of 10.88%, 4.54%, and 3.82% in MAP, bpref, and P@10, respectively; the POS-MRF models improve on the MRF models by 13.59%, 8.20%, and 8.78%. Additionally, we experimentally verify that the proposed weighting approach is superior to simple heuristic and frequency-based weighting approaches, and validate our POS category selection.
Using the optimal weights calculated in this experiment, we tested the proposed models on the TREC 2004 Genomics track and obtained average improvement rates of 8.63% and 10.04% for POS-BoW and POS-MRF, respectively. These significant improvements verify the effectiveness of leveraging POS tagging for biomedical IR tasks.
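
The weight optimization described in the abstract (a cyclic coordinate method with a golden section line search along each coordinate) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names, the [0, 1] weight bounds, the three-weight setup, and the toy objective are all assumptions made for illustration. In the paper the objective is MAP obtained from retrieval runs; here a smooth concave function stands in for it so the example is self-contained. Golden section search assumes the objective is unimodal along each coordinate, which a real MAP surface need not satisfy.

```python
import math
from typing import Callable, List, Sequence, Tuple

INV_PHI = (math.sqrt(5) - 1) / 2  # 1/phi, about 0.618


def golden_section_max(f: Callable[[float], float], lo: float, hi: float,
                       tol: float = 1e-4) -> float:
    """Approximate maximizer of a unimodal function f on [lo, hi]."""
    c = hi - INV_PHI * (hi - lo)
    d = lo + INV_PHI * (hi - lo)
    fc, fd = f(c), f(d)
    while hi - lo > tol:
        if fc > fd:
            # Maximum lies in [lo, d]; the old c becomes the new d.
            hi, d, fd = d, c, fc
            c = hi - INV_PHI * (hi - lo)
            fc = f(c)
        else:
            # Maximum lies in [c, hi]; the old d becomes the new c.
            lo, c, fc = c, d, fd
            d = lo + INV_PHI * (hi - lo)
            fd = f(d)
    return (lo + hi) / 2


def cyclic_coordinate_ascent(objective: Callable[[Sequence[float]], float],
                             init: Sequence[float],
                             lo: float = 0.0, hi: float = 1.0,
                             max_sweeps: int = 20,
                             eps: float = 1e-9) -> Tuple[List[float], float]:
    """Optimize one weight at a time, cycling until a sweep stops helping."""
    w = list(init)
    best = objective(w)
    for _ in range(max_sweeps):
        start = best
        for i in range(len(w)):
            def along_axis(x: float, i: int = i) -> float:
                # Vary only coordinate i; keep the other weights fixed.
                return objective(w[:i] + [x] + w[i + 1:])

            x_star = golden_section_max(along_axis, lo, hi)
            value = along_axis(x_star)
            if value > best:  # accept the move only if it improves
                w[i], best = x_star, value
        if best - start < eps:
            break
    return w, best


# Hypothetical stand-in for MAP: peaks at TARGET, one weight per POS category.
TARGET = [0.8, 0.3, 0.5]


def toy_objective(w: Sequence[float]) -> float:
    return -sum((wi - ti) ** 2 for wi, ti in zip(w, TARGET))


weights, score = cyclic_coordinate_ascent(toy_objective, [0.5, 0.5, 0.5])
print([round(x, 3) for x in weights])
```

To use this with a real retrieval system, `toy_objective` would be replaced by a function that runs the queries with the candidate POS weights and returns MAP over the training topics. Each evaluation is then expensive, so the line search tolerance and the number of sweeps become the main cost controls.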

Hongfang Liu | Stephen T. Wu | Dingcheng Li | Yanshan Wang | Saeed Mehrabi
