Generating links to background knowledge: a case study using narrative radiology reports

Automatically annotating texts with background information has recently received much attention. We conduct a case study in automatically generating links from narrative radiology reports to Wikipedia. Such links help users understand the medical terminology and thereby increase the value of the reports. Directly applying existing automatic link generation systems trained on Wikipedia to our radiology data does not yield satisfactory results. Our analysis reveals that medical phrases are often syntactically regular but semantically complicated, e.g., containing multiple concepts or concepts with multiple modifiers. This semantic complexity is the main reason for the failure of existing systems. Based on this observation, we propose an automatic link generation approach that takes both properties into account. For anchor text identification, we use a sequential labeling approach with syntactic features in order to exploit the syntactic regularities in medical terminology. We combine this with a sub-anchor based approach to target finding, which is aimed at coping with the complex semantic structure of medical phrases. Empirical results show that the proposed system improves performance over existing systems.
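
To make the sub-anchor idea concrete, the following is a minimal sketch and not the system described above. It assumes that a sub-anchor is any contiguous word sequence inside an identified anchor phrase and that target candidates come from a plain lookup table; the function names `sub_anchors` and `find_targets`, the `titles` dictionary, and the example phrase are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple


def sub_anchors(phrase: str) -> List[str]:
    """Return every contiguous word n-gram of the phrase, longest first."""
    words = phrase.lower().split()
    n = len(words)
    grams: List[str] = []
    for size in range(n, 0, -1):              # longest sub-anchors first
        for start in range(n - size + 1):     # slide a window over the phrase
            grams.append(" ".join(words[start:start + size]))
    return grams


def find_targets(phrase: str, titles: Dict[str, str]) -> List[Tuple[str, str]]:
    """Pair each sub-anchor found in the (hypothetical) title table with its page."""
    return [(gram, titles[gram]) for gram in sub_anchors(phrase) if gram in titles]


# Toy lookup table standing in for an index of Wikipedia titles (assumption).
titles = {
    "subdural hematoma": "Subdural_hematoma",
    "hematoma": "Hematoma",
    "frontal": "Frontal_lobe",
}

candidates = find_targets("acute left frontal subdural hematoma", titles)
for anchor, page in candidates:
    print(anchor, "->", page)
```

The whole phrase has no entry in the table, but its parts do, which is the situation the abstract attributes to phrases with multiple concepts or multiple modifiers. Ranking or disambiguating the resulting candidates is left out of this sketch.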
