Entity Linking for Biomedical Literature

The Entity Linking (EL) task links entity mentions in an unstructured document to entities in a knowledge base. Although the problem is well studied for news and social media, it has received little attention in the life science domain. One benefit of tackling EL in the life sciences is that it enables scientists to build computational models of biological processes more efficiently. However, simply applying a news-trained entity linker produces inadequate results. Because existing supervised approaches require a large amount of manually labeled training data, which is currently unavailable for the life science domain, we propose a novel unsupervised collective inference approach that links entities from the unstructured full texts of biomedical literature to 300 ontologies. The approach leverages the rich semantic information and structures in ontologies for similarity computation and entity ranking. Without using any manual annotation, our approach significantly outperforms a state-of-the-art supervised EL method (9% absolute gain in linking accuracy), even though that method requires 15,000 manually annotated entity mentions for training. These promising results establish a benchmark for the EL task in the life science domain. We also provide an in-depth analysis and discussion of both the challenges and the opportunities in automatic knowledge enrichment for scientific literature.

In this paper, we propose a novel unsupervised collective inference approach to address the EL problem in a new domain. We show that this unsupervised approach outperforms a current state-of-the-art supervised approach trained with a large amount of manually labeled data. Life science remains an underrepresented domain for applying EL techniques. By providing a small benchmark data set and identifying opportunities, we hope to stimulate discussion across natural language processing and bioinformatics and to motivate others to develop techniques for this largely untapped domain.
