Challenges in adapting text mining for full text articles to assist pathway curation

Annotation of biological pathway databases is largely driven by manual effort, with little assistance from text mining. Keeping pace with the ever-growing literature is a major challenge for pathway curators. Recent efforts have tried to fill this gap through text mining, by identifying relevant papers and the textual evidence pertaining to pathway information. In the current work, we evaluated the performance of a text mining system that extracts events describing molecular pathways from full text articles, and its potential role in assisting manual curation of pathway databases. We specifically investigated the merits of mining full text articles for pathway events by comparing the system's performance on full text articles and on biomedical abstracts. In preliminary results, processing full text articles improved the performance of the system by nearly 22%, at the cost of a small drop of 5% in precision, compared with extractions from PubMed abstracts. Preliminary analysis of the text mining results for selected pathways from PharmGKB suggests that pathway curators use their biological knowledge to infer new information that goes beyond what is often expressed in either the full text articles or the abstracts. This study is an attempt to identify the magnitude of the gaps that exist between text mining deliverables and the demands of pathway curation.
