BioContext: an integrated text mining system for large-scale extraction and contextualization of biomolecular events

Motivation: Although the amount of data in biology is rapidly increasing, critical information for understanding biological events like phosphorylation or gene expression remains locked in the biomedical literature. Most current text mining (TM) approaches to extracting information about biological events focus on limited-scale studies and/or abstracts; the extracted data lack context and are rarely made available to support further research.

Results: Here we present BioContext, an integrated TM system that extracts, extends and integrates results from a number of tools performing entity recognition, biomolecular event extraction and contextualization. Application of our system to 10.9 million MEDLINE abstracts and 234 000 open-access full-text articles from PubMed Central yielded over 36 million mentions representing 11.4 million distinct events. Event participants included over 290 000 distinct genes/proteins, which are mentioned more than 80 million times and linked where possible to Entrez Gene identifiers. Over a third of events contain contextual information, such as the anatomical location of the event occurrence or whether the event is reported as negated or speculative.

Availability: The BioContext pipeline is available for download (under the BSD license) at http://www.biocontext.org, along with the extracted data, which are also available for online browsing.

Contact: martin.gerner@gmail.com

Supplementary information: Supplementary data are available at Bioinformatics online.
