tmBioC: improving interoperability of text-mining tools with BioC

The lack of interoperability among biomedical text-mining tools is a major bottleneck in creating more complex applications. Despite the availability of numerous methods and techniques for various text-mining tasks, combining different tools requires substantial effort and time owing to the heterogeneity and variety of data formats. In response, BioC is a recent proposal that offers a minimalistic approach to tool interoperability by stipulating minimal changes to existing tools and applications. BioC is a family of XML formats that define how to present text documents and annotations, and it also provides easy-to-use functions to read/write documents in the BioC format. In this study, we introduce our text-mining toolkit, which is designed to perform several challenging and significant tasks in the biomedical domain, and repackage the toolkit into BioC to enhance its interoperability. Our toolkit consists of six state-of-the-art tools for named-entity recognition, normalization and annotation (PubTator) of genes (GenNorm), diseases (DNorm), mutations (tmVar), species (SR4GN) and chemicals (tmChem). Although developed within the same group, each tool is designed to process input articles and output annotations in a different format. We modify these tools and enable them to read/write data in the proposed BioC format. We find that, using the BioC family of formats and functions, only minimal changes were required to build the newer versions of the tools. The resulting BioC-wrapped toolkit, which we have named tmBioC, consists of our tools in BioC, an annotated full-text corpus in BioC, and a format detection and conversion tool. Furthermore, through participation in the 2013 BioCreative IV Interoperability Track, we empirically demonstrate that the tools in tmBioC can be more efficiently integrated with each other as well as with external tools: our experimental results show that using BioC reduces the lines of code needed for text-mining tool integration by >60%.
The tmBioC toolkit is publicly available at http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/tmTools/.
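
To make the idea of a shared document-and-annotation format concrete, the following is a minimal sketch of writing and reading a BioC-style XML collection. It uses only Python's standard xml.etree.ElementTree module rather than the BioC libraries themselves, and it is not the tmBioC implementation. The element names (collection, source, date, key, document, id, passage, offset, text, annotation, infon, location) follow the BioC DTD as we understand it; the function names, the sample sentence, the document id and the placeholder metadata values are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def build_collection(doc_id, text, mentions):
    """Build a BioC-style collection holding one document with one passage.

    mentions: list of (entity_type, start_offset, surface_text) tuples.
    Hypothetical helper for illustration; not part of any BioC library.
    """
    collection = ET.Element("collection")
    ET.SubElement(collection, "source").text = "example"    # placeholder
    ET.SubElement(collection, "date").text = "2014-01-01"   # placeholder
    ET.SubElement(collection, "key").text = "example.key"   # placeholder
    document = ET.SubElement(collection, "document")
    ET.SubElement(document, "id").text = doc_id
    passage = ET.SubElement(document, "passage")
    ET.SubElement(passage, "offset").text = "0"
    ET.SubElement(passage, "text").text = text
    for i, (entity_type, start, surface) in enumerate(mentions):
        ann = ET.SubElement(passage, "annotation", id=str(i))
        # The entity type is stored as a key/value "infon".
        ET.SubElement(ann, "infon", key="type").text = entity_type
        ET.SubElement(ann, "location",
                      offset=str(start), length=str(len(surface)))
        ET.SubElement(ann, "text").text = surface
    return collection


def read_annotations(xml_string):
    """Return (doc_id, type, offset, length, text) for every annotation."""
    root = ET.fromstring(xml_string)
    results = []
    for document in root.iter("document"):
        doc_id = document.findtext("id")
        for ann in document.iter("annotation"):
            location = ann.find("location")
            results.append((
                doc_id,
                ann.findtext("infon[@key='type']"),
                int(location.get("offset")),
                int(location.get("length")),
                ann.findtext("text"),
            ))
    return results


if __name__ == "__main__":
    sentence = "BRCA1 mutations raise breast cancer risk."
    mentions = [
        ("Gene", sentence.index("BRCA1"), "BRCA1"),
        ("Disease", sentence.index("breast cancer"), "breast cancer"),
    ]
    xml_string = ET.tostring(
        build_collection("12345", sentence, mentions), encoding="unicode")
    for row in read_annotations(xml_string):
        print(row)
```

Because a gene tagger and a disease tagger would both emit the same annotation structure in this sketch, a downstream tool needs only one reader, which is the kind of saving in integration code that the abstract reports.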
