Natural language processing pipelines to annotate BioC collections with an application to the NCBI disease corpus

BioC is a new format, with associated code libraries, for sharing text and annotations. We have implemented BioC natural language preprocessing pipelines in two popular programming languages: C++ and Java. The current implementations interface with the well-known MedPost and Stanford natural language processing tool sets. The pipelines provide sentence segmentation, tokenization, part-of-speech tagging, lemmatization and sentence parsing. They can be easily integrated, along with other BioC programs, into any BioC-compliant text mining system. As an application, we converted the NCBI disease corpus to BioC format and ran the pipelines on it successfully to demonstrate their functionality. Code and data can be downloaded from http://bioc.sourceforge.net. Database URL: http://bioc.sourceforge.net
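To make the first two pipeline stages concrete, here is a minimal Java sketch of sentence segmentation followed by tokenization, with each unit carrying a character offset relative to the whole document. It uses only the JDK and is not the BioC library, MedPost, or the Stanford tools: the `PipelineSketch` class, the `Span` type, and the regex-based tokenizer are assumptions made for illustration, and part-of-speech tagging, lemmatization and parsing are left out.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of the BioC distribution.
final class PipelineSketch {

    /** A typed text span with a document-level character offset (assumed type). */
    static final class Span {
        final int offset;
        final String text;
        final String type;

        Span(int offset, String text, String type) {
            this.offset = offset;
            this.text = text;
            this.type = type;
        }

        @Override
        public String toString() {
            return type + "@" + offset + ":" + text;
        }
    }

    // Assumed tokenization rule: a run of word characters, or one punctuation mark.
    private static final Pattern TOKEN = Pattern.compile("\\w+|[^\\w\\s]");

    /** Splits a passage into sentences using the JDK's rule-based BreakIterator. */
    static List<Span> splitSentences(String passage, int passageOffset) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(passage);
        List<Span> sentences = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String raw = passage.substring(start, end);
            String text = raw.trim();
            if (text.isEmpty()) {
                continue;
            }
            // Skip leading whitespace so the offset points at the first visible character.
            int lead = raw.indexOf(text);
            sentences.add(new Span(passageOffset + start + lead, text, "sentence"));
        }
        return sentences;
    }

    /** Tokenizes one sentence; token offsets stay relative to the document. */
    static List<Span> tokenize(Span sentence) {
        List<Span> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(sentence.text);
        while (m.find()) {
            tokens.add(new Span(sentence.offset + m.start(), m.group(), "token"));
        }
        return tokens;
    }

    public static void main(String[] args) {
        String passage = "BRCA1 is mutated. It causes cancer.";
        for (Span sentence : splitSentences(passage, 0)) {
            System.out.println(sentence);
            for (Span token : tokenize(sentence)) {
                System.out.println("  " + token);
            }
        }
    }
}
```

Keeping every offset relative to the document, rather than to the enclosing sentence, lets later stages attach their output to the same text without re-aligning it. A real pipeline would replace both methods with calls into a dedicated tool set.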
