Social and Semantic Web Technologies for the Text-to-Knowledge Translation Process in Biomedicine

Currently, biomedical research critically depends on knowledge availability for flexible re-analysis and integrative post-processing. The voluminous biological data already stored in databases, together with the abundant molecular data resulting from the rapid adoption of high-throughput techniques, have shown the potential to generate new biomedical discoveries through integration with knowledge from the scientific literature. Reliable information extraction applications have been a long-sought goal of the biomedical text mining community. Both named entity recognition and conceptual analysis are needed in order to map the objects and concepts represented by natural language texts into a rigorous encoding, with direct links to online resources that explicitly expose those concepts' semantics (see Figure 1). Naturally, automated methods work at a fraction of human accuracy, while expert curation has a small fraction of computer coverage. Hence, mining the wealth of knowledge in the published literature requires a hybrid approach that combines efficient automated methods with highly accurate expert curation. This work reviews several efforts in both directions and contributes to advancing the hybrid approach. Since the Life Sciences have turned into a very data-intensive domain, various sources of biological data must often be combined in order to build new knowledge. The Semantic Web offers a social and technological basis for assembling, integrating and making biomedical knowledge available at Web scale. In this chapter we present an open-source, modular, friendly system called BioNotate-2.0, which combines automated text annotation with distributed expert curation, and serves the resulting knowledge in a Semantic-Web-accessible format to be integrated into a wider biomedical inference pipeline.
Although this has been an active area of research and development for a few years, we believe that this is a unique contribution that will be widely adopted, enabling community effort in both further systems development and knowledge sharing.
