Facts from text: can text mining help to scale-up high-quality manual curation of gene products with ontologies?

The biomedical literature can be seen as a large, integrated but unstructured data repository. Extracting facts from the literature and making them accessible is approached from two directions. Manual curation efforts develop ontologies and vocabularies to annotate gene products based on statements in papers. Text mining aims to automatically identify entities and their relationships in text using information retrieval and natural language processing techniques. Manual curation is highly accurate but time-consuming, and does not scale with the ever-increasing growth of the literature. Text mining, as a high-throughput computational technique, scales well but is error-prone due to the complexity of natural language. How can the two be married to combine scalability and accuracy? Here, we review state-of-the-art text mining approaches that are relevant to annotation and discuss available online services that analyse biomedical literature by means of text mining techniques and that could also be utilised by annotation projects. We then examine how far text mining has already been utilised in existing annotation projects and conclude how these techniques could be tightly integrated into the manual annotation process through novel authoring systems to scale up high-quality manual curation.
