Analysis of biological processes and diseases using text mining approaches.

A number of biomedical text mining systems have been developed to extract biologically relevant information directly from the literature, complementing bioinformatics methods in the analysis of experimentally generated data. We provide a short overview of the general characteristics of natural language data, existing biomedical literature databases, and lexical resources relevant to biomedical text mining. A selection of practically useful systems is introduced, together with the types of user queries they support and the results they generate. The extraction of biological relationships, such as protein-protein interactions and metabolic and signaling pathways, using information extraction systems is discussed through example cases of cancer-relevant proteins. Basic strategies for detecting associations between genes and diseases are described, together with literature mining of mutations, SNPs, and epigenetic information (methylation). We provide an overview of disease-centric and gene-centric literature mining methods for linking genes to phenotypic and genotypic aspects. Moreover, we discuss recent efforts to find biomarkers through text mining and to analyze and prioritize gene lists. Some relevant issues in implementing a customized biomedical text mining system are pointed out. To demonstrate the usefulness of literature mining for the molecular oncology domain, we implemented two cancer-related applications. The first is a literature mining system that retrieves human mutations together with supporting articles and links specific gene mutations to a set of predefined cancer types. The second is a text categorization system that supports breast cancer-specific literature search and document-based breast cancer gene ranking. Future trends in text mining emphasize the importance of community efforts such as the BioCreative challenge for the development and integration of multiple systems into a common platform provided by the BioCreative Metaserver.
