Biomedical Text Mining and Its Applications

This tutorial is intended for biologists and computational biologists interested in adding text mining tools to their bioinformatics toolbox. As an illustrative example, the tutorial examines the relationship between progressive multifocal leukoencephalopathy (PML) and antibodies. Recent cases of PML have been associated with the administration of some monoclonal antibodies, such as efalizumab [1]. Those interested in a further introduction to text mining may also want to read other reviews [2]–[4].

Understanding large amounts of text with the aid of a computer is harder than simply equipping a computer with a grammar and a dictionary. A computer, like a human, needs certain specialized knowledge in order to understand text. The scientific field dedicated to training computers with the right knowledge for this task (among other tasks) is called natural language processing (NLP). Biomedical text mining (henceforth, text mining) is the subfield that deals with text from biology, medicine, and chemistry (henceforth, biomedical text). Another popular name is BioNLP, which some practitioners use synonymously with text mining.

Biomedical text is not a homogeneous realm [5]. Medical records are written differently from scientific articles, sequence annotations, or public health guidelines. Moreover, local dialects are not uncommon [6]. For example, medical centers develop their own jargon, and laboratories create their own idiosyncratic protein nomenclatures. In practice, this variability means that text mining applications are tailored to specific types of text. In particular, for reasons of availability and cost, many are designed for English-language scientific abstracts from Medline.
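To make the PML example concrete, the following is a minimal sketch of the simplest possible approach: dictionary lookup followed by co-occurrence counting. It is not a method described in this tutorial. The term dictionaries, the function names (`find_terms`, `cooccurrences`), and the four toy sentences are all hypothetical and were made up for illustration; they are not Medline records.

```python
import re
from collections import Counter
from itertools import product

# Hypothetical mini-dictionaries mapping surface forms to a normalized concept.
# A real application would draw on curated terminologies instead.
DISEASE_TERMS = {
    "progressive multifocal leukoencephalopathy": "PML",
    "pml": "PML",
}
DRUG_TERMS = {
    "efalizumab": "efalizumab",
    "monoclonal antibody": "monoclonal antibody",
    "monoclonal antibodies": "monoclonal antibody",
}


def find_terms(text, dictionary):
    """Return the set of normalized concepts whose surface forms occur in text."""
    lowered = text.lower()
    found = set()
    for surface, concept in dictionary.items():
        # Word boundaries avoid matching inside longer words.
        if re.search(r"\b" + re.escape(surface) + r"\b", lowered):
            found.add(concept)
    return found


def cooccurrences(abstracts):
    """Count how many texts mention each (disease, drug) concept pair."""
    counts = Counter()
    for text in abstracts:
        diseases = find_terms(text, DISEASE_TERMS)
        drugs = find_terms(text, DRUG_TERMS)
        for pair in product(sorted(diseases), sorted(drugs)):
            counts[pair] += 1
    return counts


# Toy sentences invented for this sketch (assumption: not real abstracts).
TOY_ABSTRACTS = [
    "Progressive multifocal leukoencephalopathy was reported after efalizumab.",
    "PML developed in a patient treated with monoclonal antibodies.",
    "The PML protein localizes to nuclear bodies.",
    "Efalizumab improved plaque psoriasis.",
]

if __name__ == "__main__":
    for pair, n in sorted(cooccurrences(TOY_ABSTRACTS).items()):
        print(pair, n)
    # The third sentence is about the promyelocytic leukemia protein,
    # yet the dictionary still tags it as the disease concept.
    print(find_terms(TOY_ABSTRACTS[2], DISEASE_TERMS))
```

The third toy sentence shows why a dictionary alone falls short: "PML" there refers to the promyelocytic leukemia protein, but plain lookup cannot tell the two senses apart. Resolving such ambiguity is the kind of specialized knowledge the text refers to.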

[1]  A. Rzhetsky,et al.  Self-Correcting Maps of Molecular Pathways , 2006, PloS one.

[2]  Hinrich Schütze,et al.  Book Reviews: Foundations of Statistical Natural Language Processing , 1999, CL.

[3]  Sophia Ananiadou,et al.  FACTA: a text search engine for finding associated biomedical concepts , 2008, Bioinform..

[4]  Meredith Wadman,et al.  Open-access policy flourishes at NIH , 2009, Nature.

[5]  Anantha Bangalore,et al.  The UMLS Knowledge Source Server: An Object Model For Delivering UMLS Data , 2003, AMIA.

[6]  Barend Mons,et al.  Online tools to support literature-based discovery in the life sciences , 2005, Briefings Bioinform..

[7]  K. Cohen,et al.  Overview of BioCreative II gene normalization , 2008, Genome Biology.

[8]  Mark Gerstein,et al.  Getting Started in Text Mining: Part Two , 2009, PLoS Comput. Biol..

[9]  Ramón Alonso Allende.  Accelerating searches of research grants and scientific literature with novo|seekSM , 2009 .

[10]  P. Bork,et al.  G2D: a tool for mining genes associated with disease , 2005, BMC Genetics.

[11]  R. W. Baldwin,et al.  Influence of ICRF 159 and triton WR 1339 on metastases of a rat epithelioma , 1975, British Journal of Cancer.

[12]  Lorraine K. Tanabe,et al.  GENETAG: a tagged corpus for gene/protein named entity recognition , 2005, BMC Bioinformatics.

[13]  Ioannis Xenarios,et al.  Mining literature for protein-protein interactions , 2001, Bioinform..

[14]  B. De Moor,et al.  TXTGate: profiling gene groups with text-based information , 2004, Genome Biology.

[15]  Hagit Shatkay,et al.  Multi-dimensional classification of biomedical text: Toward automated, practical provision of high-utility text to diverse users , 2008, Bioinform..

[16]  K. E. Ravikumar,et al.  Literature mining and database annotation of protein phosphorylation using a rule-based system , 2005, Bioinform..

[17]  C. Blaschke,et al.  The potential use of SUISEKI as a protein interaction discovery tool , 2001, Genome informatics. International Conference on Genome Informatics.

[18]  A. Valencia,et al.  A gene network for navigating the literature , 2004, Nature Genetics.

[19]  Gaurav Pandey,et al.  Computational Approaches for Protein Function Prediction: A Survey , 2006 .

[20]  Bassem A. Hassan,et al.  Gene prioritization through genomic data fusion , 2006, Nature Biotechnology.

[21]  A. Valencia,et al.  Linking genes to literature: text mining, information extraction, and retrieval applications for biology , 2008, Genome Biology.

[22]  Raul Rodriguez-Esteban,et al.  Figure mining for biomedical research , 2009, Bioinform..

[23]  Michael Krauthammer,et al.  Yale Image Finder (YIF): a new search engine for retrieving biomedical images , 2008, Bioinform..

[24]  K. Bretonnel Cohen,et al.  Intrinsic Evaluation of Text Mining Tools May Not Predict Performance on Realistic Tasks , 2007, Pacific Symposium on Biocomputing.

[25]  Jeyakumar Natarajan,et al.  Text mining of full-text journal articles combined with gene expression analysis reveals a relationship between sphingosine-1-phosphate and invasiveness of a glioblastoma cell line , 2006, BMC Bioinformatics.

[26]  Carlos Santos,et al.  Wnt pathway curation using automated natural language processing: combining statistical methods with partial and full parse for knowledge extraction , 2005 .

[27]  Hagit Shatkay,et al.  New directions in biomedical text annotation: definitions, guidelines and corpus construction , 2006, BMC Bioinformatics.

[28]  M. Schuemie,et al.  Anni 2.0: a multipurpose text-mining tool for the life sciences , 2008, Genome Biology.

[29]  Michael R. Seringhaus,et al.  Seeking a New Biology through Text Mining , 2008, Cell.

[30]  Burr Settles.  ABNER: an open source tool for automatically tagging genes, proteins and other entity names in text , 2005 .

[31]  D. Swanson.  Migraine and Magnesium: Eleven Neglected Connections , 1988, Perspectives in biology and medicine.

[32]  Joel D. Martin,et al.  PreBIND and Textomy – mining the biomedical literature for protein-protein interactions using a support vector machine , 2003, BMC Bioinformatics.

[33]  Hongfang Liu,et al.  BioThesaurus: a web-based thesaurus of protein and gene names , 2006, Bioinform..

[34]  Udo Hahn,et al.  High-performance gene name normalization with GENO , 2009, Bioinform..

[35]  P. Bork,et al.  Drug Target Identification Using Side-Effect Similarity , 2008, Science.

[36]  A. Valencia,et al.  Text-mining and information-retrieval services for molecular biology , 2005, Genome Biology.

[37]  William H. Majoros,et al.  Genomics and natural language processing , 2002, Nature Reviews Genetics.

[38]  Michael Schroeder,et al.  GoPubMed: exploring PubMed with the Gene Ontology , 2005, Nucleic Acids Res..

[39]  ChengXiang Zhai,et al.  Automatic annotation of protein motif function with Gene Ontology terms , 2003, BMC Bioinformatics.

[40]  K. E. Ravikumar,et al.  An online literature mining tool for protein phosphorylation , 2006, Bioinform..

[41]  Yanhui Hu,et al.  A Biomedically Enriched Collection of 7000 Human ORF Clones , 2008, PloS one.

[42]  David S. Wishart,et al.  PolySearch: a web-based text mining system for extracting relationships between human diseases, genes, mutations, drugs and metabolites , 2008, Nucleic Acids Res..

[43]  Lawrence Hunter,et al.  Biomedical Discovery Acceleration, with Applications to Craniofacial Development , 2009, PLoS Comput. Biol..

[44]  Burr Settles,et al.  ABNER: an open source tool for automatically tagging genes, proteins and other entity names in text , 2005 .

[45]  Peer Bork,et al.  The way we write , 2003, EMBO Reports.

[46]  T. Jenssen,et al.  A literature network of human genes for high-throughput analysis of gene expression , 2001, Nature Genetics.

[47]  Gabriele Ausiello,et al.  MINT: the Molecular INTeraction database , 2006, Nucleic Acids Res..

[48]  David T. Jones,et al.  Improving classification in protein structure databases using text mining , 2009, BMC Bioinformatics.

[49]  Timur Shtatland,et al.  PepBank - a database of peptides based on sequence text mining and public peptide data sources , 2007, BMC Bioinformatics.

[50]  Carol Friedman,et al.  Two biomedical sublanguages: a description based on the theories of Zellig Harris , 2002, J. Biomed. Informatics.

[51]  Neil R. Smalheiser,et al.  Arrowsmith two-node search interface: A tutorial on finding meaningful links between two disparate sets of articles in MEDLINE , 2009, Comput. Methods Programs Biomed..

[52]  K. Bretonnel Cohen,et al.  U-Compare: share and compare text mining tools with UIMA , 2009, Bioinform..

[53]  王林,et al.  GoPubmed , 2010 .

[54]  B. Vastag.  NIH launches PubMed Central , 2000, Journal of the National Cancer Institute.

[55]  Michael Krauthammer,et al.  GENIES: a natural-language processing system for the extraction of molecular pathways from journal articles , 2001, ISMB.

[56]  Mirana Ramialison,et al.  Rapid identification of PAX2/5/8 direct downstream targets in the otic vesicle by combinatorial use of bioinformatics tools , 2008, Genome Biology.

[57]  P. Shannon,et al.  Cytoscape: a software environment for integrated models of biomolecular interaction networks , 2003, Genome research.

[58]  Hodong Lee,et al.  E3Miner: a text mining tool for ubiquitin-protein ligases , 2008, Nucleic Acids Res..

[59]  Hagit Shatkay,et al.  EpiLoc: a (working) text-based system for predicting protein subcellular location , 2008, Pacific Symposium on Biocomputing.

[60]  Preslav Nakov,et al.  BioText Search Engine: beyond abstract search , 2007, Bioinform..

[61]  Chris Sander,et al.  Introducing meta-services for biomedical information extraction , 2008, Genome Biology.

[62]  Dietrich Rebholz-Schuhmann,et al.  Text processing through Web services: calling Whatizit , 2008, Bioinform..

[63]  H. Fleury,et al.  Improvement of Progressive Multifocal Leukoencephalopathy After Cidofovir Therapy in a Patient with a Destructive Polyarthritis , 2007, Infection.

[64]  Alan R. Aronson,et al.  Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program , 2001, AMIA.

[65]  Hagit Shatkay,et al.  SherLoc: high-accuracy prediction of protein subcellular localization by integrating text and protein sequence data , 2007, Bioinformatics.

[66]  Peer Bork,et al.  LSAT: learning about alternative transcripts in MEDLINE , 2006, Bioinform..

[67]  Thomas C. Rindflesch,et al.  MedPost: a part-of-speech tagger for bioMedical text , 2004, Bioinform..

[68]  Jeffrey M. Weinberg,et al.  Patient fatalities potentially associated with efalizumab use , 2009, Journal of drugs in dermatology : JDD.

[69]  Michael Schroeder,et al.  GoGene: gene annotation in the fast lane , 2009, Nucleic Acids Res..

[70]  K. Bretonnel Cohen,et al.  Getting Started in Text Mining , 2008, PLoS Comput. Biol..

[71]  Alfonso Valencia,et al.  iHOP web services , 2007, Nucleic Acids Res..

[72]  M. Rivera,et al.  Analysis of genomic and proteomic data using advanced literature mining , 2003, Journal of proteome research.

[73]  Adrian J. Shepherd,et al.  Protein name tagging in the immunological domain , 2008 .

[74]  Michael Krauthammer,et al.  Term identification in the biomedical literature , 2004, J. Biomed. Informatics.