Literature mining for the biologist: from information retrieval to biological discovery

For the average biologist, hands-on literature mining currently means a keyword search in PubMed. However, methods for extracting biomedical facts from the scientific literature have improved considerably, and the associated tools will probably soon be used in many laboratories to automatically annotate and analyse the growing number of system-wide experimental data sets. Owing to the increasing body of text and the open-access policies of many journals, literature mining is also becoming useful for both hypothesis generation and biological discovery. Such discovery will require the integration of literature and high-throughput data, which should encourage close collaboration between biologists and computational linguists.
