Mining the Biomedical Literature in the Genomic Era: An Overview

The past decade has seen tremendous growth in the amount of experimental and computational biomedical data, particularly in genomics and proteomics. This growth has been accompanied by an accelerating increase in the number of biomedical publications discussing the findings. In the last few years, the scientific community has shown considerable interest in literature-mining tools that help sort through this abundance of literature and find the nuggets of information most relevant and useful for specific analysis tasks. This paper provides a road map to the various literature-mining methods, both in general and within bioinformatics. It surveys the disciplines involved in unstructured-text analysis, categorizes current work in biomedical literature mining with respect to these disciplines, and provides examples of text-analysis methods applied to some of the current challenges in bioinformatics.
