Knowledge Discovery in Biology and Biotechnology Texts: A Review of Techniques, Evaluation Strategies, and Applications

Abstract Arguably, the richest source of knowledge (as opposed to fact and data collections) about biology and biotechnology is captured in natural-language documents such as technical reports, conference proceedings and research articles. The automatic exploitation of this rich knowledge base for decision making, hypothesis management (generation and testing) and knowledge discovery constitutes a formidable challenge. Recently, a set of technologies collectively referred to as knowledge discovery in text (KDT) has been advocated as a promising approach to tackle this challenge. KDT comprises three main tasks: information retrieval, information extraction and text mining. These tasks are the focus of much recent scientific research and many algorithms have been developed and applied to documents and text in biology and biotechnology. This article introduces the basic concepts of KDT, provides an overview of some of these efforts in the field of bioscience and biotechnology, and presents a framework of commonly used techniques for evaluating KDT methods, tools and systems.

[1]  Gene H. Golub,et al.  Matrix computations , 1983 .

[2]  D. Swanson Fish Oil, Raynaud's Syndrome, and Undiscovered Public Knowledge , 2015, Perspectives in biology and medicine.

[3]  James F. Allen Natural language understanding , 1987, Bejnamin/Cummings series in computer science.

[4]  Peter Willett,et al.  Recent trends in hierarchic document clustering: A critical review , 1988, Inf. Process. Manag..

[5]  Gerard Salton,et al.  Term-Weighting Approaches in Automatic Text Retrieval , 1988, Inf. Process. Manag..

[6]  Richard A. Harshman,et al.  Indexing by Latent Semantic Analysis , 1990, J. Am. Soc. Inf. Sci..

[7]  Ronald W. Davis,et al.  Quantitative Monitoring of Gene Expression Patterns with a Complementary DNA Microarray , 1995, Science.

[8]  Y Yang,et al.  An analysis of statistical term strength and its use in the indexing and retrieval of molecular biology texts , 1996, Comput. Biol. Medicine.

[9]  Neil R. Smalheiser,et al.  Artificial Intelligence An interactive system for finding complementary literatures : a stimulus to scientific discovery , 1995 .

[10]  David H. Wolpert,et al.  No free lunch theorems for optimization , 1997, IEEE Trans. Evol. Comput..

[11]  P. Brown,et al.  Exploring the metabolic and genetic control of gene expression on a genomic scale. , 1997, Science.

[12]  Betsy L. Humphreys,et al.  Technical Milestone: The Unified Medical Language System: An Informatics Research Collaboration , 1998, J. Am. Medical Informatics Assoc..

[13]  Thomas G. Dietterich Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms , 1998, Neural Computation.

[14]  T. Takagi,et al.  Toward information extraction: identifying protein names from biological papers. , 1998, Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing.

[15]  Park,et al.  Identifying the Interaction between Genes and Gene Products Based on Frequently Seen Verbs in Medline Abstracts. , 1998, Genome informatics. Workshop on Genome Informatics.

[16]  Miguel A. Andrade-Navarro,et al.  Automatic extraction of keywords from scientific text: application to the knowledge domain of protein families , 1998, Bioinform..

[17]  N R Smalheiser,et al.  Using ARROWSMITH: a computer-assisted approach to formulating and assessing scientific hypotheses. , 1998, Computer methods and programs in biomedicine.

[18]  Ng,et al.  Toward Routine Automatic Pathway Discovery from On-line Scientific Text Abstracts. , 1999, Genome informatics. Workshop on Genome Informatics.

[19]  Marti A. Hearst Untangling Text Data Mining , 1999, ACL.

[20]  Shivakumar Vaithyanathan,et al.  Model Selection in Unsupervised Learning with Applications To Document Clustering , 1999, International Conference on Machine Learning.

[21]  Anil K. Jain,et al.  Data clustering: a review , 1999, CSUR.

[22]  Mark Craven,et al.  Constructing Biological Knowledge Bases by Extracting Information from Text Sources , 1999, ISMB.

[23]  Allen C. Browne,et al.  Analysis of biomedical text for chemical names: a comparison of three methods , 1999, AMIA.

[24]  V. Brusic,et al.  Knowledge discovery and data mining in biological databases , 1999, The Knowledge Engineering Review.

[25]  L Hunter,et al.  MedMiner: an Internet text-mining tool for biomedical information, with application to gene expression profiling. , 1999, BioTechniques.

[26]  G Demetriou,et al.  Two applications of information extraction to biological science journal articles: enzyme interactions and protein structures. , 1999, Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing.

[27]  C. Ouzounis,et al.  Automatic extraction of protein interactions from scientific abstracts. , 1999, Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing.

[28]  B J Stapley,et al.  Biobibliometrics: information retrieval and visualization from co-occurrences of gene names in Medline abstracts. , 1999, Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing.

[29]  C. Friedman,et al.  Using BLAST for identifying gene and protein names in journal articles. , 2000, Gene.

[30]  Nigel Collier,et al.  Extracting the Names of Genes and Gene Products with a Hidden Markov Model , 2000, COLING.

[31]  Virginia Reviewer-Teller,et al.  Review of Speech and language processing: an introduction to natural language processing, computational linguistics, and speech recognition by Daniel Jurafsky and James H. Martin. Prentice Hall 2000. , 2000 .

[32]  George Karypis,et al.  A Comparison of Document Clustering Techniques , 2000 .

[33]  James H. Martin,et al.  Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition , 2000 .

[34]  Anton J. Enright,et al.  TEXTQUEST: Document Clustering of MEDLINE Abstracts For Concept Discovery In Molecular Biology , 2000, Pacific Symposium on Biocomputing.

[35]  Toshihisa Takagi,et al.  Automated extraction of information on protein-protein interactions from the biological literature , 2001, Bioinform..

[36]  Jun'ichi Tsujii,et al.  Event Extraction from Biomedical Papers Using a Full Parser , 2000, Pacific Symposium on Biocomputing.

[37]  Michael Krauthammer,et al.  GENIES: a natural-language processing system for the extraction of molecular pathways from journal articles , 2001, ISMB.

[38]  T. Jenssen,et al.  A literature network of human genes for high-throughput analysis of gene expression , 2001, Nature Genetics.

[39]  SystemLimsoon Wong,et al.  A Protein Interaction Extraction System , 2001 .

[40]  Jong C. Park,et al.  Bidirectional Incremental Parsing for Automatic Pathway Identification with Combinatory Categorial Grammar , 2000, Pacific Symposium on Biocomputing.

[41]  Padmini Srinivasan,et al.  MeSHmap: a text mining tool for MEDLINE , 2001, AMIA.

[42]  T. Ideker,et al.  A new approach to decoding life: systems biology. , 2001, Annual review of genomics and human genetics.

[43]  P Bork,et al.  XplorMed: a tool for exploring MEDLINE abstracts. , 2001, Trends in biochemical sciences.

[44]  Vasileios Hatzivassiloglou,et al.  Disambiguating proteins, genes, and RNA in text: a machine learning approach , 2001, ISMB.

[45]  Maria T. Pazienza,et al.  Information Extraction , 2002, Lecture Notes in Computer Science.

[46]  Alfonso Valencia,et al.  Automatic ontology construction from the literature. , 2002, Genome informatics. International Conference on Genome Informatics.

[47]  Hinrich Schütze,et al.  Book Reviews: Foundations of Statistical Natural Language Processing , 1999, CL.

[48]  Chiara Sabatti,et al.  Statistical Issues in Microarray Analysis , 2002 .

[49]  Fredrik Olsson,et al.  Exploiting Syntax when Detecting Protein Names in Text , 2002 .

[50]  Michael J. E. Sternberg,et al.  Predicting the Sub-Cellular Location of Proteins from Text Using Support Vector Machines , 2001, Pacific Symposium on Biocomputing.

[51]  James Pustejovsky,et al.  Robust Relational Parsing Over Biomedical Literature: Extracting Inhibit Relations , 2001, Pacific Symposium on Biocomputing.

[52]  Martin Romacker,et al.  Creating Knowledge Repositories from Biomedical Reports: The MEDSYNDIKATE Text Mining System , 2001, Pacific Symposium on Biocomputing.

[53]  Jeffrey T. Chang,et al.  Associating genes with gene ontology codes using a maximum entropy analysis of biomedical literature. , 2002, Genome research.

[54]  Jun'ichi Tsujii,et al.  Tuning support vector machines for biomedical named entity recognition , 2002, ACL Workshop on Natural Language Processing in the Biomedical Domain.

[55]  Peter Jackson,et al.  Natural language processing for online applications : text retrieval, extraction and categorization , 2002 .

[56]  Fabrizio Sebastiani,et al.  Machine learning in automated text categorization , 2001, CSUR.

[57]  R. Altman,et al.  Using text analysis to identify functionally coherent gene groups. , 2002, Genome research.

[58]  W. John Wilbur,et al.  A Thematic Analysis of the AIDS Literature , 2001, Pacific Symposium on Biocomputing.

[59]  Hsinchun Chen,et al.  Filling Preposition-Based Templates to Capture Information from Medical Abstracts , 2001, Pacific Symposium on Biocomputing.

[60]  Daniel Berleant,et al.  Mining MEDLINE: Abstracts, Sentences, or Phrases? , 2001, Pacific Symposium on Biocomputing.

[61]  Padmini Srinivasan,et al.  Exploring text mining from MEDLINE , 2002, AMIA.

[62]  Peter Willett,et al.  Protein Structures and Information Extraction from Biological Texts: The PASTA System , 2003, Bioinform..

[63]  Miguel A. Andrade-Navarro,et al.  Information extraction from full text scientific articles: Where are the keywords? , 2003, BMC Bioinformatics.

[64]  Werner Dubitzky,et al.  A Practical Approach to Microarray Data Analysis , 2003, Springer US.

[65]  K. E. Ravikumar,et al.  A Biological Named Entity Recognizer , 2002, Pacific Symposium on Biocomputing.

[66]  Daniel Hanisch,et al.  : identifying , 2022 .

[67]  Jian Su,et al.  Recognizing Names in Biomedical Texts: a Machine Learning Approach , 2004 .

[68]  Michael Krauthammer,et al.  GeneWays: a system for extracting, analyzing, visualizing, and integrating molecular pathway data , 2004, J. Biomed. Informatics.

[69]  Yang Jin,et al.  An entity tagger for recognizing acquired genomic variations in cancer literature , 2004, Bioinform..

[70]  K. Bretonnel Cohen,et al.  Natural Language Processing and Systems Biology , 2004, Artificial Intelligence Methods And Tools For Systems Biology.

[71]  Ulf Leser,et al.  Finding kinetic parameters using text mining. , 2004, Omics : a journal of integrative biology.

[72]  Jeyakumar Natarajan,et al.  Text Mining of Full Text Articles and Creation of a Knowledge Base for Analysis of Microarray Data , 2004, KELSI.

[73]  Barbara Rosario,et al.  Classifying Semantic Relations in Bioscience Texts , 2004, ACL.

[74]  Hao Yu,et al.  Discovering patterns to extract protein-protein interactions from full texts , 2004, Bioinform..

[75]  G. Karypis,et al.  Criterion functions for document clustering , 2005 .