Creating Reference Datasets for Systems Biology Applications Using Text Mining

High‐throughput experimental techniques generate large data collections with the aim of identifying novel entities involved in fundamental cellular processes and of drawing a systematic picture of the relationships between individual components. Determining the accuracy of the resulting data and selecting a subset of targets for more careful characterization often require relying on information from manually annotated data repositories. These repositories are incomplete and cover only a small fraction of the knowledge contained in the literature. In this paper, we propose the use of text‐mining technologies to extract, organize, and present information relevant to a particular biological topic. The aims of the resulting approach are (1) to enable topic‐centric navigation of the biological literature, (2) to assist in the construction of manually revised data repositories, (3) to prioritize biological entities for experimental studies, and (4) to enable human interpretation of large‐scale experiments by linking bio‐entities directly to relevant descriptions in the literature.
