The GeneMine system for genome/proteome annotation and collaborative data mining

As genome data and bioinformatics resources grow exponentially in size and complexity, there is an increasing need for software that can bridge the gap between biologists with questions and the worldwide set of highly specialized tools for answering them. The GeneMine system for small- to medium-scale genome analysis provides: (1) automated analysis of DNA (deoxyribonucleic acid) and protein sequence data using over 50 different analysis servers via the Internet, integrating data from homologous functions, tissue expression patterns, mapping, polymorphisms, model organism data and phenotypes, protein structural domains, active sites, motifs, and other features; (2) automated filtering and data reduction to highlight significant and interesting patterns; (3) a visual data-mining interface for rapidly exploring correlations, patterns, and contradictions within these data via aggregation, overlay, and drill-down, all projected onto relevant sequence alignments and three-dimensional structures; (4) a plug-in architecture that makes adding new types of analysis, data sources, and servers (including anything on the Internet) as easy as supplying the relevant URLs (uniform resource locators); (5) a hypertext system that lets users create and share "live" views of their discoveries by embedding three-dimensional structures, alignments, and annotation data within their documents; and (6) an integrated database schema for mining large GeneMine data sets in a relational database. The value of the GeneMine system is that it automatically brings together and uncovers important functional information from a much wider range of sources than a given specialist would normally think to query, yielding insights that the researcher was not planning to look for. In this paper we present the architecture of the software for integrating and mining very diverse biological data, together with cross-validation of its gene function predictions.
The software is freely available at http://www.bioinformatics.ucla.edu/genemine.
