A Computational Biology Database Digest: Data, Data Analysis, and Data Management

Computational Biology, or Bioinformatics, has been defined as the application of mathematical and Computer Science methods to problems in Molecular Biology that require large-scale data, computation, and analysis [26]. Molecular Biology databases therefore play an essential role in Computational Biology research and development. This paper provides an introduction to current Molecular Biology databases, with an emphasis on data modeling, data acquisition, data retrieval, and the integration of Molecular Biology data from different sources. It is primarily intended for computer scientists with a limited background in Biology.

[1] S. B. Needleman, et al. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of molecular biology, 1970.

[2] Lewis Y. Geer, et al. Cn3D: sequence and structure views for Entrez. Trends in biochemical sciences, 2000.

[3] D. Eisenberg, et al. Correlation of sequence hydrophobicities measures similarity in three-dimensional protein structure. Journal of molecular biology, 1983.

[4] G. Heijne. Membrane protein structure prediction. Hydrophobicity analysis and the positive-inside rule. Journal of molecular biology, 1992.

[5] Charles Elkan, et al. Fitting a Mixture Model By Expectation Maximization To Discover Motifs In Biopolymer. ISMB, 1994.

[6] Evelyn Camon, et al. The EMBL Nucleotide Sequence Database. Nucleic Acids Res., 2000.

[7] Guy Perrière, et al. EMGLib: the Enhanced Microbial Genomes Library (update 2000). Nucleic Acids Res., 2000.

[8] P. Argos, et al. SRS: information retrieval system for molecular biology data banks. Methods in enzymology, 1996.

[9] C. Sander, et al. Protein structure comparison by alignment of distance matrices. Journal of molecular biology, 1993.

[10] J. Thompson, et al. Multiple sequence alignment with Clustal X. Trends in biochemical sciences, 1998.

[11] Dmitrij Frishman, et al. MIPS: a database for genomes and protein sequences. Nucleic Acids Res., 2000.

[12] J. Jurka, et al. Multiple aligned sequence editor (MASE). Trends in biochemical sciences, 1988.

[13] T. Jenssen, et al. A literature network of human genes for high-throughput analysis of gene expression. 2001.

[14] Stefano Spaccapietra, et al. Model independent assertions for integration of heterogeneous schemas. The VLDB Journal, 1992.

[15] R. Langridge, et al. Improvements in protein secondary structure prediction by an enhanced neural network. Journal of molecular biology, 1990.

[16] M. C. Peitsch, et al. ProMod and Swiss-Model: Internet-based tools for automated comparative protein modelling. Biochemical Society transactions, 1996.

[17] Shalom Tsur. Data Mining in the Bioinformatics Domain. VLDB, 2000.

[18] S. Wodak, et al. Representing and Analysing Molecular and Cellular Function Using the Computer. Biological chemistry, 2000.

[19] Anil K. Jain, et al. Algorithms for Clustering Data. 1988.

[20] V. M. Markowitz, et al. Facilities for exploring molecular biology databases on the Web: a comparative study. Pacific Symposium on Biocomputing, 1997.

[21] D. G. George, et al. A standardized format for sequence data exchange. Protein sequences & data analysis, 1987.

[22] Laks V. S. Lakshmanan, et al. SchemaSQL - A Language for Interoperability in Relational Multi-Database Systems. VLDB, 1996.

[23] Jean Thierry-Mieg, et al. The ACEDB genome database. 1994.

[24] B. Rost, et al. Prediction of protein secondary structure at better than 70% accuracy. Journal of molecular biology, 1993.

[25] Shahrokh Saeednia, et al. How to maintain both privacy and authentication in digital libraries. 2000.

[26] P. Pevzner, et al. Gene recognition via spliced sequence alignment. Proceedings of the National Academy of Sciences of the United States of America, 1996.

[27] Carole A. Goble, et al. TAMBIS: Transparent Access to Multiple Bioinformatics Information Sources. ISMB, 1998.

[28] H. V. Jagadish, et al. Data Integration using Self-Maintainable Views. EDBT, 1996.

[29] Stanley Letovsky, et al. GDB: the Human Genome Database. Nucleic Acids Res., 1998.

[30] Peter B. McGarvey, et al. Protein Information Resource: a community resource for expert annotation of protein data. Nucleic Acids Res., 2001.

[31] T. L. Blundell, et al. FUGUE: sequence-structure homology recognition using environment-specific substitution tables and structure-dependent gap penalties. Journal of molecular biology, 2001.

[32] Michael Gribskov, et al. Combining evidence using p-values: application to sequence homology searches. Bioinform., 1998.

[33] J. Thompson, et al. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic acids research, 1994.

[34] François Bancilhon, et al. Building an Object-Oriented Database System, The Story of O2. 1992.

[35] A. Sali, et al. Comparative protein structure modeling of genes and genomes. Annual review of biophysics and biomolecular structure, 2000.

[36] Gunter Saake, et al. Adding Conflict Resolution Features to a Query Language for Database Federations. Australas. J. Inf. Syst., 2000.

[37] Dan S. Prestridge, et al. SIGNAL SCAN: a computer program that scans DNA sequences for eukaryotic transcriptional elements. Comput. Appl. Biosci., 1991.

[38] Rolf Backofen, et al. Computational Molecular Biology: An Introduction. 2000.

[39] Daniel R. Dolk, et al. Model management and structured modeling: the role of an information resource dictionary system. CACM, 1988.

[40] O. Lund, et al. Protein distance constraints predicted by neural networks and probability density functions. Protein engineering, 1997.

[41] M. Billeter, et al. MOLMOL: a program for display and analysis of macromolecular structures. Journal of molecular graphics, 1996.

[42] R. A. Sayle, et al. RASMOL: biomolecular graphics for all. Trends in biochemical sciences, 1995.

[43] N. Guex, et al. SWISS-MODEL and the Swiss-Pdb Viewer: An environment for comparative protein modeling. Electrophoresis, 1997.

[44] S. Salzberg, et al. Alignment of whole genomes. Nucleic acids research, 1999.

[45] J. M. Thornton, et al. LIGPLOT: a program to generate schematic diagrams of protein-ligand interactions. Protein engineering, 1995.

[46] T. Etzold, et al. Unified access to mutation databases. Trends in genetics: TIG, 1998.

[47] Emmanuel Barillot, et al. DBcat: a catalog of 500 biological databases. Nucleic Acids Res., 2000.

[48] Peter Buneman, et al. Semistructured data. PODS, 1997.

[49] M. Frommer, et al. CpG islands in vertebrate genomes. Journal of molecular biology, 1987.

[50] R. Doolittle, et al. A simple method for displaying the hydropathic character of a protein. Journal of molecular biology, 1982.

[51] A. Krogh, et al. Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. Journal of molecular biology, 2001.

[52] Peter D. Karp, et al. A Strategy for Database Interoperation. J. Comput. Biol., 1995.

[53] G. J. Barton, et al. Evaluation and improvement of multiple sequence methods for protein secondary structure prediction. Proteins, 1999.

[54] Thomas Lengauer, et al. Protein Structure Prediction. 2004.

[55] Narayanan Eswar, et al. MODBASE, a database of annotated comparative protein structure models. Nucleic Acids Res., 2002.

[56] E. Myers, et al. Basic local alignment search tool. Journal of molecular biology, 1990.

[57] Andreas D. Baxevanis, et al. The Histone Database: a comprehensive WWW resource for histones and histone fold-containing proteins. Nucleic Acids Res., 2000.

[58] Corinna Cortes, et al. Support-Vector Networks. Machine Learning, 1995.

[59] Sean R. Eddy, et al. Profile hidden Markov models. Bioinform., 1998.

[60] R. F. Smith, et al. BCM Search Launcher--an integrated interface to molecular biology data base search and analysis services available on the World Wide Web. Genome research, 1996.

[61] D. T. Jones, et al. Protein secondary structure prediction based on position-specific scoring matrices. Journal of molecular biology, 1999.

[62] S. Karlin, et al. Prediction of complete gene structures in human genomic DNA. Journal of molecular biology, 1997.

[63] Esko Ukkonen, et al. Mining for Putative Regulatory Elements in the Yeast Genome Using Gene Expression Data. ISMB, 2000.

[64] Amos Bairoch, et al. The PROSITE database, its status in 1999. Nucleic Acids Res., 1999.

[65] Jaroslav Koca, et al. TRITON: in silico construction of protein mutants and prediction of their activities. Bioinform., 2000.

[66] Carole A. Goble, et al. An ontology for bioinformatics applications. Bioinform., 1999.

[67] David R. Gilbert, et al. Motif-based searching in TOPS protein topology databases. Bioinform., 1999.

[68] Teuvo Kohonen, et al. Self-Organization and Associative Memory, Third Edition. Springer Series in Information Sciences, 1989.

[69] B. Rost. Review: protein secondary structure prediction continues to rise. Journal of structural biology, 2001.

[70] A. Danchin, et al. Colibri: a functional data base for the Escherichia coli genome. Microbiological reviews, 1993.

[71] Peter D. Karp, et al. An Evaluation of Ontology Exchange Languages for Bioinformatics. ISMB, 2000.

[72] J. MacQueen. Some methods for classification and analysis of multivariate observations. 1967.

[73] Geoffrey J. Barton, et al. 3Dee: a database of protein structural domains. Bioinform., 2001.

[74] M. S. Waterman, et al. Identification of common molecular subsequences. Journal of molecular biology, 1981.

[75] D. Lipman, et al. Improved tools for biological sequence comparison. Proceedings of the National Academy of Sciences of the United States of America, 1988.

[76] S. Johnston, et al. ORF-FINDER: a vector for high-throughput gene identification. Gene, 2002.

[77] Dennis McLeod, et al. Database description with SDM: a semantic database model. TODS, 1981.

[78] D. Gusfield. Efficient methods for multiple sequence alignment with guaranteed error bounds. 1993.

[79] David C. Jones, et al. GenTHREADER: an efficient and reliable protein fold recognition method for genomic sequences. Journal of molecular biology, 1999.

[80] Ying Xu, et al. Inferring Gene Structures in Genomic Sequences Using Pattern Recognition and Expressed Sequence Tags. ISMB, 1997.

[81] Hideaki Sugawara, et al. DNA Data Bank of Japan (DDBJ) in collaboration with mass sequencing teams. Nucleic Acids Res., 2000.

[82] I-Min A. Chen, et al. An Overview of the Object-Protocol Model (OPM) and OPM Data Management Tools. Inf. Syst., 1995.

[83] G. J. Barton, et al. Protein structural domains: Analysis of the 3Dee domains database. Proteins, 2001.

[84] Yuhong Wang, et al. Storing biological sequence databases in relational form. Bioinform., 2000.

[85] Dan Suciu, et al. Data on the Web: From Relations to Semistructured Data and XML. 1999.

[86] M. Kanehisa, et al. DBGET/LinkDB: an integrated database retrieval system. Pacific Symposium on Biocomputing, 1998.

[87] W. R. Taylor, et al. A model recognition approach to the prediction of all-helical membrane protein structure and topology. Biochemistry, 1994.

[88] S. Bryant, et al. Threading a database of protein cores. Proteins, 1995.