A Computational Biology Database Digest: Data, Data Analysis, and Data Management

Computational Biology, or Bioinformatics, has been defined as the application of mathematical and Computer Science methods to problems in Molecular Biology that require large-scale data, computation, and analysis [26]. Molecular Biology databases therefore play an essential role in Computational Biology research and development. This paper introduces current Molecular Biology databases, with emphasis on data modeling, data acquisition, data retrieval, and the integration of Molecular Biology data from different sources. It is primarily intended for computer scientists with a limited background in Biology.

[1]  R. Gibbs,et al.  PipMaker--a web server for aligning two genomic DNA sequences. , 2000, Genome research.

[2]  D Gusfield,et al.  Efficient methods for multiple sequence alignment with guaranteed error bounds , 1993, Bulletin of mathematical biology.

[3]  Hideaki Sugawara,et al.  DNA Data Bank of Japan (DDBJ) in collaboration with mass sequencing teams , 2000, Nucleic Acids Res..

[4]  I-Min A Chen,et al.  An Overview of the Object-Protocol Model (OPM) and OPM Data Management Tools , 1995, Inf. Syst..

[5]  G J Barton,et al.  Protein structural domains: Analysis of the 3Dee domains database , 2001, Proteins.

[6]  David R. Gilbert,et al.  Motif-based searching in TOPS protein topology databases , 1999, Bioinform..

[7]  Carole A. Goble,et al.  TAMBIS: Transparent Access to Multiple Bioinformatics Information Sources , 1998, ISMB.

[8]  H. V. Jagadish,et al.  Data Integration using Self-Maintainable Views , 1996, EDBT.

[9]  S. Boag,et al.  XQuery 1.0 : An XML query language, W3C Working Draft 12 November 2003 , 2003 .

[10]  J. Thompson,et al.  CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. , 1994, Nucleic acids research.

[11]  Michael Gribskov,et al.  Combining evidence using p-values: application to sequence homology searches , 1998, Bioinform..

[12]  A. Krogh,et al.  Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. , 2001, Journal of molecular biology.

[13]  Amos Bairoch,et al.  The PROSITE database, its status in 2002 , 2002, Nucleic Acids Res..

[14]  Yuhong Wang,et al.  Storing biological sequence databases in relational form , 2000, Bioinform..

[15]  J. Jurka,et al.  Multiple aligned sequence editor (MASE). , 1988, Trends in biochemical sciences.

[16]  Dan S. Prestridge,et al.  SIGNAL SCAN: a computer program that scans DNA sequences for eukaryotic transcriptional elements , 1991, Comput. Appl. Biosci..

[17]  Amos Bairoch,et al.  The PROSITE database, its status in 1999 , 1999, Nucleic Acids Res..

[18]  Andreas D. Baxevanis,et al.  The Histone Database: a comprehensive WWW resource for histones and histone fold-containing proteins , 2000, Nucleic Acids Res..

[19]  Simon Levin  Computational Molecular Biology: An Introduction , 2000 .

[20]  D. Lipman,et al.  Improved tools for biological sequence comparison. , 1988, Proceedings of the National Academy of Sciences of the United States of America.

[21]  S. Johnston,et al.  ORF-FINDER: a vector for high-throughput gene identification. , 2002, Gene.

[22]  Narayanan Eswar,et al.  MODBASE, a database of annotated comparative protein structure models , 2002, Nucleic Acids Res..

[23]  Dan Suciu,et al.  Data on the Web: From Relations to Semistructured Data and XML , 1999 .

[24]  M. Billeter,et al.  MOLMOL: a program for display and analysis of macromolecular structures. , 1996, Journal of molecular graphics.

[25]  E. Myers,et al.  Basic local alignment search tool. , 1990, Journal of molecular biology.

[26]  R A Sayle,et al.  RASMOL: biomolecular graphics for all. , 1995, Trends in biochemical sciences.

[27]  Amos Bairoch,et al.  The PROSITE database, its status in 1997 , 1997, Nucleic Acids Res..

[28]  M. Kanehisa,et al.  DBGET/LinkDB: an integrated database retrieval system. , 1998, Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing.

[29]  R. Doolittle,et al.  A simple method for displaying the hydropathic character of a protein. , 1982, Journal of molecular biology.

[30]  Teuvo Kohonen,et al.  Self-Organization and Associative Memory , 1988 .

[31]  W R Taylor,et al.  A model recognition approach to the prediction of all-helical membrane protein structure and topology. , 1994, Biochemistry.

[32]  Peter D. Karp,et al.  A Strategy for Database Interoperation , 1995, J. Comput. Biol..

[33]  G J Barton,et al.  Evaluation and improvement of multiple sequence methods for protein secondary structure prediction , 1999, Proteins.

[34]  S. Needleman,et al.  A general method applicable to the search for similarities in the amino acid sequence of two proteins. , 1970, Journal of molecular biology.

[35]  Peter B. McGarvey,et al.  Protein Information Resource: a community resource for expert annotation of protein data , 2001, Nucleic Acids Res..

[36]  Jaroslav Koca,et al.  TRITON: in silico construction of protein mutants and prediction of their activities , 2000, Bioinform..

[37]  Carole A. Goble,et al.  An ontology for bioinformatics applications , 1999, Bioinform..

[38]  Dennis McLeod,et al.  Database description with SDM: a semantic database model , 1981, TODS.

[39]  Daniel R. Dolk,et al.  Model management and structured modeling: the role of an information resource dictionary system , 1988, CACM.

[40]  Corinna Cortes,et al.  Support-Vector Networks , 1995, Machine Learning.

[41]  T L Blundell,et al.  FUGUE: sequence-structure homology recognition using environment-specific substitution tables and structure-dependent gap penalties. , 2001, Journal of molecular biology.

[42]  Sean R. Eddy,et al.  Profile hidden Markov models , 1998, Bioinform..

[43]  A. Sali,et al.  Comparative protein structure modeling of genes and genomes. , 2000, Annual review of biophysics and biomolecular structure.

[44]  R. F. Smith,et al.  BCM Search Launcher--an integrated interface to molecular biology data base search and analysis services available on the World Wide Web. , 1996, Genome research.

[45]  M C Peitsch,et al.  ProMod and Swiss-Model: Internet-based tools for automated comparative protein modelling. , 1996, Biochemical Society transactions.

[46]  Laks V. S. Lakshmanan,et al.  SchemaSQL - A Language for Interoperability in Relational Multi-Database Systems , 1996, VLDB.

[47]  Jean Thierry-Mieg,et al.  The ACEDB genome database , 1994 .

[48]  Gunter Saake,et al.  Adding Conflict Resolution Features to a Query Language for Database Federations , 2000, Australas. J. Inf. Syst..

[49]  B. Rost  Review: protein secondary structure prediction continues to rise. , 2001, Journal of structural biology.

[50]  J. Thompson,et al.  Multiple sequence alignment with Clustal X. , 1998, Trends in biochemical sciences.

[51]  Dmitrij Frishman,et al.  MIPS: a database for genomes and protein sequences , 1999, Nucleic Acids Res..

[52]  M S Waterman,et al.  Identification of common molecular subsequences. , 1981, Journal of molecular biology.

[53]  Shalom Tsur  Data Mining in the Bioinformatics Domain , 2000, VLDB.

[54]  S. Wodak,et al.  Representing and Analysing Molecular and Cellular Function Using the Computer , 2000, Biological chemistry.

[55]  Anil K. Jain,et al.  Algorithms for Clustering Data , 1988 .

[56]  Teuvo Kohonen,et al.  Self-organization and associative memory: 3rd edition , 1989 .

[57]  J M Thornton,et al.  LIGPLOT: a program to generate schematic diagrams of protein-ligand interactions. , 1995, Protein engineering.

[58]  Evelyn Camon,et al.  The EMBL Nucleotide Sequence Database , 2000, Nucleic Acids Res..

[59]  P. Argos,et al.  SRS: information retrieval system for molecular biology data banks. , 1996, Methods in enzymology.

[60]  Limsoon Wong,et al.  BioKleisli: a digital library for biomedical researchers , 1997, International Journal on Digital Libraries.

[61]  A Danchin,et al.  Colibri: a functional data base for the Escherichia coli genome. , 1993, Microbiological reviews.

[62]  Lewis Y. Geer,et al.  Cn3D: sequence and structure views for Entrez. , 2000, Trends in biochemical sciences.

[63]  S. Bryant,et al.  Threading a database of protein cores , 1995, Proteins.

[64]  David C. Jones,et al.  GenTHREADER: an efficient and reliable protein fold recognition method for genomic sequences. , 1999, Journal of molecular biology.

[65]  Ying Xu,et al.  Inferring Gene Structures in Genomic Sequences Using Pattern Recognition and Expressed Sequence Tags , 1997, ISMB.

[66]  Peter D. Karp,et al.  An Evaluation of Ontology Exchange Languages for Bioinformatics , 2000, ISMB.

[67]  J. MacQueen  Some methods for classification and analysis of multivariate observations , 1967 .

[68]  Nicolás Marín,et al.  Review of Data on the Web: from relational to semistructured data and XML by Serge Abiteboul, Peter Buneman, and Dan Suciu. Morgan Kaufmann 1999. , 2003, SIGMOD Rec..

[69]  Geoffrey J. Barton,et al.  3Dee: a database of protein structural domains , 2001, Bioinform..

[70]  François Bancilhon,et al.  Building an Object-Oriented Database System, The Story of O2 , 1992 .

[71]  O. Lund,et al.  Protein distance constraints predicted by neural networks and probability density functions. , 1997, Protein engineering.

[72]  D. Eisenberg,et al.  Correlation of sequence hydrophobicities measures similarity in three-dimensional protein structure. , 1983, Journal of molecular biology.

[73]  Charles Elkan,et al.  Fitting a Mixture Model By Expectation Maximization To Discover Motifs In Biopolymers , 1994, ISMB.

[74]  Guy Perrière,et al.  EMGLib: the Enhanced Microbial Genomes Library (update 2000) , 2000, Nucleic Acids Res..

[75]  C. Sander,et al.  Protein structure comparison by alignment of distance matrices. , 1993, Journal of molecular biology.

[76]  Stanley Letovsky,et al.  GDB: the Human Genome Database , 1998, Nucleic Acids Res..

[77]  B. Rost,et al.  Prediction of protein secondary structure at better than 70% accuracy. , 1993, Journal of molecular biology.

[78]  Shahrokh Saeednia,et al.  How to maintain both privacy and authentication in digital libraries , 2000 .

[79]  P. Pevzner,et al.  Gene recognition via spliced sequence alignment. , 1996, Proceedings of the National Academy of Sciences of the United States of America.

[80]  Stefano Spaccapietra,et al.  Model independent assertions for integration of heterogeneous schemas , 1992, The VLDB Journal.

[81]  R Langridge,et al.  Improvements in protein secondary structure prediction by an enhanced neural network. , 1990, Journal of molecular biology.

[82]  V M Markowitz,et al.  Facilities for exploring molecular biology databases on the Web: a comparative study. , 1997, Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing.

[83]  T Etzold,et al.  Unified access to mutation databases. , 1998, Trends in genetics : TIG.

[84]  T. Jenssen,et al.  A literature network of human genes for high-throughput analysis of gene expression , 2001, Nature Genetics.

[85]  Emmanuel Barillot,et al.  DBcat: a catalog of 500 biological databases , 2000, Nucleic Acids Res..

[86]  Peter Buneman,et al.  Semistructured data , 1997, PODS.

[87]  M. Frommer,et al.  CpG islands in vertebrate genomes. , 1987, Journal of molecular biology.

[88]  N. Guex,et al.  SWISS‐MODEL and the Swiss‐Pdb Viewer: An environment for comparative protein modeling , 1997, Electrophoresis.

[89]  S. Salzberg,et al.  Alignment of whole genomes. , 1999, Nucleic acids research.

[90]  D T Jones,et al.  Protein secondary structure prediction based on position-specific scoring matrices. , 1999, Journal of molecular biology.

[91]  S. Karlin,et al.  Prediction of complete gene structures in human genomic DNA. , 1997, Journal of molecular biology.

[92]  N. J. I. Mars  Towards Very Large Knowledge Bases , 1995 .

[93]  Esko Ukkonen,et al.  Mining for Putative Regulatory Elements in the Yeast Genome Using Gene Expression Data , 2000, ISMB.

[94]  Steven J. DeRose,et al.  XML Path Language (XPath) Version 1.0 , 1999 .

[95]  G von Heijne,et al.  Membrane protein structure prediction. Hydrophobicity analysis and the positive-inside rule. , 1992, Journal of molecular biology.

[96]  D G George,et al.  A standardized format for sequence data exchange. , 1987, Protein sequences & data analysis.