DWARF – a data warehouse system for analyzing protein families

Background: The emerging field of integrative bioinformatics provides tools to organize and systematically analyze vast amounts of highly diverse biological data, making it possible to gain a novel understanding of complex biological systems. The data warehouse DWARF applies integrative bioinformatics approaches to the analysis of large protein families.

Description: The data warehouse system DWARF integrates data on sequence, structure, and functional annotation for protein fold families. The underlying relational data model consists of three major sections representing entities related to the protein (biochemical function, source organism, classification into homologous families and superfamilies), the protein sequence (position-specific annotation, mutant information), and the protein structure (secondary structure information, superimposed tertiary structure). Tools for extracting, transforming, and loading data from publicly available resources (ExPDB, GenBank, DSSP) are provided to populate the database. The data can be accessed through an interface for searching and browsing, and through analysis tools that operate on annotation, sequence, or structure. We applied DWARF to the family of α/β-hydrolases to host the Lipase Engineering Database. Release 2.3 contains 6138 sequences and 167 experimentally determined protein structures, which are assigned to 37 superfamilies and 103 homologous families.

Conclusion: DWARF has been designed for constructing databases of large, structurally related protein families and for evaluating their sequence-structure-function relationships through a systematic analysis of sequence, structure, and functional annotation. It has been applied to predict biochemical properties from sequence and serves as a valuable tool for protein engineering.
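To make the three-part relational model more concrete, the following is a minimal sketch of how protein, sequence, and structure entities might be linked and queried. It is not DWARF's actual schema: every table name, column name, and row below is a hypothetical placeholder chosen for illustration, and SQLite is used only because it ships with Python.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative assumptions,
# not the actual DWARF data model.
SCHEMA = """
CREATE TABLE superfamily (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE homologous_family (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    superfamily_id INTEGER NOT NULL REFERENCES superfamily(id)
);
-- Protein section: function, source organism, family classification.
CREATE TABLE protein (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    source_organism TEXT,
    biochemical_function TEXT,
    family_id INTEGER NOT NULL REFERENCES homologous_family(id)
);
-- Sequence section: residues plus position-specific annotation.
CREATE TABLE sequence (
    id INTEGER PRIMARY KEY,
    protein_id INTEGER NOT NULL REFERENCES protein(id),
    residues TEXT NOT NULL
);
CREATE TABLE position_annotation (
    sequence_id INTEGER NOT NULL REFERENCES sequence(id),
    position INTEGER NOT NULL,
    annotation TEXT NOT NULL
);
-- Structure section: one row per experimentally determined structure.
CREATE TABLE structure (
    id INTEGER PRIMARY KEY,
    sequence_id INTEGER NOT NULL REFERENCES sequence(id),
    structure_code TEXT NOT NULL,
    secondary_structure TEXT
);
"""


def build_demo_db():
    """Create an in-memory database filled with placeholder rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO superfamily VALUES (1, 'example superfamily')")
    conn.executemany(
        "INSERT INTO homologous_family VALUES (?, ?, ?)",
        [(1, "family A", 1), (2, "family B", 1)],
    )
    conn.executemany(
        "INSERT INTO protein VALUES (?, ?, ?, ?, ?)",
        [
            (1, "protein 1", "organism X", "placeholder function", 1),
            (2, "protein 2", "organism Y", "placeholder function", 2),
        ],
    )
    conn.executemany(
        "INSERT INTO sequence VALUES (?, ?, ?)",
        [(1, 1, "ACDEFGH"), (2, 2, "ACDKFGH")],
    )
    conn.execute("INSERT INTO position_annotation VALUES (1, 3, 'placeholder site')")
    conn.execute("INSERT INTO structure VALUES (1, 1, 'XXXX', 'CCEEHHC')")
    return conn


def count_by_superfamily(conn):
    """Return (superfamily name, sequence count, structure count) tuples."""
    query = """
        SELECT sf.name, COUNT(DISTINCT s.id), COUNT(DISTINCT st.id)
        FROM superfamily AS sf
        JOIN homologous_family AS hf ON hf.superfamily_id = sf.id
        JOIN protein AS p ON p.family_id = hf.id
        LEFT JOIN sequence AS s ON s.protein_id = p.id
        LEFT JOIN structure AS st ON st.sequence_id = s.id
        GROUP BY sf.id
        ORDER BY sf.name
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    for row in count_by_superfamily(build_demo_db()):
        print(row)
```

A query of this shape is the kind of summary that could yield per-superfamily sequence and structure counts for a release; the `LEFT JOIN`s keep sequences that have no solved structure in the tally.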
