MetaFam: a unified classification of protein families. II. Schema and query capabilities

MOTIVATION: Protein sequence and family data is accumulating at such a rapid rate that state-of-the-art databases and interface tools are required to aid curators with their classifications. We have designed such a system, MetaFam, to facilitate the comparison and integration of public protein sequence and family data. This paper presents the global schema, integration issues, and query capabilities of MetaFam.

RESULTS: MetaFam is an integrated data warehouse of information about protein families and their sequences. This data has been collected into a consistent global schema, and stored in an Oracle relational database. The warehouse implementation allows for quick removal of outdated data sets. In addition to the relational implementation of the primary schema, we have developed several derived tables that enable efficient access from data visualization and exploration tools. Through a series of straightforward SQL queries, we demonstrate the usefulness of this data warehouse for comparing protein family classifications and for functional assignment of new sequences.
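As a rough illustration of the kind of query the abstract describes, the sketch below counts the sequences shared between families drawn from two different classification sources. It is not the MetaFam schema: the table `family_member`, its columns, the source names, and the family and sequence identifiers are all hypothetical, and an in-memory SQLite database stands in for the Oracle warehouse.

```python
import sqlite3


def shared_members(rows):
    """Count sequences shared by pairs of families from different sources.

    `rows` is an iterable of (source, family_id, seq_id) tuples. The table
    layout is a hypothetical stand-in, not the actual MetaFam schema.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE family_member (source TEXT, family_id TEXT, seq_id TEXT)"
    )
    conn.executemany("INSERT INTO family_member VALUES (?, ?, ?)", rows)
    # Self-join on the sequence identifier; a.source < b.source keeps each
    # cross-source pair once and excludes pairs within the same source.
    query = """
        SELECT a.family_id, b.family_id, COUNT(*) AS shared
        FROM family_member a
        JOIN family_member b
          ON a.seq_id = b.seq_id AND a.source < b.source
        GROUP BY a.family_id, b.family_id
        ORDER BY shared DESC, a.family_id, b.family_id
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


# Made-up membership data for two hypothetical sources.
ROWS = [
    ("SourceA", "A:fam1", "seq1"),
    ("SourceA", "A:fam1", "seq2"),
    ("SourceA", "A:fam2", "seq3"),
    ("SourceB", "B:famX", "seq1"),
    ("SourceB", "B:famX", "seq2"),
    ("SourceB", "B:famY", "seq3"),
    ("SourceB", "B:famY", "seq4"),
]

if __name__ == "__main__":
    for fam_a, fam_b, shared in shared_members(ROWS):
        print(fam_a, fam_b, shared)
```

Family pairs with a high shared count are candidates for describing the same family in both sources. A new sequence that falls into one family of such a pair could tentatively inherit annotation from the other.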
