Blocks+: a non-redundant database of protein alignment blocks derived from multiple compilations

MOTIVATION: As databanks grow, sequence classification and prediction of function by searching protein family databases become increasingly valuable. The original Blocks Database, which contains ungapped multiple alignments for families documented in PROSITE, can be searched to classify new sequences. However, PROSITE is incomplete, and families from other databases are now available to expand coverage of the Blocks Database.

RESULTS: To take advantage of protein family information present in several existing compilations, we have used five databases to construct Blocks+, a unified database that is built on the PROTOMAT/BLOSUM scoring model and that can be searched using a single algorithm for consistent sequence classification. The LAMA blocks-versus-blocks searching program identifies overlapping protein families, which makes a non-redundant hierarchical compilation possible. Blocks+ consists of all blocks derived from PROSITE, blocks from Prints not present in PROSITE, blocks from Pfam-A not present in PROSITE or Prints, and so on for ProDom and Domo, for a total of 1995 protein families represented by 8909 blocks, doubling the coverage of the original Blocks Database. A challenge for any procedure aimed at non-redundancy is to retain families that are related but distinct while discarding those that are duplicates. We illustrate how using multiple compilations can minimize this potential problem by examining the SNF2 family of ATPases, which is detectably similar to distinct families of helicases and ATPases.

AVAILABILITY: http://blocks.fhcrc.org/
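The hierarchical compilation described above (everything from PROSITE, then families from each later source only if they do not overlap a family from an earlier source) can be illustrated with a minimal sketch. It is not the authors' implementation: the `Family` representation, the function `merge_hierarchically`, and the toy predicate `shares_member` are all hypothetical. In particular, `shares_member` merely checks for shared sequence identifiers and stands in for the LAMA blocks-versus-blocks comparison, which works differently.

```python
from typing import Callable, Dict, FrozenSet, List, Sequence, Tuple

# Hypothetical stand-in for a protein family: (name, member sequence IDs).
# This is an illustration only, not the Blocks+ data format.
Family = Tuple[str, FrozenSet[str]]

# Priority order of the source compilations, as listed in the abstract.
SOURCE_ORDER = ("PROSITE", "Prints", "Pfam-A", "ProDom", "Domo")


def shares_member(a: Family, b: Family) -> bool:
    """Toy overlap test (assumption): two families overlap if they share
    at least one sequence ID. The real procedure uses LAMA instead."""
    return bool(a[1] & b[1])


def merge_hierarchically(
    families_by_source: Dict[str, List[Family]],
    overlaps: Callable[[Family, Family], bool] = shares_member,
    order: Sequence[str] = SOURCE_ORDER,
) -> List[Tuple[str, str]]:
    """Keep every family from the first source, then add families from each
    later source only if they overlap nothing kept from earlier sources."""
    kept: List[Tuple[str, Family]] = []
    for source in order:
        # Snapshot: compare only against families from earlier sources,
        # not against families of the source currently being processed.
        earlier = [family for _, family in kept]
        for family in families_by_source.get(source, []):
            if not any(overlaps(family, prev) for prev in earlier):
                kept.append((source, family))
    return [(source, family[0]) for source, family in kept]
```

Passing the overlap test in as a parameter keeps the priority logic separate from the similarity criterion, so a stricter or looser definition of "duplicate" can be tried without touching the merge loop.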
