Multiple domain protein diagnostic patterns

We have implemented an iterative algorithm that identifies diagnostic patterns from sets of multiple‐domain proteins, where a given domain need not be shared by all the proteins in the defining set. The algorithm was applied to sequence sets gathered by a variety of methods, including BLAST searches, common keywords, and common E.C. numbers. In all cases it yielded useful diagnostic patterns with both high sensitivity and high specificity. In several cases the patterns correlated with both functional and structural domains. Patterns generated from a large number of sequence families were analyzed for probable multiple‐domain structure.
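To make the terms sensitivity and specificity concrete, the following minimal sketch scores a candidate pattern against a labeled sequence set. It is not the algorithm described above: the regular expression is only a stand-in for a diagnostic pattern, and the function name `evaluate_pattern`, the toy sequences, and the toy pattern are all hypothetical.

```python
import re


def evaluate_pattern(pattern, members, non_members):
    """Return (sensitivity, specificity) of a regex used as a stand-in pattern.

    sensitivity = fraction of family members the pattern matches
    specificity = fraction of non-members the pattern does not match
    """
    regex = re.compile(pattern)
    true_pos = sum(1 for seq in members if regex.search(seq))
    true_neg = sum(1 for seq in non_members if not regex.search(seq))
    return true_pos / len(members), true_neg / len(non_members)


# Hypothetical toy data, not taken from the paper.
family = ["MKCAACHG", "TTCGGCHL", "CPPCHAAA", "MMMMMMMM"]
background = ["AAAAAAAA", "GGGHHHKK", "CAACAAAH", "LLCWWCHL"]
toy_pattern = r"C..CH"  # assumed motif: C, any two residues, C, H

sens, spec = evaluate_pattern(toy_pattern, family, background)
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
```

In this toy case the pattern misses one family member and matches one background sequence, so both measures fall below 1; a useful diagnostic pattern would keep both close to 1 on a realistic database.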
