Identification of Protein Motifs Using Conserved Amino Acid Properties and Partitioning Techniques

Analyzing a set of protein sequences involves a fundamental relationship between the coherency of the set and the specificity of the motif that describes it. Motifs may be obscured by training sets that contain incoherent sequences, in part due to protein subclasses, contamination, or errors. We develop an algorithm for motif identification that systematically explores possible patterns of coherency within a set of protein sequences. Our algorithm constructs alternative partitions of the training set data, where one subset of each partition is presumed to contain coherent data and is used for forming a motif. The motif is represented by multiple overlapping amino acid groups based on evolutionary, biochemical, or physical properties. We demonstrate our method on a training set of reverse transcriptases that contains subclasses, sequence errors, misalignments, and contaminating sequences. Despite these complications, our program identifies a novel motif for the subclass of retroviral and retrovirus-related reverse transcriptases. This motif has a much higher specificity than previously reported motifs and suggests the importance of conserved hydrophilic and hydrophobic residues in the structure of reverse transcriptases.

[1]  Douglas L. Brutlag,et al.  Rapid searches for complex patterns in biological molecules , 1984, Nucleic Acids Res..

[2]  H. Scheraga,et al.  Statistical analysis of the physical properties of the 20 naturally occurring amino acids , 1985 .

[3]  W. Taylor,et al.  The classification of amino acid conservation. , 1986, Journal of theoretical biology.

[4]  M. Gribskov,et al.  [9] Profile analysis , 1990 .

[5]  E. Myers,et al.  Basic local alignment search tool. , 1990, Journal of molecular biology.

[6]  Hamilton O. Smith,et al.  Finding sequence motifs in groups of functionally related proteins. , 1990, Proceedings of the National Academy of Sciences of the United States of America.

[7]  Thomas D. Wu Efficient Diagnosis of Multiple Disorders Based on a Symptom Clustering Approach , 1990, AAAI.

[8]  R. F. Smith,et al.  Automatic generation of primary sequence patterns from sets of related protein sequences. , 1990, Proceedings of the National Academy of Sciences of the United States of America.

[9]  A. Bairoch PROSITE: a dictionary of sites and patterns in proteins. , 1991, Nucleic acids research.

[10]  S. Henikoff,et al.  Automated assembly of protein blocks for database searching. , 1991, Nucleic acids research.

[11]  William R. Taylor,et al.  The rapid generation of mutation data matrices from protein sequences , 1992, Comput. Appl. Biosci..

[12]  M. Gribskov,et al.  Profile Analysis , 1970 .

[13]  M J Sternberg,et al.  Identification of sequence motifs from a set of proteins with related function. , 1994, Protein engineering.

[14]  S. Altschul,et al.  Detection of conserved segments in proteins: iterative scanning of sequence databases with alignment blocks. , 1994, Proceedings of the National Academy of Sciences of the United States of America.