Informative Motifs in Protein Family Alignments

Consensus and sequence-pattern analyses of family alignments are widely used to identify new family members and to determine functionally and structurally important residue identities. Because these common approaches emphasize the dominant characteristics of the family and assume that residue identities are independent at each position, they cannot describe residue preferences that fall outside the family consensus. In this study, we propose a novel information-theoretic approach to detecting motifs outside the consensus of a protein family alignment. Inspired by the classic Apriori algorithm, we implemented an algorithm that discovers frequent residue motifs that are high in information content and lie outside the family consensus; we call these informative motifs. We observed that informative motifs are mostly spatially localized and capture distinctive features of various members of the family.
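As a rough illustration of the idea, the following is a minimal sketch of an Apriori-style search for co-occurring non-consensus residues, not the implementation used in this study. It treats each (column, residue) pair that differs from the column consensus as an item and grows motifs level by level, keeping only those that meet a minimum support. The function names, the `min_support` and `max_size` parameters, and the scoring function `co_occurrence_bits` (the log2 ratio of the observed joint frequency to the frequency expected if positions were independent) are all assumptions made for this example; the abstract does not specify how information content is computed.

```python
from collections import Counter
from itertools import combinations
from math import log2


def consensus(alignment):
    """Most common residue in each column (ties go to the residue seen first)."""
    return [Counter(col).most_common(1)[0][0] for col in zip(*alignment)]


def non_consensus_items(alignment):
    """Map each sequence to its set of (column, residue) items that differ
    from the consensus. Gap characters ('-') are ignored (an assumption)."""
    cons = consensus(alignment)
    return [
        frozenset((i, r) for i, r in enumerate(seq) if r != cons[i] and r != "-")
        for seq in alignment
    ]


def frequent_motifs(alignment, min_support=2, max_size=3):
    """Apriori-style level-wise search: a motif of size k is only considered
    if all of its size k-1 subsets were frequent. Returns {motif: support}."""
    rows = non_consensus_items(alignment)
    counts = Counter(item for row in rows for item in row)
    level = {frozenset([it]): c for it, c in counts.items() if c >= min_support}
    result = dict(level)
    size = 1
    while level and size < max_size:
        size += 1
        items = sorted(set().union(*level))
        candidates = [
            frozenset(c)
            for c in combinations(items, size)
            if all(frozenset(s) in level for s in combinations(c, size - 1))
        ]
        level = {}
        for cand in candidates:
            support = sum(1 for row in rows if cand <= row)
            if support >= min_support:
                level[cand] = support
        result.update(level)
    return result


def co_occurrence_bits(motif, support, motifs, n_seqs):
    """Hypothetical score: log2(observed joint frequency / frequency expected
    under independence of the individual items)."""
    expected = 1.0
    for item in motif:
        expected *= motifs[frozenset([item])] / n_seqs
    return log2((support / n_seqs) / expected)
```

Calling `frequent_motifs` on a list of equal-length aligned strings returns a dictionary from motifs to their support, and `co_occurrence_bits` can then rank motifs by how much more often their residues co-occur than the independence assumption would predict. The anti-monotone pruning step (checking that every smaller subset is frequent) is the part borrowed from Apriori.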
