Sensitive pattern discovery with 'fuzzy' alignments of distantly related proteins

MOTIVATION: Evolutionary comparison leads to efficient functional characterisation of hypothetical proteins. Here, our goal is to map specific sequence patterns to putative functional classes. The evolutionary signal stands out most clearly in a maximally diverse set of homologues. This diversity, however, creates technical difficulties: the targeted patterns, as gleaned from structure comparisons, are too sparse to yield statistically significant signals of sequence similarity or to support accurate multiple sequence alignment.

RESULTS: We address this problem with a fuzzy alignment model, which probabilistically assigns residues to structurally equivalent positions (attributes) of the proteins. We then apply multivariate analysis to the 'attributes × proteins' matrix, reducing the dimensionality of the space by non-negative matrix factorization. The method is general and fully automatic, and it requires no assumptions about pattern density or minimum support and no explicit multiple alignments or phylogenetic trees. We demonstrate the discovery of biologically meaningful patterns in an extremely diverse superfamily related to urease.
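To illustrate the dimensionality-reduction step, the following is a minimal sketch of non-negative matrix factorization applied to a small 'attributes × proteins' matrix. It uses the standard multiplicative updates for the squared-error objective; the abstract does not state which NMF variant, rank, or stopping rule was used, so those choices are assumptions. The toy matrix `V`, the function name `nmf`, and all parameter values are hypothetical and exist only for illustration.

```python
import numpy as np

def nmf(V, rank, n_iter=500, eps=1e-9, seed=0):
    """Factor a non-negative matrix V (attributes x proteins) as W @ H.

    Multiplicative updates for the squared-error objective (an assumed
    variant; the source does not specify the algorithm details).
    """
    rng = np.random.default_rng(seed)
    n_attr, n_prot = V.shape
    W = rng.random((n_attr, rank)) + eps   # attribute loadings per component
    H = rng.random((rank, n_prot)) + eps   # component weights per protein
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Hypothetical toy data: rows are attributes (structurally equivalent
# positions), columns are proteins, entries are fuzzy assignment
# probabilities in [0, 1]. These values are invented for illustration.
V = np.array([
    [0.9, 0.8, 0.0, 0.1, 0.0],
    [0.8, 0.9, 0.1, 0.0, 0.0],
    [0.7, 0.9, 0.0, 0.0, 0.1],
    [0.0, 0.1, 0.9, 0.8, 0.9],
    [0.1, 0.0, 0.8, 0.9, 0.7],
    [0.0, 0.0, 0.9, 0.7, 0.8],
])

W, H = nmf(V, rank=2)

# Relative reconstruction error of the rank-2 approximation.
rel_error = np.linalg.norm(V - W @ H) / np.linalg.norm(V)

# Dominant component for each attribute: attributes that co-occur in the
# same proteins should load on the same component.
dominant = W.argmax(axis=1)
print("relative error:", round(float(rel_error), 3))
print("dominant component per attribute:", dominant)
```

In this sketch each column of `W` can be read as a candidate pattern (a set of co-occurring attributes) and each row of `H` as the proteins that carry it. Because both factors are non-negative, the components combine additively, which is what makes the parts interpretable.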
