Discovering Patterns and Subfamilies in Biosequences

We consider the problem of automatic discovery of patterns and the corresponding subfamilies in a set of biosequences. The sequences are unaligned and may contain noise of unknown level. The patterns are of the type used in PROSITE database. In our approach we discover patterns and the respective subfamilies simultaneously. We develop a theoretically substantiated significance measure for a set of such patterns and an algorithm approximating the best pattern set and the subfamilies. The approach is based on the minimum description length (MDL) principle. We report a computing experiment correctly finding subfamilies in the family of chromo domains and revealing new strong patterns.

[1]  Vasek Chvátal,et al.  A Greedy Heuristic for the Set-Covering Problem , 1979, Math. Oper. Res..

[2]  Ayumi Shinohara,et al.  BONSAI Garden: Parallel Knowledge Discovery System for Amino Acid Sequences , 1995, ISMB.

[3]  Douglas L. Brutlag,et al.  Identification of Protein Motifs Using Conserved Amino Acid Properties and Partitioning Techniques , 1995, ISMB.

[4]  Charles Elkan,et al.  Unsupervised learning of multiple motifs in biopolymers using expectation maximization , 1995, Mach. Learn..

[5]  R. F. Smith,et al.  Automatic generation of primary sequence patterns from sets of related protein sequences. , 1990, Proceedings of the National Academy of Sciences of the United States of America.

[6]  Heikki Mannila,et al.  MDL learning of unions of simple pattern languages from positive examples , 1995, EuroCOLT.

[7]  Desmond G. Higgins,et al.  Scoring Function for Pattern Discovery Programs Taking Into Account Sequence Diversity , 1996 .

[8]  J. Rissanen,et al.  Modeling By Shortest Data Description* , 1978, Autom..

[9]  M J Sternberg,et al.  Identification of sequence motifs from a set of proteins with related function. , 1994, Protein engineering.

[10]  A. Stewart,et al.  The chromo shadow domain, a second chromo domain in heterochromatin-binding protein 1, HP1. , 1995, Nucleic acids research.

[11]  J. Thompson,et al.  CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. , 1994, Nucleic acids research.

[12]  M. Nei,et al.  The neighbor-joining method , 1987 .

[13]  Jun S. Liu,et al.  Gibbs motif sampling: Detection of bacterial outer membrane protein repeats , 1995, Protein science : a publication of the Protein Society.

[14]  A. D. McLachlan,et al.  Profile analysis: detection of distantly related proteins. , 1987, Proceedings of the National Academy of Sciences of the United States of America.

[15]  E. Myers,et al.  Basic local alignment search tool. , 1990, Journal of molecular biology.

[16]  D. Higgins,et al.  Finding flexible patterns in unaligned protein sequences , 1995, Protein science : a publication of the Protein Society.

[17]  N. Saitou,et al.  The neighbor-joining method: a new method for reconstructing phylogenetic trees. , 1987, Molecular biology and evolution.

[18]  D. Haussler,et al.  Hidden Markov models in computational biology. Applications to protein modeling. , 1993, Journal of molecular biology.

[19]  A. F. Neuwald,et al.  Detecting patterns in protein sequences. , 1994, Journal of molecular biology.

[20]  David R. Gilbert,et al.  Approaches to the Automatic Discovery of Patterns in Biosequences , 1998, J. Comput. Biol..

[21]  Thure Etzold,et al.  SRS - an indexing and retrieval tool for flat file data libraries , 1993, Comput. Appl. Biosci..

[22]  Paul M. B. Vitányi,et al.  An Introduction to Kolmogorov Complexity and Its Applications , 1993, Graduate Texts in Computer Science.

[23]  Inge Jonassen,et al.  Efficient discovery of conserved patterns using a pattern graph , 1997, Comput. Appl. Biosci..

[24]  Ming Li,et al.  An Introduction to Kolmogorov Complexity and Its Applications , 2019, Texts in Computer Science.

[25]  A. Bairoch PROSITE: a dictionary of sites and patterns in proteins. , 1991, Nucleic acids research.

[26]  Hamilton O. Smith,et al.  Finding sequence motifs in groups of functionally related proteins. , 1990, Proceedings of the National Academy of Sciences of the United States of America.