Extraction of Protein Domains and Signatures through Unsupervised Statistical Sequence Segmentation

We present a novel information-theoretic method for extracting protein domains and statistical signatures. We apply a new algorithm [20] for unsupervised segmentation of sequences into alternating Variable Memory Markov sources to families of protein sequences. The algorithm is based on competitive learning between Markov models, implemented as Prediction Suffix Trees [18], which we have shown in earlier work to model protein families well [6, 7]. By clustering the statistical models themselves, using rate distortion theory combined with deterministic annealing, we obtain a hierarchical segmentation of sequences between alternating Markov sources that appears to avoid over-segmentation automatically. This paper demonstrates the potential of the method for protein sequence analysis. We analyze in detail the segmentations obtained for several diverse protein families and demonstrate automatic differentiation between known domains within the sequences. The method also has the potential to pinpoint exact domain boundaries, or to subdivide a family into biological subfamilies, where other computational tools do not.
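To make the idea of sequences switching between variable-memory Markov sources concrete, the following is a minimal toy sketch and not the algorithm of the paper. It assumes the models are already trained on separate example sets (the paper learns them without supervision), and it replaces the rate-distortion clustering and deterministic annealing with a hard, windowed winner-take-all assignment. The names `VariableMemoryModel`, `segment`, `max_depth`, `min_count`, `pseudocount`, and `window` are hypothetical and chosen only for illustration.

```python
import math
from collections import defaultdict

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


class VariableMemoryModel:
    """Toy variable-memory Markov model (hypothetical stand-in for a PST)."""

    def __init__(self, max_depth=2, min_count=2, pseudocount=0.5):
        self.max_depth = max_depth
        self.min_count = min_count
        self.pseudocount = pseudocount
        # counts[context][symbol] = number of times symbol followed context
        self.counts = defaultdict(lambda: defaultdict(float))

    def fit(self, sequences):
        for seq in sequences:
            for i, sym in enumerate(seq):
                # Record the symbol under every context of length 0..max_depth.
                for d in range(min(self.max_depth, i) + 1):
                    self.counts[seq[i - d:i]][sym] += 1.0
        return self

    def _context(self, history):
        # Longest suffix of the history observed at least min_count times.
        for d in range(min(self.max_depth, len(history)), 0, -1):
            ctx = history[-d:]
            if ctx in self.counts and sum(self.counts[ctx].values()) >= self.min_count:
                return ctx
        return ""  # fall back to the memoryless (order-0) distribution

    def log_prob(self, sym, history):
        table = self.counts[self._context(history)]
        total = sum(table.values())
        p = (table.get(sym, 0.0) + self.pseudocount) / (
            total + self.pseudocount * len(ALPHABET)
        )
        return math.log(p)


def segment(seq, models, window=5):
    """Label each position with the model that best predicts its neighbourhood."""
    scores = [[m.log_prob(seq[i], seq[:i]) for i in range(len(seq))] for m in models]
    labels = []
    for i in range(len(seq)):
        lo, hi = max(0, i - window), min(len(seq), i + window + 1)
        sums = [sum(s[lo:hi]) for s in scores]
        labels.append(max(range(len(models)), key=lambda k: sums[k]))
    return labels


# Two artificial "sources" with different statistics (illustrative data only).
model_a = VariableMemoryModel().fit(["ACACACACACAC"] * 3)
model_b = VariableMemoryModel().fit(["GTGTGTGTGTGT"] * 3)
labels = segment("ACACACACACGTGTGTGTGT", [model_a, model_b])
print("".join(str(k) for k in labels))
```

In this sketch the label changes where the local statistics change, which is the intuition behind reading segment boundaries as candidate domain boundaries. The soft, annealed assignments described in the abstract would replace the hard `max` used here.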

[1] J. Roca, et al. The mechanisms of DNA topoisomerases, 1995, Trends in biochemical sciences.

[2] Naftali Tishby, et al. Unsupervised Sequence Segmentation by a Mixture of Switching Variable Memory Markov Sources, 2001, ICML.

[3] Golan Yona, et al. Variations on probabilistic suffix trees: statistical modeling and prediction of protein families, 2001, Bioinform.

[4] Terri K. Attwood, et al. PRINTS-S: the database formerly known as PRINTS, 2000, Nucleic Acids Res.

[5] P. Bork, et al. Protein sequence motifs, 1996, Current opinion in structural biology.

[6] D. Eisenberg, et al. Detecting protein function and protein-protein interactions from genome sequences, 1999, Science.

[7] Peer Bork, et al. Mobile modules and motifs, 1992, Current Biology.

[8] K. Rose. Deterministic annealing for clustering, compression, classification, regression, and related optimization problems, 1998, Proc. IEEE.

[9] Jorma Rissanen, et al. The Minimum Description Length Principle in Coding and Modeling, 1998, IEEE Trans. Inf. Theory.

[10] Golan Yona, et al. Modeling protein families using probabilistic suffix trees, 1999, RECOMB.

[11] J. A. Epstein, et al. Crystal structure of the human Pax6 paired domain-DNA complex reveals specific roles for the linker region and carboxy-terminal subdomain in DNA binding, 1999, Genes & development.

[12] Amos Bairoch, et al. The PROSITE database, its status in 1999, 1999, Nucleic Acids Res.

[13] Dana Ron, et al. The power of amnesia: Learning probabilistic automata with variable memory length, 1996, Machine Learning.

[14] Amos Bairoch, et al. The PROSITE database, 2005, Nucleic Acids Res.

[15] Alberto Apostolico, et al. Optimal amnesic probabilistic automata or how to learn and classify proteins in linear time and space, 2000, RECOMB '00.

[16] Amos Bairoch, et al. The PROSITE database, its status in 1997, 1997, Nucleic Acids Res.

[17] J. Hayes, et al. The glutathione S-transferase supergene family: regulation of GST and the contribution of the isoenzymes to cancer chemoprotection and drug resistance, 1995, Critical reviews in biochemistry and molecular biology.

[18] E. T. Stuart, et al. Mammalian Pax genes, 1994, Annual review of genetics.

[19] Anton J. Enright, et al. Protein interaction maps for complete genomes based on gene fusion events, 1999, Nature.

[20] Robert D. Finn, et al. The Pfam protein families database, 2004, Nucleic Acids Res.

[21] Alex Bateman, et al. The InterPro database, an integrated documentation resource for protein families, domains and functional sites, 2001, Nucleic Acids Res.

[22] Yoram Singer, et al. The Hierarchical Hidden Markov Model: Analysis and Applications, 1998, Machine Learning.

[23] David L. Eaton, et al. Glutathione S-transferases: Amino acid sequence comparison, classification and phylogenetic relationship, 1992.

[24] Shmuel Pietrokovski, et al. Increased coverage of protein families with the Blocks Database servers, 2000, Nucleic Acids Res.