Exploiting maximal dependence decomposition to identify conserved motifs from a group of aligned signal sequences

Bioinformatics research often requires conservation analysis of a group of sequences associated with a specific biological function (e.g. transcription factor binding sites, microRNA target sites or protein post-translational modification sites). Because conserved motifs are difficult to explore in large-scale sequence data involving various signals, a new method, MDDLogo, was developed. MDDLogo applies maximal dependence decomposition (MDD) to cluster a group of aligned signal sequences into subgroups containing statistically significant motifs. To extract motifs that reflect conserved biochemical properties of amino acids in protein sequences, the 20 amino acids are further categorized according to their physicochemical properties, e.g. hydrophobicity, charge or molecular size. MDDLogo has been demonstrated to accurately identify kinase-specific substrate motifs in 1221 human phosphorylation sites associated with seven well-known kinase families from Phospho.ELM. Moreover, on a set of plant phosphorylation data lacking kinase information, MDDLogo has been applied to help investigate the substrate motifs of potential kinases and to improve the identification of plant phosphorylation sites with various substrate specificities. In this study, MDDLogo is comparable with another well-known motif discovery tool, Motif-X.

Contact: francis@saturn.yzu.edu.tw
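The clustering idea described in the abstract can be illustrated with a minimal sketch of a single MDD-style split. This is a simplified variant written for illustration, not the MDDLogo implementation: the property grouping in `GROUPS`, the function names, and the rule of splitting on the most frequent symbol at the most dependent position are all assumptions. The sketch recodes residues by property group, scores every pair of alignment positions with a chi-square statistic, and divides the sequences at the position whose summed dependence on the other positions is largest.

```python
from collections import Counter

# Hypothetical grouping of the 20 amino acids by physicochemical property.
# This is an illustrative assumption, not the grouping used by MDDLogo.
GROUPS = {
    "h": "AVLIMC",  # hydrophobic / aliphatic
    "a": "FWY",     # aromatic
    "p": "STNQ",    # polar, uncharged
    "n": "DE",      # negatively charged
    "b": "KRH",     # positively charged
    "s": "GP",      # small / special conformation
}
AA_TO_GROUP = {aa: label for label, members in GROUPS.items() for aa in members}


def encode(seq):
    """Recode a peptide as a string of property-group labels ('x' if unknown)."""
    return "".join(AA_TO_GROUP.get(aa, "x") for aa in seq)


def chi_square(seqs, i, j):
    """Chi-square statistic for the dependence between alignment columns i and j."""
    n = len(seqs)
    row = Counter(s[i] for s in seqs)
    col = Counter(s[j] for s in seqs)
    pair = Counter((s[i], s[j]) for s in seqs)
    total = 0.0
    for a, row_count in row.items():
        for b, col_count in col.items():
            expected = row_count * col_count / n
            total += (pair[(a, b)] - expected) ** 2 / expected
    return total


def mdd_split(seqs):
    """Split aligned, equal-length sequences at the most dependent position.

    Returns (position, symbol, sequences with symbol, sequences without it).
    """
    length = len(seqs[0])
    scores = [
        sum(chi_square(seqs, i, j) for j in range(length) if j != i)
        for i in range(length)
    ]
    pos = max(range(length), key=scores.__getitem__)
    symbol = Counter(s[pos] for s in seqs).most_common(1)[0][0]
    with_symbol = [s for s in seqs if s[pos] == symbol]
    without_symbol = [s for s in seqs if s[pos] != symbol]
    return pos, symbol, with_symbol, without_symbol


if __name__ == "__main__":
    # Toy aligned peptides; real input would be windows around modification sites.
    peptides = ["RAS", "RAS", "RAS", "PAT", "PAT"]
    print(mdd_split(peptides))
    print(mdd_split([encode(p) for p in peptides]))
```

In a full procedure the split would be applied recursively to each subgroup until no position pair shows significant dependence or a subgroup becomes too small, and a sequence logo would then be drawn for each resulting subgroup.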
